MyArxiv
Computer Vision and Pattern Recognition
Moment-Based 3D Gaussian Splatting: Resolving Volumetric Occlusion with Order-Independent Transmittance
The recent success of 3D Gaussian Splatting (3DGS) has reshaped novel view synthesis by enabling fast optimization and real-time rendering of high-quality radiance fields. However, it relies on simplified, order-dependent alpha blending and coarse approximations of the density integral within the rasterizer, thereby limiting its ability to render complex, overlapping semi-transparent objects. In this paper, we extend rasterization-based rendering of 3D Gaussian representations with a novel method for high-fidelity transmittance computation, entirely avoiding the need for ray tracing or per-pixel sample sorting. Building on prior work in moment-based order-independent transparency, our key idea is to characterize the density distribution along each camera ray with a compact and continuous representation based on statistical moments. To this end, we analytically derive and compute a set of per-pixel moments from all contributing 3D Gaussians. From these moments, a continuous transmittance function is reconstructed for each ray, which is then independently sampled within each Gaussian. As a result, our method bridges the gap between rasterization and physical accuracy by modeling light attenuation in complex translucent media, significantly improving overall reconstruction and rendering quality.
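As a reader's aid for the order dependence mentioned in the abstract above, the following minimal sketch shows standard front-to-back alpha compositing on scalar colors. It is not the paper's moment-based method; the function name `composite_front_to_back` and the two example layers are hypothetical and chosen only for illustration.

```python
def composite_front_to_back(layers):
    """Blend (color, alpha) layers listed from nearest to farthest."""
    color = 0.0
    transmittance = 1.0  # fraction of light not yet blocked by nearer layers
    for layer_color, alpha in layers:
        color += transmittance * alpha * layer_color
        transmittance *= 1.0 - alpha
    return color, transmittance


# Hypothetical layers: a half-transparent white layer and a half-transparent black layer.
near_white = (1.0, 0.5)
far_black = (0.0, 0.5)

correct_order = composite_front_to_back([near_white, far_black])
swapped_order = composite_front_to_back([far_black, near_white])
print(correct_order, swapped_order)
```

The final transmittance is the same in both orders, but the blended color is not, which is why sorted alpha blending needs a correct per-pixel ordering of the primitives.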
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
comment: Project Page: https://aleafy.github.io/vrgbx
Particulate: Feed-Forward 3D Object Articulation
We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network's feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.
comment: Project page: https://ruiningli.com/particulate
AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis
The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real-world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot's kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.
comment: Project page: https://jay-ye.github.io/AnchorDream/
Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/.
comment: Project Website: https://sam2videox.github.io/
Uncertainty-Aware Domain Adaptation for Vitiligo Segmentation in Clinical Photographs
Accurately quantifying vitiligo extent in routine clinical photographs is crucial for longitudinal monitoring of treatment response. We propose a trustworthy, frequency-aware segmentation framework built on three synergistic pillars: (1) a data-efficient training strategy combining domain-adaptive pre-training on the ISIC 2019 dataset with an ROI-constrained dual-task loss to suppress background noise; (2) an architectural refinement via a ConvNeXt V2-based encoder enhanced with a novel High-Frequency Spectral Gating (HFSG) module and stem-skip connections to capture subtle textures; and (3) a clinical trust mechanism employing K-fold ensemble and Test-Time Augmentation (TTA) to generate pixel-wise uncertainty maps. Extensive validation on an expert-annotated clinical cohort demonstrates superior performance, achieving a Dice score of 85.05% and significantly reducing boundary error (95% Hausdorff Distance improved from 44.79 px to 29.95 px), consistently outperforming strong CNN (ResNet-50 and UNet++) and Transformer (MiT-B5) baselines. Notably, our framework demonstrates high reliability with zero catastrophic failures and provides interpretable entropy maps to identify ambiguous regions for clinician review. These results suggest that the proposed framework establishes a robust and reliable standard for automated vitiligo assessment.
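One common way to turn ensemble or TTA predictions into a pixel-wise entropy map is to average the foreground probabilities and take the binary entropy of the mean. The sketch below assumes that recipe; the abstract does not state the exact formula, and the names `binary_entropy` and `uncertainty_map` as well as the toy probability maps are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def uncertainty_map(prob_maps):
    """Average per-member probability maps and return (mean map, entropy map).

    prob_maps: list of H x W nested lists, one per ensemble member or TTA view,
    assumed to be already mapped back to the original image frame.
    """
    n = len(prob_maps)
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    mean = [[sum(m[r][c] for m in prob_maps) / n for c in range(width)]
            for r in range(height)]
    entropy = [[binary_entropy(p) for p in row] for row in mean]
    return mean, entropy


# Toy 1 x 2 example: members agree on the first pixel and disagree on the second.
members = [[[0.9, 0.2]], [[0.9, 0.8]]]
mean_map, entropy_map = uncertainty_map(members)
print(mean_map, entropy_map)
```

Pixels where members disagree end up near a mean probability of 0.5 and therefore near the maximum entropy of ln 2, which is the kind of region a clinician would be asked to review.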
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
comment: Project page: https://pq-yang.github.io/projects/MatAnyone2/
Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints
Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success significantly varies across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves high robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques balancing robustness and accuracy, and identify the most promising approaches for advancing this goal.
comment: This work has been accepted for publication in the 4th IEEE Conference on Secure and Trustworthy Machine Learning (IEEE SaTML 2026). The final version will be available on IEEE Xplore
Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting
Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: https://github.com/MohammadDehghan/InST-Microscopy.
comment: Accepted at ICMLA 2025
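For readers unfamiliar with Adaptive Instance Normalization (AdaIN), the sketch below applies the standard per-channel formula, sigma_s * (x - mu_c) / sigma_c + mu_s, to a single flattened feature channel. It illustrates only the statistic-matching step, not the paper's latent-space pipeline or the stochastic inversion; the function name `adain_channel` and the toy values are hypothetical.

```python
import math
import statistics


def adain_channel(content, style, eps=1e-5):
    """Re-normalize one content channel to the mean and std of a style channel."""
    mu_c = statistics.fmean(content)
    mu_s = statistics.fmean(style)
    # eps keeps the division stable when a channel is nearly constant
    sigma_c = math.sqrt(statistics.pvariance(content) + eps)
    sigma_s = math.sqrt(statistics.pvariance(style) + eps)
    return [sigma_s * (x - mu_c) / sigma_c + mu_s for x in content]


# Toy channels standing in for synthetic (content) and real (style) latents.
synthetic_channel = [0.0, 1.0, 2.0, 3.0]
real_channel = [10.0, 20.0, 30.0, 40.0]
stylized = adain_channel(synthetic_channel, real_channel)
print(stylized)
```

The output keeps the relative ordering of the content values (the structure) while adopting the first- and second-order statistics of the style channel.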
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.
comment: Code Repository: https://github.com/KlingTeam/SVG-T2I; Model Weights: https://huggingface.co/KlingTeam/SVG-T2I
mViSE: A Visual Search Engine for Analyzing Multiplex IHC Brain Tissue Images
Whole-slide multiplex imaging of brain tissue generates massive information-dense images that are challenging to analyze and require custom software. We present an alternative query-driven programming-free strategy using a multiplex visual search engine (mViSE) that learns the multifaceted brain tissue chemoarchitecture, cytoarchitecture, and myeloarchitecture. Our divide-and-conquer strategy organizes the data into panels of related molecular markers and uses self-supervised learning to train a multiplex encoder for each panel with explicit visual confirmation of successful learning. Multiple panels can be combined to process visual queries for retrieving similar communities of individual cells or multicellular niches using information-theoretic methods. The retrievals can be used for diverse purposes, including tissue exploration, delineating brain regions and cortical cell layers, and profiling and comparing brain regions, all without computer programming. We validated mViSE's ability to retrieve single cells, proximal cell pairs, and tissue patches, and to delineate cortical layers, brain regions, and sub-regions. mViSE is provided as an open-source QuPath plug-in.
Weak-to-Strong Generalization Enables Fully Automated De Novo Training of Multi-head Mask-RCNN Model for Segmenting Densely Overlapping Cell Nuclei in Multiplex Whole-slice Brain Images
We present a weak-to-strong generalization methodology for fully automated training of a multi-head extension of the Mask-RCNN method with efficient channel attention for reliable segmentation of overlapping cell nuclei in multiplex cyclic immunofluorescent (IF) whole-slide images (WSI), and present evidence for pseudo-label correction and coverage expansion, the key phenomena underlying weak-to-strong generalization. This method can learn to segment de novo a new class of images from a new instrument and/or a new imaging protocol without the need for human annotations. We also present metrics for automated self-diagnosis of segmentation quality in production environments, where human visual proofreading of massive WSIs is unaffordable. Our method was benchmarked against five current widely used methods and showed a significant improvement. The code, sample WSIs, and high-resolution segmentation results are provided in open form for community adoption and adaptation.
Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation
Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io
Referring Change Detection in Remote Sensing Imagery
Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only the pre-change image, without relying on semantic segmentation masks, thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website: https://yilmazkorkmaz1.github.io/RCD.
comment: 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similar performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
Particle Image Velocimetry Refinement via Consensus ADMM
Particle Image Velocimetry (PIV) is an imaging technique in experimental fluid dynamics that quantifies flow fields around bluff bodies by analyzing the displacement of neutrally buoyant tracer particles immersed in the fluid. Traditional PIV approaches typically depend on tuning parameters specific to the imaging setup, making the performance sensitive to variations in illumination, flow conditions, and seeding density. On the other hand, even state-of-the-art machine learning methods for flow quantification are fragile outside their training set. In our experiments, we observed that flow quantification would improve if different tunings (or algorithms) were applied to different regions of the same image pair. In this work, we parallelize the instantaneous flow quantification with multiple algorithms and adopt a consensus framework based on the alternating direction method of multipliers, seamlessly incorporating priors such as smoothness and incompressibility. We perform several numerical experiments to demonstrate the benefits of this approach. For instance, we reduce the end-point error of a dense-inverse-search estimator by up to 20% at an inference rate of 60 Hz, and we show how this performance boost can be increased further with outlier rejection. Our method is implemented in JAX, effectively exploiting hardware acceleration, and integrated in Flow Gym, enabling (i) reproducible comparisons with the state-of-the-art, (ii) testing different base algorithms, and (iii) straightforward deployment for active fluids control applications.
comment: Code: https://github.com/antonioterpin/flowgym
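To make the consensus idea concrete, here is a toy scalar version of global consensus ADMM that fuses several estimates of one flow component. It is a minimal sketch under strong assumptions: each local cost is a weighted quadratic around one estimator's output, no smoothness or incompressibility prior is included, and the names `consensus_admm`, `estimates`, and `weights` are hypothetical. It is written in plain Python rather than JAX and is not the Flow Gym implementation.

```python
def consensus_admm(estimates, weights, rho=1.0, iterations=200):
    """Fuse scalar estimates a_i by solving
        min sum_i (w_i / 2) * (x_i - a_i)**2   subject to   x_i = z for all i
    with the scaled-form consensus ADMM updates."""
    n = len(estimates)
    u = [0.0] * n  # scaled dual variables, one per local copy
    z = 0.0        # consensus variable
    for _ in range(iterations):
        # Local step: closed-form minimizer of each quadratic plus the penalty term.
        x = [(w * a + rho * (z - ui)) / (w + rho)
             for a, w, ui in zip(estimates, weights, u)]
        # Consensus step: average of the local copies plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate the remaining disagreement.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z


# Three hypothetical estimators; the third is trusted twice as much.
fused = consensus_admm([1.0, 2.0, 4.0], [1.0, 1.0, 2.0])
print(fused)
```

For this quadratic toy problem the consensus value converges to the weighted mean of the estimates; the appeal of the ADMM form is that the local steps are independent and additional priors can be attached as extra terms.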
Text images processing system using artificial intelligence models
We present a text image classifier device that identifies textual content in images and assigns each image to one of four predefined categories: Invoice, Form, Letter, or Report. The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode, which renders feeds from cameras connected to it. Its design targets practical challenges such as changing light, random orientation, curved or partially covered text, low resolution, and barely visible text. Processing is divided into four steps: image acquisition and preprocessing; detection of textual elements with the DBNet++ (Differentiable Binarization Network Plus) model; classification of the detected textual elements with the BART (Bidirectional and Auto-Regressive Transformers) model; and presentation of the results through a user interface written in Python and PyQt5. The stages are connected to form a single continuous workflow. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the Total-Text dataset, which includes high-resolution images created to represent a wide range of problematic conditions. These experimental results support the effectiveness of the proposed methodology for practical, mixed-source text categorization, even in uncontrolled imaging conditions.
comment: 8 pages, 12 figures, article
Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection
Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth-Copy-Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth-Copy-Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing
Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.
comment: 12 pages, 5 figures
Stochastics of shapes and Kunita flows
Stochastic processes of evolving shapes are used in applications including evolutionary biology, where morphology changes stochastically as a function of evolutionary processes. Due to the non-linear and often infinite-dimensional nature of shape spaces, the mathematical construction of suitable stochastic shape processes is far from immediate. We define and formalize properties that stochastic shape processes should ideally satisfy to be compatible with the shape structure, and we link this to Kunita flows that, when acting on shape spaces, induce stochastic processes that satisfy these criteria by their construction. We couple this with a survey of other relevant shape stochastic processes and show how bridge sampling techniques can be used to condition shape stochastic processes on observed data thereby allowing for statistical inference of parameters of the stochastic dynamics.
Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers an accuracy improvement of 23.1 percentage points. Animated illustrations and supplementary materials are available at (https://lucazzola.github.io/publications/kinemic).
FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint
We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
comment: Project page: https://tangjiapeng.github.io/FactorPortrait/
Fast and Explicit: Slice-to-Volume Reconstruction via 3D Gaussian Primitives with Analytic Point Spread Function Modeling
Recovering high-fidelity 3D images from sparse or degraded 2D images is a fundamental challenge in medical imaging, with broad applications ranging from 3D ultrasound reconstruction to MRI super-resolution. In the context of fetal MRI, high-resolution 3D reconstruction of the brain from motion-corrupted low-resolution 2D acquisitions is a prerequisite for accurate neurodevelopmental diagnosis. While implicit neural representations (INRs) have recently established state-of-the-art performance in self-supervised slice-to-volume reconstruction (SVR), they suffer from a critical computational bottleneck: accurately modeling the image acquisition physics requires expensive stochastic Monte Carlo sampling to approximate the point spread function (PSF). In this work, we propose a shift from neural network based implicit representations to Gaussian based explicit representations. By parameterizing the HR 3D image volume as a field of anisotropic Gaussian primitives, we leverage the property of Gaussians being closed under convolution and thus derive a closed-form analytical solution for the forward model. This formulation reduces the previously intractable acquisition integral to an exact covariance addition ($\mathbf{\Sigma}_{obs} = \mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$), effectively bypassing the need for compute-intensive stochastic sampling while ensuring exact gradient propagation. We demonstrate that our approach matches the reconstruction quality of self-supervised state-of-the-art SVR frameworks while delivering a 5$\times$--10$\times$ speed-up on neonatal and fetal data. With convergence often reached in under 30 seconds, our framework paves the way towards translation into clinical routine of real-time fetal 3D MRI. Code will be public at https://github.com/m-dannecker/Gaussian-Primitives-for-Fast-SVR.
comment: Under Review for MIDL 2026
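The covariance-addition identity quoted in the abstract above rests on the fact that convolving two Gaussians yields a Gaussian whose covariance is the sum of the two. The following is a minimal sketch that checks this numerically in 1D and shows the matrix-level addition; it is not the authors' implementation, and the function names (`gaussian_pdf`, `add_covariances`, `numeric_convolution_at`) are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def add_covariances(sigma_hr, sigma_psf):
    """Element-wise sum of two square covariance matrices (lists of rows).

    Illustrates Sigma_obs = Sigma_HR + Sigma_PSF; both matrices are assumed
    to be expressed in the same coordinate frame.
    """
    return [[a + b for a, b in zip(row_hr, row_psf)]
            for row_hr, row_psf in zip(sigma_hr, sigma_psf)]


def numeric_convolution_at(x, var_a, var_b, half_width=12.0, steps=4801):
    """Riemann-sum approximation of (N(0, var_a) * N(0, var_b)) evaluated at x."""
    dt = 2.0 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        t = -half_width + i * dt
        total += gaussian_pdf(t, 0.0, var_a) * gaussian_pdf(x - t, 0.0, var_b)
    return total * dt


# 1D check: the brute-force convolution agrees with a Gaussian of summed variance.
numeric = numeric_convolution_at(0.7, var_a=0.5, var_b=1.5)
closed_form = gaussian_pdf(0.7, 0.0, 0.5 + 1.5)
print(abs(numeric - closed_form) < 1e-6)
```

The closed form replaces the integration loop with a single matrix addition per primitive, which is the source of the sampling-free forward model the abstract describes.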
Embodied Image Compression
Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to Embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for Embodied agents, thereby accelerating the deployment of Embodied AI in the real world.
comment: 15 pages, 12 figures, 3 tables
Using GUI Agent for Electronic Design Automation
Graphical User Interface (GUI) agents adopt an end-to-end paradigm that maps a screenshot to an action sequence, thereby automating repetitive tasks in virtual environments. However, existing GUI agents are evaluated almost exclusively on commodity software such as Microsoft Word and Excel. Professional Computer-Aided Design (CAD) suites promise an order-of-magnitude higher economic return, yet remain the weakest performance domain for existing agents and are still far from replacing expert Electronic-Design-Automation (EDA) engineers. We therefore present the first systematic study that deploys GUI agents for EDA workflows. Our contributions are: (1) a large-scale dataset named GUI-EDA, including 5 CAD tools and 5 physical domains, comprising 2,000+ high-quality screenshot-answer-action pairs recorded by EDA scientists and engineers during real-world component design; (2) a comprehensive benchmark that evaluates 30+ mainstream GUI agents, demonstrating that EDA tasks constitute a major, unsolved challenge; and (3) an EDA-specialized metric named EDAgent, equipped with a reflection mechanism that achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students majoring in Electrical Engineering. This work extends GUI agents from generic office automation to specialized, high-value engineering domains and offers a new avenue for advancing EDA productivity. The dataset will be released at: https://github.com/aiben-ch/GUI-EDA.
comment: 17 pages, 15 figures, 8 tables
Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model
The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.
comment: Code and pretrained models available at https://github.com/SamGijsen/Brain-Semantoks
In-Context Learning for Seismic Data Processing
Seismic processing transforms raw data into subsurface images essential for geophysical applications. Traditional methods face challenges such as noisy data and manual parameter tuning. Recently, deep learning approaches have offered alternative solutions to some of these problems. However, existing deep learning approaches still suffer from spatially inconsistent results across neighboring seismic gathers and a lack of user control. We address these limitations by introducing ContextSeisNet, an in-context learning model, to seismic demultiple processing. Our approach conditions predictions on a support set of spatially related example pairs: neighboring common-depth point gathers from the same seismic line and their corresponding labels. This allows the model to learn task-specific processing behavior at inference time by observing how similar gathers should be processed, without any retraining. This method provides both flexibility through user-defined examples and improved lateral consistency across seismic lines. On synthetic data, ContextSeisNet outperforms a U-Net baseline quantitatively and demonstrates enhanced spatial coherence between neighboring gathers. On field data, our model achieves superior lateral consistency compared to both traditional Radon demultiple and the U-Net baseline. Relative to the U-Net, ContextSeisNet also delivers improved near-offset performance and more complete multiple removal. Notably, ContextSeisNet achieves comparable field data performance despite being trained on 90% less data, demonstrating substantial data efficiency. These results establish ContextSeisNet as a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks.
comment: Source code available under https://codeberg.org/fuchsfa/in-context-learning-seismic
Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis NeurIPS 2025
Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images of objects at specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in four categories (easy, medium, hard, and extreme) based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
comment: NeurIPS 2025 UniReps workshop
Multi-temporal Calving Front Segmentation
The calving fronts of marine-terminating glaciers undergo constant changes. These changes significantly affect the glacier's mass and dynamics, demanding continuous monitoring. To address this need, deep learning models were developed that can automatically delineate the calving front in Synthetic Aperture Radar imagery. However, these models often struggle to correctly classify areas affected by seasonal conditions such as ice melange or snow-covered surfaces. To address this issue, we propose to process multiple frames from a satellite image time series of the same glacier in parallel and exchange temporal information between the corresponding feature maps to stabilize each prediction. We integrate our approach into the current state-of-the-art architecture Tyrion and accomplish a new state-of-the-art performance on the CaFFe benchmark dataset. In particular, we achieve a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6.
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, we construct the largest annotated multimodal dataset for dentistry to date by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.
3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation AAAI 2026
3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, demonstrating a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2's performance depends on input prompts and its initial outputs often have deficiencies, and given its class-agnostic nature, we introduce three light-weight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2's initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2's image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes, establishing a new state-of-the-art in the field.
comment: Accepted by AAAI 2026
SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2 MICCAI 2025
Despite the success of deep-learning-based models in medical image segmentation, most state-of-the-art (SOTA) methods perform fully-supervised learning, which commonly relies on large-scale annotated training datasets. However, medical image annotation is highly time-consuming, hindering its clinical applications. Semi-supervised learning (SSL) has emerged as an appealing strategy for training with limited annotations, largely reducing the labelling cost. We propose a novel SSL framework, SSL-MedSAM2, which contains a training-free few-shot learning branch, TFFS-MedSAM2, based on the pretrained large foundation model Segment Anything Model 2 (SAM2) for pseudo label generation, and an iterative fully-supervised learning branch, FSL-nnUNet, based on nnUNet for pseudo label refinement. The results on the MICCAI2025 challenge CARE-LiSeg (Liver Segmentation) demonstrate an outstanding performance of SSL-MedSAM2 among other methods. The average dice scores on the test set in GED4 and T1 MRI are 0.9710 and 0.9648 respectively, and the Hausdorff distances are 20.07 and 21.97 respectively. The code is available via https://github.com/naisops/SSL-MedSAM2/tree/main.
comment: Accepted by MICCAI 2025 CARE Challenge, waiting for publication
Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models
Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$α$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$α$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of T2I models.
HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
comment: 18 pages, 8 figures
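The set-level selection idea in the HFS abstract above can be illustrated with a small sketch: a Gumbel-Softmax relaxation that produces soft frame choices, and a set-level score that rewards relevance and temporal coverage while penalizing pairwise redundancy. The abstract does not give the exact form of the objective, so the particular terms below (weighted timestamp variance as coverage, weighted pairwise similarity as redundancy) and all names and default weights are assumptions for illustration only.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def set_score(weights, relevance, similarity, times, lam_red=1.0, lam_cov=1.0):
    """Hypothetical set-level objective over soft selection weights.

    relevance  : per-frame query relevance
    similarity : symmetric matrix of pairwise frame similarity
    times      : normalized frame timestamps in [0, 1]
    """
    n = len(weights)
    rel = sum(w * r for w, r in zip(weights, relevance))
    red = sum(weights[i] * weights[j] * similarity[i][j]
              for i in range(n) for j in range(i + 1, n))
    mass = sum(weights)
    mean_t = sum(w * t for w, t in zip(weights, times)) / mass
    cov = sum(w * (t - mean_t) ** 2 for w, t in zip(weights, times)) / mass
    return rel - lam_red * red + lam_cov * cov


relevance = [0.9, 0.9, 0.8]
times = [0.0, 0.1, 1.0]
similarity = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]

# Two near-duplicate neighbors score lower than a spread-out pair,
# even though each of them is individually more relevant.
clustered = set_score([1.0, 1.0, 0.0], relevance, similarity, times)
spread = set_score([1.0, 0.0, 1.0], relevance, similarity, times)
print(spread > clustered)

# A soft selection drawn from per-frame logits (seeded for repeatability).
soft_pick = gumbel_softmax([0.2, 1.5, -0.3], temperature=0.5, rng=random.Random(0))
```

Because `set_score` is a smooth function of the weights, gradients can flow from the set-level score back to the frame logits through the relaxed samples, which is what independent top-K scoring cannot provide.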
Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems
The growing demand for real-time DNN applications on edge devices necessitates faster inference of increasingly complex models. Although many devices include specialized accelerators (e.g., mobile GPUs), dynamic control-flow operators and unsupported kernels often fall back to CPU execution. Existing frameworks handle these fallbacks poorly, leaving CPU cores idle and causing high latency and memory spikes. We introduce Parallax, a framework that accelerates mobile DNN inference without model refactoring or custom operator implementations. Parallax first partitions the computation DAG to expose parallelism, then employs branch-aware memory management with dedicated arenas and buffer reuse to reduce runtime footprint. An adaptive scheduler executes branches according to device memory constraints, while fine-grained subgraph control enables heterogeneous inference of dynamic models. By evaluating on five representative DNNs across three different mobile devices, Parallax achieves up to 46% latency reduction, maintains controlled memory overhead (26.5% on average), and delivers up to 30% energy savings compared with state-of-the-art frameworks, offering improvements aligned with the responsiveness demands of real-time mobile inference.
Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using LiDAR HD Reference Data across Metropolitan France
Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.62 m, 2.72 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: https://github.com/Global-Earth-Observation/threasure-net.
Reconstruction as a Bridge for Event-Based Visual Question Answering
Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
On Geometric Understanding and Learned Data Priors in VGGT
The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass. Trained in a supervised, single-step fashion on large datasets, VGGT raises a key question: does it build upon geometric concepts like traditional multi-view methods, or does it rely primarily on learned appearance-based data-driven priors? In this work, we conduct a systematic analysis of VGGT's internal mechanisms to uncover whether geometric understanding emerges within its representations. By probing intermediate features, analyzing attention patterns, and performing interventions, we examine how the model implements its functionality. Our findings reveal that VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. We further investigate VGGT's dependence on its learned data priors. Using spatial input masking and perturbation experiments, we assess its robustness to occlusions, appearance variations, and camera configurations, comparing it with classical multi-stage pipelines. Together, these insights highlight how VGGT internalizes geometric structure while using learned data-driven priors.
SSA3D: Text-Conditioned Assisted Self-Supervised Framework for Automatic Dental Abutment Design
Abutment design is a critical step in dental implant restoration. However, manual design involves tedious measurement and fitting, and research on automating this process with AI is limited, due to the unavailability of large annotated datasets. Although self-supervised learning (SSL) can alleviate data scarcity, its need for pre-training and fine-tuning results in high computational costs and long training times. In this paper, we propose a Self-supervised assisted automatic abutment design framework (SS$A^3$D), which employs a dual-branch architecture with a reconstruction branch and a regression branch. The reconstruction branch learns to restore masked intraoral scan data and transfers the learned structural information to the regression branch. The regression branch then predicts the abutment parameters under supervised learning, which eliminates the separate pre-training and fine-tuning process. We also design a Text-Conditioned Prompt (TCP) module to incorporate clinical information (such as implant location, system, and series) into SS$A^3$D. This guides the network to focus on relevant regions and constrains the parameter predictions. Extensive experiments on a collected dataset show that SS$A^3$D saves half of the training time and achieves higher accuracy than traditional SSL methods. It also achieves state-of-the-art performance compared to other methods, significantly improving the accuracy and efficiency of automated abutment design.
TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition
Skeleton-based action recognition has garnered significant attention in the computer vision community. Inspired by the recent success of the selective state-space model (SSM) Mamba in modeling 1D temporal sequences, we propose TSkel-Mamba, a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. In particular, our approach leverages Spatial Transformer for spatial feature learning while utilizing Mamba for temporal modeling. Mamba, however, employs separate SSM blocks for individual channels, which inherently limits its ability to model inter-channel dependencies. To better adapt Mamba for skeleton data and enhance Mamba's ability to model temporal dependencies, we introduce a Temporal Dynamic Modeling (TDM) block, which is a versatile plug-and-play component that integrates a novel Multi-scale Temporal Interaction (MTI) module. The MTI module employs multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition. Extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA and UAV-Human datasets demonstrate that TSkel-Mamba achieves state-of-the-art performance while maintaining low inference time, making it both efficient and highly effective.
VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce RSMEB, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves 26.6% P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
comment: 21 pages, 7 figures, under review
CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop NeurIPS 2025
A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's structure, 2) ensuring each edit's semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.
comment: NeurIPS 2025
DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation AAAI-26
Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.
comment: AAAI-26
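The Zipf-Sinkhorn idea in the DOS abstract can be illustrated with a small sketch: a Sinkhorn-Knopp normalization whose prototype marginal follows a power law instead of being uniform. This is only an illustration under stated assumptions, not the authors' implementation; the function names, the Zipf exponent, and the temperature `epsilon` are hypothetical choices.

```python
import math


def zipf_marginal(num_prototypes, exponent=1.0):
    """Power-law prior over prototype usage: mass of rank k is proportional to 1 / k**exponent."""
    weights = [1.0 / (k ** exponent) for k in range(1, num_prototypes + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sinkhorn_with_prior(scores, prototype_marginal, epsilon=0.5, num_iters=50):
    """Turn point-to-prototype scores into soft assignments.

    Columns (prototypes) are scaled toward `prototype_marginal`; rows (points)
    are scaled so every point carries equal mass. A smaller `epsilon` gives
    sharper assignments. All defaults are illustrative assumptions.
    """
    n = len(scores)
    k = len(prototype_marginal)
    # Subtract the global maximum before exponentiating for numerical stability.
    max_score = max(max(row) for row in scores)
    q = [[math.exp((s - max_score) / epsilon) for s in row] for row in scores]
    for _ in range(num_iters):
        # Column step: total mass of prototype j becomes prototype_marginal[j].
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        for i in range(n):
            for j in range(k):
                q[i][j] *= prototype_marginal[j] / col_sums[j]
        # Row step: total mass of each point becomes 1 / n.
        for i in range(n):
            row_sum = sum(q[i])
            q[i] = [v / (n * row_sum) for v in q[i]]
    # Rescale so that each row is a probability distribution over prototypes.
    return [[v * n for v in row] for row in q]
```

With a uniform marginal this reduces to the usual balanced Sinkhorn-Knopp assignment; the Zipfian marginal lets a few prototypes absorb most points, which is one plausible way to encode unbalanced semantics.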
Exploring MLLM-Diffusion Information Transfer with MetaCanvas
Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.
comment: Project page: https://metacanvas.github.io
Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
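The retrieval-and-fusion step described in the Skeleton-Cache abstract can be sketched roughly as follows. This is a minimal illustration, not the released code: the cache layout (descriptor, class index), the exponential similarity kernel, the `sharpness` value, and the way weights are supplied are all assumptions; in the paper the class-specific weights come from an LLM.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def cache_scores(query, cache, num_classes, sharpness=5.0):
    """Accumulate similarity-weighted votes from cached (descriptor, class_index) pairs."""
    scores = [0.0] * num_classes
    for descriptor, class_index in cache:
        affinity = math.exp(-sharpness * (1.0 - cosine_similarity(query, descriptor)))
        scores[class_index] += affinity
    return scores


def fuse_descriptor_scores(per_descriptor_scores, class_weights):
    """Weighted sum over descriptors, using one weight per (descriptor, class)."""
    num_classes = len(next(iter(per_descriptor_scores.values())))
    fused = [0.0] * num_classes
    for name, scores in per_descriptor_scores.items():
        for c in range(num_classes):
            fused[c] += class_weights[name][c] * scores[c]
    return fused
```

Because nothing here is trained, changing the class weights alone changes which descriptor dominates the final prediction, which is the role the abstract assigns to the LLM-guided priors.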
YawDD+: Frame-level Annotations for Accurate Yawn Prediction
Driver fatigue remains a leading cause of road accidents, with 24% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP. The resulting approach delivers up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.
comment: This paper is submitted at European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2026
Flowception: Temporally Expansive Flow Matching for Video Generation
We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces training FLOPs three-fold, is more amenable to local attention variants, and allows the length of videos to be learned jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.
Back to the Baseline: Examining Baseline Effects on Explainability Metrics
Attribution methods are among the most prevalent techniques in Explainable Artificial Intelligence (XAI) and are usually evaluated and compared using Fidelity metrics, with Insertion and Deletion being the most popular. These metrics rely on a baseline function to alter the pixels of the input image that the attribution map deems most important. In this work, we highlight a critical problem with these metrics: the choice of a given baseline will inevitably favour certain attribution methods over others. More concerningly, even a simple linear model with commonly used baselines contradicts itself by designating different optimal methods. A question then arises: which baseline should we use? We propose to study this problem through two desirable properties of a baseline: (i) that it removes information and (ii) that it does not produce overly out-of-distribution (OOD) images. We first show that none of the tested baselines satisfy both criteria, and there appears to be a trade-off among current baselines: either they remove information or they produce a sequence of OOD images. Finally, we introduce a novel baseline by leveraging recent work in feature visualisation to artificially produce a model-dependent baseline that removes information without being overly OOD, thus improving on the trade-off when compared to other existing baselines. Our code is available at https://github.com/deel-ai-papers/Back-to-the-Baseline
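To make the role of the baseline concrete, here is a minimal sketch of a Deletion-style fidelity score on a flat feature vector. It is not the authors' code: the `model` callable, the trapezoidal area, and the per-feature `baseline` list are illustrative assumptions.

```python
def deletion_curve(model, x, attribution, baseline):
    """Replace features by their baseline value, most important first, and record the model output."""
    order = sorted(range(len(x)), key=lambda i: attribution[i], reverse=True)
    current = list(x)
    scores = [model(current)]
    for i in order:
        current[i] = baseline[i]
        scores.append(model(current))
    return scores


def area_under_curve(scores):
    """Trapezoidal area under the curve, normalized by the number of deletion steps."""
    steps = len(scores) - 1
    return sum((scores[i] + scores[i + 1]) / 2 for i in range(steps)) / steps


# Toy linear model; a lower Deletion area means the attribution ranked features better.
weights = [3, 1, 2]


def linear_model(features):
    return sum(w * f for w, f in zip(weights, features))
```

Swapping the `baseline` argument (for example zeros versus a per-feature mean) changes every point of the curve, which is why the ranking of attribution methods can depend on that choice.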
JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion
Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion. However, these methods suffer from error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model runs at 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.
Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection
Industrial anomaly detection is a challenging open-set task that aims to identify unknown anomalous patterns deviating from the normal data distribution. To avoid the significant memory consumption and limited generalizability brought by building separate models per class, we focus on developing a unified framework for multi-class anomaly detection. However, under this challenging setting, conventional reconstruction-based networks often suffer from an identity mapping problem, where they directly replicate input features regardless of whether they are normal or anomalous, resulting in detection failures. To address this issue, this study proposes a novel framework termed Collaborative Reconstruction and Repair (CRR), which transforms reconstruction into repair. First, we optimize the decoder to reconstruct normal samples while repairing synthesized anomalies. Consequently, it generates distinct representations for anomalous regions and similar representations for normal areas compared to the encoder's output. Second, we implement feature-level random masking to ensure that the representations from the decoder contain sufficient local information. Finally, to minimize detection errors arising from the discrepancies between feature representations from the encoder and decoder, we train a segmentation network supervised by synthetic anomaly masks, thereby enhancing localization performance. Extensive experiments on industrial datasets demonstrate that CRR effectively mitigates the identity mapping issue and achieves state-of-the-art performance in multi-class industrial anomaly detection.
comment: Accepted to Data Intelligence 2025
Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
Vision-Language Models (VLMs) are able to process increasingly long videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has improved remarkably, especially for simple editing that contains only a single editing target. As editing requirements grow rapidly, complex editing, which contains multiple editing targets, poses a more challenging task. However, current complex editing solutions, single-round and multi-round editing, are limited by long text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency. In this paper, we propose FlowDC, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observe that the velocity component orthogonal to the editing displacement harms preservation of the source structure. Thus, we decompose the velocity and decay the orthogonal part for better source consistency. To evaluate the effectiveness of complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.
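The decompose-and-decay step mentioned in the FlowDC abstract can be pictured as a plain vector projection. The sketch below is an assumption-laden illustration rather than the paper's method: it treats the velocity and the editing displacement as flat lists of floats, and the constant `decay` factor is hypothetical (the abstract does not state the decay schedule).

```python
def decay_orthogonal_velocity(velocity, displacement, decay=0.5):
    """Keep the velocity component along `displacement` and shrink the orthogonal remainder.

    velocity, displacement: equal-length lists of floats (flattened latents).
    decay: factor in [0, 1] applied to the orthogonal component (illustrative value).
    """
    disp_norm_sq = sum(d * d for d in displacement)
    if disp_norm_sq == 0.0:
        # No editing direction is defined, so the velocity is returned unchanged.
        return list(velocity)
    scale = sum(v * d for v, d in zip(velocity, displacement)) / disp_norm_sq
    parallel = [scale * d for d in displacement]
    orthogonal = [v - p for v, p in zip(velocity, parallel)]
    return [p + decay * o for p, o in zip(parallel, orthogonal)]
```

Setting `decay=1.0` returns the original velocity, while `decay=0.0` keeps only the motion along the editing displacement.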
The N-Body Problem: Parallel Execution from Single-Person Egocentric Video
Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object and causal conflicts by 55%, 45% and 55% respectively.
comment: project webpage: https://zhifanzhu.github.io/ego-nbody
Out-of-Distribution Segmentation via Wasserstein-Based Evidential Uncertainty
Deep neural networks achieve superior performance in semantic segmentation, but are limited to a predefined set of classes, which leads to failures when they encounter unknown objects in open-world scenarios. Recognizing and segmenting these out-of-distribution (OOD) objects is crucial for safety-critical applications such as automated driving. In this work, we present an evidence segmentation framework using a Wasserstein loss, which captures distributional distances while respecting the probability simplex geometry. Combined with Kullback-Leibler regularization and Dice structural consistency terms, our approach leads to improved OOD segmentation performance compared to uncertainty-based approaches.
Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection
Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations. Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: https://github.com/akuan1234/ARNet-v2.
comment: 15 pages, 9 figures
Reliable Detection of Minute Targets in High-Resolution Aerial Imagery across Temporal Shifts
Efficient crop detection via Unmanned Aerial Vehicles is critical for scaling precision agriculture, yet it remains challenging due to the small scale of targets and environmental variability. This paper addresses the detection of rice seedlings in paddy fields by leveraging a Faster R-CNN architecture initialized via transfer learning. To overcome the specific difficulties of detecting minute objects in high-resolution aerial imagery, we curate a significant UAV dataset for training and rigorously evaluate the model's generalization capabilities. Specifically, we validate performance across three distinct test sets acquired at different temporal intervals, thereby assessing robustness against varying imaging conditions. Our empirical results demonstrate that transfer learning not only facilitates the rapid convergence of object detection models in agricultural contexts but also yields consistent performance despite domain shifts in image acquisition.
Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video
We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.
A Multi-Mode Structured Light 3D Imaging System with Multi-Source Information Fusion for Underwater Pipeline Detection
Underwater pipelines are highly susceptible to corrosion, which not only shortens their service life but also poses significant safety risks. Compared with manual inspection, an intelligent real-time imaging system for underwater pipeline detection is a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can recover sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a fast distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the presence of numerous disturbances in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed. This algorithm integrates a pipeline edge detection network with enhanced point cloud registration to achieve robust and high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability and robustness, providing a solid foundation for autonomous underwater pipeline detection.
Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture
Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision-language models (VLMs) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.
Task-Specific Distance Correlation Matching for Few-Shot Action Recognition
Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work that fine-tunes via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses $α$-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better $α$-distance correlation estimation under limited supervision. Extensive experiments on five widely-used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-art methods.
comment: 9 pages, 4 figures, conference
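For readers unfamiliar with the statistic named in the TS-FSAR abstract, the sketch below computes the sample distance correlation of Székely et al., with pairwise Euclidean distances raised to a power `alpha`. It illustrates the general statistic only; how the paper applies it to frame features, and which `alpha` it uses, are not stated in the abstract, so the function names and defaults here are assumptions.

```python
import math


def _centered_distances(samples, alpha):
    """Pairwise distances raised to `alpha`, then double-centered."""
    n = len(samples)
    dist = [[math.dist(samples[i], samples[j]) ** alpha for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y, alpha=1.0):
    """Sample distance correlation between paired samples x and y (sequences of tuples)."""
    n = len(x)
    a = _centered_distances(x, alpha)
    b = _centered_distances(y, alpha)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov_sq = mean_product(a, b)
    dvar_x = mean_product(a, a)
    dvar_y = mean_product(b, b)
    if dvar_x <= 0.0 or dvar_y <= 0.0:
        return 0.0
    return math.sqrt(max(dcov_sq, 0.0) / math.sqrt(dvar_x * dvar_y))
```

Unlike cosine similarity or Pearson correlation, this statistic stays clearly positive for a purely nonlinear relation such as y = x^2 on symmetric inputs, where the Pearson correlation is zero.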
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.
comment: 22 pages, 13 figures, technical report
FreqDINO: Frequency-Guided Adaptation for Generalized Boundary-Aware Ultrasound Image Segmentation
Ultrasound image segmentation is pivotal for clinical diagnosis, yet challenged by speckle noise and imaging artifacts. Recently, DINOv3 has shown remarkable promise in medical image segmentation with its powerful representation capabilities. However, DINOv3, pre-trained on natural images, lacks sensitivity to ultrasound-specific boundary degradation. To address this limitation, we propose FreqDINO, a frequency-guided segmentation framework that enhances boundary perception and structural consistency. Specifically, we devise a Multi-scale Frequency Extraction and Alignment (MFEA) strategy to separate low-frequency structures and multi-scale high-frequency boundary details, and align them via learnable attention. We also introduce a Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components and refines spatial features. Furthermore, we design a Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions. Extensive experiments demonstrate that FreqDINO surpasses state-of-the-art methods and achieves remarkable generalization capability. The code is at https://github.com/MingLang-FD/FreqDINO.
Physics-Informed Video Flare Synthesis and Removal Leveraging Motion Independence between Flare and Scene
Lens flare is a degradation phenomenon caused by strong light sources. Existing research on flare removal has mainly focused on images, while the spatiotemporal characteristics of video flare remain largely unexplored. Video flare synthesis and removal pose significantly greater challenges than their image counterparts, owing to the complex and mutually independent motion of flare, light sources, and scene content. This motion independence further affects restoration performance, often resulting in flicker and artifacts. To address this issue, we propose a physics-informed dynamic flare synthesis pipeline, which simulates light source motion using optical flow and models the temporal behaviors of both scattering and reflective flares. Meanwhile, we design a video flare removal network that employs an attention module to spatially suppress flare regions and incorporates a Mamba-based temporal modeling component to capture long-range spatiotemporal dependencies. This motion-independent spatiotemporal representation effectively eliminates the need for multi-frame alignment, alleviating temporal aliasing between flares and scene content and thereby improving video flare removal performance. Building upon this, we construct the first video flare dataset to comprehensively evaluate our method, which includes a large set of synthetic paired videos and additional real-world videos collected from the Internet to assess generalization capability. Extensive experiments demonstrate that our method consistently outperforms existing video-based restoration and image-based flare removal methods on both real and synthetic videos, effectively removing dynamic flares while preserving light source integrity and maintaining the spatiotemporal consistency of the scene.
MLLM Machine Unlearning via Visual Knowledge Distillation
Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.
KeyframeFace: From Text to Expressive Facial Keyframes
Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit's coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.
SATMapTR: Satellite Image Enhanced Online HD Map Construction
High-definition (HD) maps are evolving from pre-annotated to real-time construction to better support autonomous driving in diverse scenarios. However, this process is hindered by low-quality input data caused by the limited capability of onboard sensors and frequent occlusions, leading to incomplete, noisy, or missing data, and thus reduced mapping accuracy and robustness. Recent efforts have introduced satellite images as auxiliary input, offering a stable, wide-area view to complement the limited ego perspective. However, satellite images in Bird's Eye View are often degraded by shadows and occlusions from vegetation and buildings. Prior methods using basic feature extraction and fusion remain ineffective. To address these challenges, we propose SATMapTR, a novel online map construction model that effectively fuses satellite images through two key components: (1) a gated feature refinement module that adaptively filters satellite image features by integrating high-level semantics with low-level structural cues to extract high signal-to-noise ratio map-relevant representations; and (2) a geometry-aware fusion module that consistently fuses satellite and BEV features at a grid-to-grid level, minimizing interference from irrelevant regions and low-quality inputs. Experimental results on the nuScenes dataset show that SATMapTR achieves the highest mean average precision (mAP) of 73.8, outperforming state-of-the-art satellite-enhanced models by up to 14.2 mAP. It also shows lower mAP degradation under adverse weather and sensor failures, and achieves nearly 3 times higher mAP at extended perception ranges.
comment: 9 pages (+ 3 pages of Appendix)
MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction ACM MM 2025
Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline, achieving sub-millisecond temporal synchronization across views, coupled with accurate pose annotations. Experimental validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research.
comment: ACM MM 2025 Dataset Track
Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining
Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated against a zero-shot VLM on multiple scenarios of incorrect G-code and HMI errors, using per-slot accuracy as the metric. The results showed that few-shot prompting improved the detection of HMI errors and of discrepancies with the G-code, enabling more comprehensive debugging. The proposed framework was therefore demonstrated to be suitable for verifying the manually generated G-code that is typically developed in CNC training.
Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context
Video autoencoders compress videos into compact latent representations for efficient reconstruction, playing a vital role in enhancing the quality and efficiency of video generation. However, existing video autoencoders often entangle spatial and temporal information, limiting their ability to capture temporal consistency and leading to suboptimal performance. To address this, we propose Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner, allowing flexible processing of videos with arbitrary lengths. ARVAE introduces a temporal-spatial decoupled representation that combines downsampled flow field for temporal coherence with spatial relative compensation for newly emerged content, achieving high compression efficiency without information loss. Specifically, the encoder compresses the current and previous frames into the temporal motion and spatial supplement, while the decoder reconstructs the original frame from the latent representations given the preceding frame. A multi-stage training strategy is employed to progressively optimize the model. Extensive experiments demonstrate that ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. Moreover, evaluations on video generation tasks highlight its strong potential for downstream applications.
RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection AAAI-26
Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage these reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods and achieves performance on par with recent diffusion models while using only 10% of their parameters and offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.
comment: 19 pages, 7 figures, to be published in AAAI-26
FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion AAAI-2026
Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce FilmWeaver, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: https://filmweaver.github.io
comment: AAAI-2026
Evaluating the Efficacy of Sentinel-2 versus Aerial Imagery in Serrated Tussock Classification SC
Invasive species pose major global threats to ecosystems and agriculture. Serrated tussock (Nassella trichotoma) is a highly competitive invasive grass species that disrupts native grasslands, reduces pasture productivity, and increases land management costs. In Victoria, Australia, it presents a major challenge due to its aggressive spread and ecological impact. While current ground surveys and subsequent management practices are effective at small scales, they are not feasible for landscape-scale monitoring. Although aerial imagery offers high spatial resolution suitable for detailed classification, its high cost limits scalability. Satellite-based remote sensing provides a more cost-effective and scalable alternative, though often with lower spatial resolution. This study evaluates whether multi-temporal Sentinel-2 imagery, despite its lower spatial resolution, can provide a comparable and cost-effective alternative for landscape-scale monitoring of serrated tussock by leveraging its higher spectral resolution and seasonal phenological information. A total of eleven models have been developed using various combinations of spectral bands, texture features, vegetation indices, and seasonal data. Using a random forest classifier, the best-performing Sentinel-2 model (M76*) has achieved an Overall Accuracy (OA) of 68% and an Overall Kappa (OK) of 0.55, slightly outperforming the best-performing aerial imaging model's OA of 67% and OK of 0.52 on the same dataset. These findings highlight the potential of multi-seasonal feature-enhanced satellite-based models for scalable invasive species classification.
comment: Accepted in Earthsense 2025 (IEEE INTERNATIONAL CONFERENCE ON NEXT-GEN TECHNOLOGIES OF ARTIFICIAL INTELLIGENCE AND GEOSCIENCE REMOTE SENSING)
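For readers comparing the OA and OK figures above, the following is a minimal sketch of how the two metrics relate to a confusion matrix. It assumes that "Overall Kappa" refers to Cohen's kappa, which the abstract does not state explicitly. The function names and the 2x2 matrix are hypothetical illustrations, not data from the paper.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def cohens_kappa(confusion):
    """Cohen's kappa: agreement beyond what row/column marginals give by chance."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = overall_accuracy(confusion)
    row_sums = [sum(confusion[i]) for i in range(n)]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Hypothetical confusion matrix (rows: reference class, columns: predicted class).
example = [[40, 10],
           [5, 45]]
oa = overall_accuracy(example)
kappa = cohens_kappa(example)
```

Because kappa discounts chance agreement, it is lower than OA for the same matrix, which is consistent with the gap between the reported OA and OK values.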
Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers
Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive, since global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ in the sequence length $n$. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy--efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. While the Reformer achieves higher accuracy on CIFAR-10 compared to our ViT-style baseline, the ViT model consistently outperforms the Reformer in our experiments in terms of practical efficiency and end-to-end computation time across the larger and higher-resolution settings. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images.
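The closing claim about sequence length can be made concrete with a little arithmetic. The sketch below counts patch tokens and compares the asymptotic terms n^2 and n log n. The image sizes and the patch size of 16 are illustrative assumptions, and constant factors (for example hashing rounds and bucket sorting in LSH attention) are deliberately ignored, although they are what determines the practical break-even point.

```python
import math


def num_patch_tokens(height, width, patch):
    """Number of non-overlapping patch tokens (sizes assumed divisible by patch)."""
    return (height // patch) * (width // patch)


def asymptotic_ratio(n):
    """n^2 / (n * log2 n) = n / log2 n, with all constant factors ignored."""
    return n / math.log2(n)


# Assumed square inputs with 16x16 patches.
for side in (224, 512, 1024):
    n = num_patch_tokens(side, side, 16)
    print(f"{side}x{side}: n={n}, n^2/(n log n) ~ {asymptotic_ratio(n):.1f}")
```

Even a 1024x1024 image yields only 4096 tokens under these assumptions, so the idealized ratio stays in the hundreds and can plausibly be offset by per-token overhead.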
PersonaLive! Expressive Portrait Image Animation for Live Streaming
Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.
Task-Aware Multi-Expert Architecture For Lifelong Deep Learning
Lifelong deep learning (LDL) trains neural networks to learn sequentially across tasks while preserving prior knowledge. We propose Task-Aware Multi-Expert (TAME), a continual learning algorithm that leverages task similarity to guide expert selection and knowledge transfer. TAME maintains a pool of pretrained neural networks and activates the most relevant expert for each new task. A shared dense layer integrates features from the chosen expert to generate predictions. To reduce catastrophic forgetting, TAME uses a replay buffer that stores representative samples and embeddings from previous tasks and reuses them during training. An attention mechanism further prioritizes the most relevant stored information for each prediction. Together, these components allow TAME to adapt flexibly while retaining important knowledge across evolving task sequences. Experiments on binary classification tasks derived from CIFAR-100 show that TAME improves accuracy on new tasks while sustaining performance on earlier ones, highlighting its effectiveness in balancing adaptation and retention in lifelong learning settings.
Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition AAAI 2026
Incomplete multi-modal emotion recognition (IMER) aims at understanding human intentions and sentiments by comprehensively exploring partially observed multi-source data. Although multi-modal data is expected to provide more abundant information, the performance gap and modality under-optimization problem hinder effective multi-modal learning in practice, and are exacerbated when data are missing. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality's performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model's efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.
comment: Accepted by AAAI 2026
WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering
Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior art in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released at https://yxuhan.github.io/WildCap/index.html.
comment: Technical report. project page: https://yxuhan.github.io/WildCap/index.html; code: https://github.com/yxuhan/WildCap
RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
comment: 20 pages, 6 figures
REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
Diffusion models have significantly advanced the field of talking head generation. However, the slow inference speeds and non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles to key-value caching for maintaining temporal consistency and identity coherence during long-time streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency, which leverages a non-streaming teacher with an asynchronous noise schedule to supervise the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
comment: 10 pages, 4 figures
FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model
In autonomous driving, end-to-end planners learn scene representations from raw sensor data and utilize them to generate a motion plan or control actions. However, exclusive reliance on the current scene for motion planning may result in suboptimal responses in highly dynamic traffic environments where ego actions further alter the future scene. To model the evolution of future scenes, we leverage the World Model to represent how the ego vehicle and its environment interact and change over time, which entails complex reasoning. The Chain of Thought (CoT) offers a promising solution by forecasting a sequence of future thoughts that subsequently guide trajectory refinement. In this paper, we propose FutureX, a CoT-driven pipeline that enhances end-to-end planners to perform complex motion planning via future scene latent reasoning and trajectory refinement. Specifically, the Auto-think Switch examines the current scene and decides whether additional reasoning is required to yield a higher-quality motion plan. Once FutureX enters the Thinking mode, the Latent World Model conducts a CoT-guided rollout to predict future scene representation, enabling the Summarizer Module to further refine the motion plan. Otherwise, FutureX operates in an Instant mode to generate motion plans in a forward pass for relatively simple scenes. Extensive experiments demonstrate that FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency, thereby achieving substantial overall performance gains, e.g., 6.2 PDMS improvement for TransFuser on NAVSIM. Code will be released.
VFMF: World Modeling by Forecasting Vision Foundation Model Features
Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.
Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires experienced tuning and data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. Information-theoretic analysis formally validates our effectiveness in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page is available at: https://xukechun.github.io/papers/BayesVLA.
SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection WACV 2026
Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.
comment: Accepted to WACV 2026
AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path
Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. However, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.
CADKnitter: Compositional CAD Generation from Text and Geometry Guidance
Crafting computer-aided design (CAD) models has long been a painstaking and time-intensive task, demanding both precision and expertise from designers. With the emergence of 3D generation, this task has been transformed: the focus has shifted from visual fidelity to functional utility, and editable CAD designs have become possible. Prior works have achieved early success in single-part CAD generation, which is not well-suited for real-world applications, as multiple parts need to be assembled under semantic and geometric constraints. In this paper, we propose CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy. CADKnitter is able to generate a complementary CAD part that follows both the geometric constraints of the given CAD model and the semantic constraints of the desired design text prompt. We also curate a dataset, called KnitCAD, containing over 310,000 samples of CAD models, along with textual prompts and assembly metadata that provide semantic and geometric constraints. Extensive experiments demonstrate that our proposed method outperforms other state-of-the-art baselines by a clear margin.
Beyond Memorization: Gradient Projection Enables Selective Learning in Diffusion Models
Memorization in large-scale text-to-image diffusion models poses significant security and intellectual property risks, enabling adversarial attribute extraction and the unauthorized reproduction of sensitive or proprietary features. While conventional dememorization techniques, such as regularization and data filtering, limit overfitting to specific training examples, they fail to systematically prevent the internalization of prohibited concept-level features. Simply discarding all images containing a sensitive feature wastes invaluable training data, necessitating a method for selective unlearning at the concept level. To address this, we introduce a Gradient Projection Framework designed to enforce a stringent requirement of concept-level feature exclusion. Our defense operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Specifically, we project each gradient update onto the orthogonal complement of the sensitive feature's embedding space, thereby zeroing out its influence on the model's weights. Our method integrates seamlessly into standard diffusion model training pipelines and complements existing defenses. We analyze our method against an adversary aiming for feature extraction. In extensive experiments, we demonstrate that our framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. By reframing memorization control as selective learning, our approach establishes a new paradigm for IP-safe and privacy-preserving generative AI.
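The projection step described above can be illustrated with a minimal sketch for the simplest case, a single prohibited direction. Representing the gradient and the concept embedding as plain vectors in the same space is an assumption made here for illustration (the abstract speaks of an embedding space, which may have several dimensions, and does not say how gradients are mapped into it). The function name `project_out` is hypothetical.

```python
def project_out(grad, concept):
    """Return grad minus its component along `concept`.

    This is the projection onto the orthogonal complement of a single
    direction: g - (<g, c> / <c, c>) * c.
    """
    dot_gc = sum(g * c for g, c in zip(grad, concept))
    dot_cc = sum(c * c for c in concept)
    if dot_cc == 0.0:
        # A zero vector defines no direction, so there is nothing to remove.
        return list(grad)
    scale = dot_gc / dot_cc
    return [g - scale * c for g, c in zip(grad, concept)]


# Hypothetical 3-dimensional example.
grad = [3.0, 4.0, 0.0]
concept = [1.0, 0.0, 0.0]
projected = project_out(grad, concept)
residual = sum(p * c for p, c in zip(projected, concept))
```

After projection the update has zero inner product with the concept direction, while the component orthogonal to it is left unchanged.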
Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization ICCV25
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
comment: BinEgo360@ICCV25
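The idea of casting temporal action localization as classification of fixed-length, non-overlapping intervals with a background class can be sketched as a label-assignment step. The rule used here (take the action with the largest overlap, and fall back to background when it covers less than half of the interval) is an assumption for illustration, as are the function name, the `min_cover` parameter, and the example segments. The abstract does not specify how interval labels are derived.

```python
import math

BACKGROUND = "background"


def label_intervals(duration, interval_len, segments, min_cover=0.5):
    """Assign one label to each fixed-length interval of a video.

    segments: list of (start, end, label) action annotations in seconds.
    An interval takes the label of the action overlapping it most, unless
    that overlap is below min_cover of the interval, in which case it is
    labeled as background.
    """
    labels = []
    for i in range(math.ceil(duration / interval_len)):
        start = i * interval_len
        end = min(start + interval_len, duration)
        best_label, best_overlap = BACKGROUND, 0.0
        for seg_start, seg_end, seg_label in segments:
            overlap = min(end, seg_end) - max(start, seg_start)
            if overlap > best_overlap:
                best_label, best_overlap = seg_label, overlap
        if best_overlap < min_cover * (end - start):
            best_label = BACKGROUND
        labels.append(best_label)
    return labels


# Hypothetical 10-second clip split into 2-second intervals.
segments = [(1.0, 5.0, "pour"), (7.5, 10.0, "stir")]
labels = label_intervals(10.0, 2.0, segments)
```

Under these assumptions a classification backbone only has to predict one class per interval, and consecutive intervals with the same non-background label can be merged back into localized segments.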
Lightweight 3D Gaussian Splatting Compression via Video Codec
Current video-based GS compression methods rely on Parallel Linear Assignment Sorting (PLAS) to convert 3D GS into smooth 2D maps, which is computationally expensive and time-consuming, limiting the application of GS on lightweight devices. In this paper, we propose a Lightweight 3D Gaussian Splatting (GS) Compression method based on Video codec (LGSCV). First, a two-stage Morton scan is proposed to generate blockwise 2D maps that are friendly for canonical video codecs in which the coding units (CU) are square blocks. A 3D Morton scan is used to permute GS primitives, followed by a 2D Morton scan to map the ordered GS primitives to 2D maps in a blockwise style. However, although the blockwise 2D maps report close performance to the PLAS map in high-bitrate regions, they show a quality collapse at medium-to-low bitrates. Therefore, a principal component analysis (PCA) is used to reduce the dimensionality of spherical harmonics (SH), and a MiniPLAS, which is flexible and fast, is designed to permute the primitives within certain block sizes. Incorporating SH PCA and MiniPLAS leads to a significant gain in rate-distortion (RD) performance, especially at medium and low bitrates. MiniPLAS can also guide the setting of the codec CU size configuration and significantly reduce encoding time. Experimental results on the MPEG dataset demonstrate that the proposed LGSCV achieves over 20% RD gain compared with state-of-the-art methods, while reducing 2D map generation time to approximately 1 second and cutting encoding time by 50%. The code is available at https://github.com/Qi-Yangsjtu/LGSCV .
comment: Accepted by DCC2026 Oral
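The two-stage Morton scan described in the abstract above can be illustrated with a small sketch: primitives are first ordered along a 3D Z-order curve, then written into a square 2D map along a 2D Z-order curve. This is a generic illustration of Morton ordering, not the LGSCV code; the function names, the 10-bit quantization, and the use of `-1` for empty cells are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the bits of three non-negative integers into one Morton code."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code


def morton2d_decode(index: int, bits: int = 16) -> Tuple[int, int]:
    """Split a Morton index into (row, col) by de-interleaving its bits."""
    row = col = 0
    for b in range(bits):
        col |= ((index >> (2 * b)) & 1) << b
        row |= ((index >> (2 * b + 1)) & 1) << b
    return row, col


def two_stage_morton(
    positions: Sequence[Tuple[float, float, float]], side: int, bits: int = 10
) -> List[List[int]]:
    """Order points along a 3D Z-curve, then place them on a side x side map along a 2D Z-curve."""
    # The map side must be a power of two and large enough to hold every point.
    assert side & (side - 1) == 0 and len(positions) <= side * side
    lo = [min(p[a] for p in positions) for a in range(3)]
    hi = [max(p[a] for p in positions) for a in range(3)]
    scale = (1 << bits) - 1

    def quantize(p):
        # Map each coordinate to an integer grid cell in [0, 2**bits - 1].
        return tuple(
            int((p[a] - lo[a]) / (hi[a] - lo[a]) * scale) if hi[a] > lo[a] else 0
            for a in range(3)
        )

    order = sorted(
        range(len(positions)),
        key=lambda i: morton3d(*quantize(positions[i]), bits=bits),
    )
    grid = [[-1] * side for _ in range(side)]  # -1 marks an empty cell
    for rank, idx in enumerate(order):
        r, c = morton2d_decode(rank)
        grid[r][c] = idx
    return grid
```

With this layout, every aligned 2^k x 2^k block of the map holds a run of consecutive primitives from the 3D ordering, which is what makes the map blockwise and a natural fit for square coding units.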
SOF: Sorted Opacity Fields for Fast Unbounded Surface Reconstruction SIGGRAPH
Recent advances in 3D Gaussian representations have significantly improved the quality and efficiency of image-based scene reconstruction. Their explicit nature facilitates real-time rendering and fast optimization, yet extracting accurate surfaces - particularly in large-scale, unbounded environments - remains a difficult task. Many existing methods rely on approximate depth estimates and global sorting heuristics, which can introduce artifacts and limit the fidelity of the reconstructed mesh. In this paper, we present Sorted Opacity Fields (SOF), a method designed to recover detailed surfaces from 3D Gaussians with both speed and precision. Our approach improves upon prior work by introducing hierarchical resorting and a robust formulation of Gaussian depth, which better aligns with the level-set. To enhance mesh quality, we incorporate a level-set regularizer operating on the opacity field and introduce losses that encourage geometrically-consistent primitive shapes. In addition, we develop a parallelized Marching Tetrahedra algorithm tailored to our opacity formulation, reducing meshing time by up to an order of magnitude. As demonstrated by our quantitative evaluation, SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three. These results mark a step forward in turning efficient Gaussian-based rendering into equally efficient geometry extraction.
comment: SIGGRAPH Asia 2025; Project Page: https://r4dl.github.io/SOF/
Efficient Action Counting with Dynamic Queries
Temporal repetition counting aims to quantify the repeated action cycles within a video. The majority of existing methods rely on the similarity correlation matrix to characterize the repetitiveness of actions, but their scalability is hindered due to the quadratic computational complexity. In this work, we introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity. Based on this representation, we further develop two key components to tackle the essential challenges of temporal repetition counting. Firstly, to facilitate open-set action counting, we propose the dynamic update scheme on action queries. Unlike static action queries, this approach dynamically embeds video features into action queries, offering a more flexible and generalizable representation. Secondly, to distinguish between actions of interest and background noise actions, we incorporate inter-query contrastive learning to regularize the video representations corresponding to different action queries. As a result, our method significantly outperforms previous works, particularly in terms of long video sequences, unseen actions, and actions at various speeds. On the challenging RepCountA benchmark, we outperform the state-of-the-art method TransRAC by 26.5% in OBO accuracy, with a 22.7% mean error decrease and 94.1% computational burden reduction. Code is available at https://github.com/lizishi/DeTRC.
comment: code: https://github.com/lizishi/DeTRC, proj page: https://shirleymaxx.github.io/DeTRC/
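The abstract above reports OBO accuracy and a mean error. As these metrics are commonly defined in the repetition counting literature, OBO (off-by-one) accuracy is the fraction of videos whose predicted count is within one of the ground truth, and the mean error is the absolute count error normalized by the ground-truth count. The sketch below follows those common definitions; the abstract itself does not spell them out, so the exact formulas and function names are assumptions.

```python
from typing import Sequence


def obo_accuracy(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Fraction of videos whose predicted count is within one of the ground truth."""
    hits = sum(1 for p, g in zip(pred, gt) if abs(p - g) <= 1)
    return hits / len(gt)


def normalized_mae(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over videos; assumes every ground-truth count is positive."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)
```

For example, with predictions `[4, 7, 10, 3]` and ground truth `[5, 7, 13, 3]`, three of the four videos are within one repetition, so the OBO accuracy is 0.75.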
Visual-Friendly Concept Protection via Selective Adversarial Perturbations AAAI
Personalized concept generation by tuning diffusion models with a few images raises potential legal and ethical concerns regarding privacy and intellectual property rights. Researchers attempt to prevent malicious personalization using adversarial perturbations. However, previous efforts have mainly focused on the effectiveness of protection while neglecting the visibility of perturbations. They utilize global adversarial perturbations, which introduce noticeable alterations to original images and significantly degrade visual quality. In this work, we propose the Visual-Friendly Concept Protection (VCPro) framework, which prioritizes the protection of key concepts chosen by the image owner through adversarial perturbations with lower perceptibility. To ensure these perturbations are as inconspicuous as possible, we introduce a relaxed optimization objective to identify the least perceptible yet effective adversarial perturbations, solved using the Lagrangian multiplier method. Qualitative and quantitative experiments validate that VCPro achieves a better trade-off between the visibility of perturbations and protection effectiveness, effectively prioritizing the protection of target concepts in images with less perceptible perturbations.
comment: AAAI AISI 26
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
Text-driven video editing aims to modify video content based on natural language instructions. While recent training-free methods have leveraged pretrained diffusion models, they often rely on an inversion-editing paradigm. This paradigm maps the video to a latent space before editing. However, the inversion process is not perfectly accurate, often compromising appearance fidelity and motion consistency. To address this, we introduce FlowDirector, a novel training-free and inversion-free video editing framework. Our framework models the editing process as a direct evolution in the data space. It guides the video to transition smoothly along its inherent spatio-temporal manifold using an ordinary differential equation (ODE), thereby avoiding the inaccurate inversion step. From this foundation, we introduce three flow correction strategies for appearance, motion, and stability: 1) Direction-aware flow correction amplifies components that oppose the source direction and removes irrelevant terms, breaking conservative streamlines and enabling stronger structural and textural changes. 2) Motion-appearance decoupling optimizes motion agreement as an energy term at each timestep, significantly improving consistency and motion transfer. 3) Differential averaging guidance strategy leverages differences among multiple candidate flows to approximate a low variance regime at low cost, suppressing artifacts and stabilizing the trajectory. Extensive experiments across various editing tasks and benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction following, temporal consistency, and background preservation, establishing an efficient new paradigm for coherent video editing without inversion.
comment: Project Page is https://flowdirector-edit.github.io
Open-World Object Counting in Videos AAAI 2026
We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and objects of similar appearance, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model, to enable automated open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for this novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://www.robots.ox.ac.uk/~vgg/research/countvid/.
comment: AAAI 2026
Unconsciously Forget: Mitigating Memorization; Without Knowing What is being Memorized
Recent advances in generative models have demonstrated an exceptional ability to produce highly realistic images. However, previous studies show that generated images often resemble the training data, and this problem becomes more severe as the model size increases. Memorizing training data can lead to legal challenges, including copyright infringement, violations of portrait rights, and trademark violations. Existing approaches to mitigating memorization mainly focus on manipulating the denoising sampling process to steer image embeddings away from the memorized embedding space or employ unlearning methods that require training on datasets containing specific sets of memorized concepts. However, existing methods often incur substantial computational overhead during sampling, or focus narrowly on removing one or more groups of target concepts, imposing a significant limitation on their scalability. To understand and mitigate these problems, our work, UniForget, offers a new perspective on understanding the root cause of memorization. Our work demonstrates that specific parts of the model are responsible for copyrighted content generation. By applying model pruning, we can effectively suppress the probability of generating copyrighted content without targeting specific concepts while preserving the general generative capabilities of the model. Additionally, we show that our approach is both orthogonal and complementary to existing unlearning methods, thereby highlighting its potential to improve current unlearning and de-memorization techniques.
Exploring Diffusion with Test-Time Training on Efficient Image Restoration
Image restoration faces challenges including ineffective feature fusion, computational bottlenecks and inefficient diffusion processes. To address these, we propose DiffRWKVIR, a novel framework unifying Test-Time Training (TTT) with efficient diffusion. Our approach introduces three key innovations: (1) Omni-Scale 2D State Evolution extends RWKV's location-dependent parameterization to hierarchical multi-directional 2D scanning, enabling global contextual awareness with linear complexity O(L); (2) Chunk-Optimized Flash Processing accelerates intra-chunk parallelism by 3.2x via contiguous chunk processing (O(LCd) complexity), reducing sequential dependencies and computational overhead; (3) Prior-Guided Efficient Diffusion extracts a compact Image Prior Representation (IPR) in only 5-20 steps, proving 45% faster training/inference than DiffIR while solving computational inefficiency in denoising. Evaluated across super-resolution and inpainting benchmarks (Set5, Set14, BSD100, Urban100, Places365), DiffRWKVIR outperforms SwinIR, HAT, and MambaIR/v2 in PSNR, SSIM, LPIPS, and efficiency metrics. Our method establishes a new paradigm for adaptive, high-efficiency image restoration with optimized hardware utilization.
comment: We withdraw this paper due to erroneous experiment data in the ablation study, which was inadvertently copied from our preprint "Ultra-Lightweight Semantic-Injected Imagery Super-Resolution for Real-Time UAV Remote Sensing". This nearly constituted academic misconduct. We sincerely apologize and thank those who alerted us.
3D-LATTE: Latent Space 3D Editing from Textual Instructions
Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lags surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods, enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Our project webpage is https://mparelli.github.io/3d-latte
Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration
Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
Free-Lunch Color-Texture Disentanglement for Stylized Image Generation NeurIPS2025
Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task. Code is released at https://deepffff.github.io/sadis.github.io/.
comment: Accepted by NeurIPS2025. Code is available at https://deepffff.github.io/sadis.github.io/
SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. The current Language Gaussian Splatting line of work falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most of them are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, limiting ability and insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our codes, benchmark, and datasets are released at https://scenesplatpp.gaussianworld.ai/.
comment: 15 pages, codes, data and benchmark are released
Enhancing Object Discovery for Unsupervised Instance Segmentation and Object Detection
We propose Cut-Once-and-LEaRn (COLER), a simple approach for unsupervised instance segmentation and object detection. COLER first uses our developed CutOnce to generate coarse pseudo labels, then enables the detector to learn from these masks. CutOnce applies Normalized Cut (NCut) only once and does not rely on any clustering methods (e.g., K-Means), but it can generate multiple object masks in an image. Our work opens a new direction for the NCut algorithm in multi-object segmentation. We have designed several novel yet simple modules that not only allow CutOnce to fully leverage the object discovery capabilities of self-supervised models, but also free it from reliance on mask post-processing. During training, COLER achieves strong performance without requiring specially designed loss functions for pseudo labels, and its performance is further improved through self-training. COLER is a zero-shot unsupervised model that outperforms previous state-of-the-art methods on multiple benchmarks. We believe our method can help advance the field of unsupervised object localization. Code is available at: https://github.com/Quantumcraft616/COLER.
Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model
Denoising diffusion models have emerged as a dominant paradigm in image generation. Discretizing image data into tokens is a critical step for effectively integrating images with Transformer and other architectures. Although the Denoising Diffusion Codebook Models (DDCM) pioneered the use of pre-trained diffusion models for image tokenization, it strictly relies on the traditional discrete-time DDPM architecture. Consequently, it fails to adapt to modern continuous-time variants, such as Flow Matching and Consistency Models, and suffers from inefficient sampling in high-noise regions. To address these limitations, this paper proposes the Generalized Denoising Diffusion Codebook Models (gDDCM). We establish a unified theoretical framework and introduce a generic "De-noise and Back-trace" sampling strategy. By integrating a deterministic ODE denoising step with a residual-aligned noise injection step, our method resolves the challenge of adaptation. Furthermore, we introduce a backtracking parameter $p$ and significantly enhance tokenization ability. Extensive experiments on CIFAR10 and LSUN Bedroom datasets demonstrate that gDDCM achieves comprehensive compatibility with mainstream diffusion variants and significantly outperforms DDCM in terms of reconstruction quality and perceptual fidelity.
comment: in Chinese language
SpecDETR: A transformer-based hyperspectral point object detection network
Hyperspectral target detection (HTD) aims to identify specific materials based on spectral information in hyperspectral imagery and can detect extremely small-sized objects, some of which occupy an area smaller than one pixel. However, existing HTD methods are developed based on per-pixel binary classification, neglecting the three-dimensional cube structure of hyperspectral images (HSIs) that integrates both spatial and spectral dimensions. The synergistic existence of spatial and spectral features in HSIs enables objects to simultaneously exhibit both, yet the per-pixel HTD framework limits the joint expression of these features. In this paper, we rethink HTD from the perspective of spatial-spectral synergistic representation and propose hyperspectral point object detection as an innovative task framework. We introduce SpecDETR, the first specialized network for hyperspectral multi-class point object detection, which eliminates dependence on pre-trained backbone networks commonly required by vision-based object detectors. SpecDETR uses a multi-layer Transformer encoder with self-excited subpixel-scale attention modules to directly extract deep spatial-spectral joint features from hyperspectral cubes. We develop a simulated hyperspectral point object detection benchmark termed SPOD, and for the first time, evaluate and compare the performance of visual object detection networks and HTD methods on hyperspectral point object detection. Extensive experiments demonstrate that our proposed SpecDETR outperforms SOTA visual object detection networks and HTD methods. Our code and dataset are available at https://github.com/ZhaoxuLi123/SpecDETR.
Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning AAAI 2026
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
comment: Accepted at AAAI 2026, the Project website is available at https://qhemu.github.io/CCoL/
Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion
Holistic 3D modeling of molecularly defined brain structures is crucial for understanding complex brain functions. Using emerging tissue profiling technologies, researchers charted comprehensive atlases of mammalian brain with sub-cellular resolution and spatially resolved transcriptomic data. However, these tera-scale volumetric atlases pose computational challenges for modeling intricate brain structures within the native spatial context. We propose Tera-MIND, a novel generative framework capable of simulating Tera-scale Mouse braINs in 3D using a patch-based and boundary-aware Diffusion model. Taking spatial gene expression as conditional input, we generate virtual mouse brains with comprehensive cellular morphological detail at teravoxel scale. Through the lens of 3D gene-gene self-attention, we identify spatial molecular interactions for key transcriptomic pathways, including glutamatergic and dopaminergic neuronal systems. Lastly, we showcase the translational applicability of Tera-MIND on previously unseen human brain samples. Tera-MIND offers an efficient generative modeling of whole virtual organisms, paving the way for integrative applications in biomedical research. Project website: https://musikisomorphie.github.io/Tera-MIND.html
Equivariant symmetry-aware head pose estimation for fetal MRI
We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.
Multimodal Learning for Scalable Representation of High-Dimensional Medical Data
Integrating artificial intelligence (AI) with healthcare data is rapidly transforming medical diagnostics and driving progress toward precision medicine. However, effectively leveraging multimodal data, particularly digital pathology whole slide images (WSIs) and genomic sequencing, remains a significant challenge due to the intrinsic heterogeneity of these modalities and the need for scalable and interpretable frameworks. Existing diagnostic models typically operate on unimodal data, overlooking critical cross-modal interactions that can yield richer clinical insights. We introduce MarbliX (Multimodal Association and Retrieval with Binary Latent Indexed matriX), a self-supervised framework that learns to embed WSIs and immunogenomic profiles into compact, scalable binary codes, termed "monogram." By optimizing a triplet contrastive objective across modalities, MarbliX captures high-resolution patient similarity in a unified latent space, enabling efficient retrieval of clinically relevant cases and facilitating case-based reasoning. In lung cancer, MarbliX achieves 85-89% across all evaluation metrics, outperforming histopathology (69-71%) and immunogenomics (73-76%). In kidney cancer, real-valued monograms yield the strongest performance (F1: 80-83%, Accuracy: 87-90%), with binary monograms slightly lower (F1: 78-82%).
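To illustrate why compact binary codes make case retrieval efficient, the sketch below binarizes real-valued embeddings and ranks stored cases by Hamming distance. It is a generic illustration, not the MarbliX pipeline: sign thresholding, the function names, and the case identifiers are all assumptions, since the abstract does not state how the codes are produced or compared.

```python
from typing import Dict, List, Sequence


def binarize(embedding: Sequence[float]) -> int:
    """Pack the signs of a real-valued vector into an integer bit code (bit i is 1 where value > 0)."""
    code = 0
    for i, v in enumerate(embedding):
        if v > 0:
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query: int, database: Dict[str, int], k: int = 1) -> List[str]:
    """Return the k stored case ids closest to the query in Hamming distance (ties broken by id)."""
    return sorted(database, key=lambda name: (hamming(query, database[name]), name))[:k]
```

Comparing two codes costs one XOR and a bit count, which is why binary representations scale well to large case archives.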
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models WACV 2026
Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
comment: Accepted to WACV 2026. Project page available at: https://vader-vau.github.io/
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
Spec-Gloss Surfels and Normal-Diffuse Priors for Relightable Glossy Objects
Accurate reconstruction and relighting of glossy objects remains a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restrict faithful material recovery and limit relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. A coarse-to-fine environment map optimization accelerates convergence, and negative-only environment map clipping preserves high-dynamic-range specular reflections. Extensive experiments on complex, glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction, delivering substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods.
Defense That Attacks: How Robust Models Become Better Attackers
Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduces a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.
MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression PRICAI 2025
Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).
comment: Received Best Paper Award at the 2025 Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025)
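For readers unfamiliar with the three metrics named in the MMAP abstract, here is a minimal sketch of how MAE, MSE, and PCC can be computed for one gene across spots. The function names are illustrative, and how the paper aggregates these values across genes or slides is not stated in the abstract.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def mse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def pcc(pred: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    denom = math.sqrt(var_p * var_t)
    return cov / denom if denom > 0 else 0.0
```

Note that PCC is invariant to scale and offset, so a prediction can correlate perfectly with the target while still having large MAE and MSE, which is one reason to report all three.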
Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering
Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a textual modification instruction to find relevant target images. Some existing methods use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited: existing approaches either compress visual information into text or rely on elaborate prompt designs. Moreover, existing works only use CoT for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhance the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into the LVLMs. This approach allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
Spatiotemporal image generation, which synthesizes future scenes conditioned on given observations, is a highly meaningful task. However, existing change generation methods can only handle event-driven changes (e.g., new buildings) and fail to model cross-temporal variations (e.g., seasonal shifts). In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. Given pre-event images and multimodal event controls, ChangeBridge generates post-event scenes that are both spatially and temporally coherent. The core idea is a drift-asynchronous diffusion bridge. Specifically, it consists of three main modules: a) Composed Bridge Initialization, which replaces noise initialization and starts the diffusion from a composed pre-event state, modeling a diffusion bridge process. b) Asynchronous Drift Diffusion, which uses a pixel-wise drift map to assign different drift magnitudes to event-driven and temporal evolution, enabling differentiated generation during the pre-to-post transition. c) Drift-Aware Denoising, which embeds the drift map into the denoising network, guiding drift-aware reconstruction. Experiments show that ChangeBridge generates scenes with better cross-spatiotemporal alignment than state-of-the-art methods. Additionally, ChangeBridge shows great potential for land-use planning and as a data generation engine for a series of change detection tasks.
MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, adoption of LMMs in real-world tasks is hindered by their poor performance in tasks that require a combination of VL capabilities, as well as in tasks that involve the grounding of complex text or visual instructions. To thoroughly investigate this gap and its underlying causes, we propose MOAT, a diverse benchmark with 1005 complex real-world vision questions that are straightforward for humans but challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 9 VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential for many real-world applications. We evaluated 17 proprietary and open source LMMs, finding that the best performing LMM (Gemini 2.5 Pro) achieved only 44% accuracy, far below what would be acceptable in real-world applications. To guide future model development, we analyze common trends in our results and discuss the underlying causes of poor performance, focusing on the impact of text-centric reasoning, which VL capabilities form bottlenecks in complex tasks, and the potential harmful effects of tiling. Code and data are available at https://cambrian-yzt.github.io/MOAT/.
comment: Project page: https://cambrian-yzt.github.io/MOAT
InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Within this representation, we additionally devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It produces coherent, physically plausible, and semantically faithful multi-agent behaviors from text prompts alone. Our code and data will be released to facilitate future research.
comment: Project page: https://binlee26.github.io/InterAgent-Page
COSMO-INR: Complex Sinusoidal Modulation for Implicit Neural Representations ICLR 2026
Implicit neural representations (INRs) are a powerful paradigm for modeling data, offering a continuous alternative to discrete signal representations. Their ability to compactly encode complex signals has led to strong performance in many vision tasks. Prior work shows INR performance is highly sensitive to the choice of activation function in the underlying multilayer perceptron, yet the theoretical reasons remain unclear. Key limitations also persist, including spectral bias (reduced sensitivity to high-frequency content), limited robustness to noise, and difficulty capturing local and global structure jointly. We analyze INR signal representation using harmonic analysis and Chebyshev polynomials. We prove that modulating activation functions with a complex sinusoidal term yields richer and more complete spectral support throughout the network. Building on this, we introduce a new activation function tailored to INRs and validate our theory using Chebyshev analysis and extensive experiments. We additionally use a regularized deep prior, extracted from a task-specific model, to adapt the activations, further improving convergence speed and stability. Across image reconstruction (average PSNR gain of +5.67 dB over the nearest counterpart on a diverse dataset), denoising (+0.46 dB PSNR), super-resolution (+0.64 dB over the nearest SOTA method for 6X upscaling), inpainting, and 3D shape reconstruction, our activation consistently outperforms existing state-of-the-art alternatives.
comment: Submitted as a conference paper to ICLR 2026
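The PSNR gains quoted in the COSMO-INR abstract can be put in context with the standard definition, PSNR = 10 log10(MAX^2 / MSE). The sketch below implements that definition and converts a dB gain into the equivalent MSE reduction factor; under this definition, a +5.67 dB gain corresponds to roughly a 3.7x lower MSE. The function names and the default peak value of 1.0 are assumptions for illustration.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], reconstruction: Sequence[float]) -> float:
    """Mean squared error between two flattened signals of equal length."""
    if len(reference) != len(reconstruction):
        raise ValueError("signals must have the same length")
    return sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for a perfect reconstruction."""
    err = mse(reference, reconstruction)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```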
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions that provide backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate that our system significantly improves the controllability and video quality of existing video generation models across various aspects. Project Page: https://sqwu.top/Any2Cap/
comment: Project Page: https://sqwu.top/Any2Cap/
Advancing Weakly-Supervised Change Detection in Satellite Images via Adversarial Class Prompting
Weakly-Supervised Change Detection (WSCD) aims to distinguish specific object changes (e.g., objects appearing or disappearing) from background variations (e.g., environmental changes due to light, weather, or seasonal shifts) in paired satellite images, relying only on paired image (i.e., image-level) classification labels. This technique significantly reduces the need for dense annotations required in fully-supervised change detection. However, as image-level supervision only indicates whether objects have changed in a scene, WSCD methods often misclassify background variations as object changes, especially in complex remote-sensing scenarios. In this work, we propose an Adversarial Class Prompting (AdvCP) method to address this co-occurring noise problem. It comprises two phases: a) Adversarial Prompt Mining: After each training iteration, we introduce adversarial prompting perturbations, using incorrect one-hot image-level labels to activate erroneous feature mappings. This process reveals co-occurring adversarial samples under weak supervision, namely background variation features that are likely to be misclassified as object changes. b) Adversarial Sample Rectification: We integrate these adversarially prompt-activated pixel samples into training by constructing an online global prototype. This prototype is built from an exponentially weighted moving average of the current batch and all historical training data. Our AdvCP can be seamlessly integrated into current WSCD methods without additional inference cost. Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines demonstrate significant performance enhancements. Furthermore, we demonstrate the generalizability of AdvCP to other multi-class weakly-supervised dense prediction scenarios. Code is available at https://github.com/zhenghuizhao/AdvCP
comment: Accepted by IEEE Transactions on Image Processing
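The online global prototype in the AdvCP abstract is described as an exponentially weighted moving average over the current batch and past training data. A minimal sketch of such an update follows; the momentum value of 0.9, the use of a plain batch mean, and the function names are assumptions for illustration and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def batch_mean(features: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty batch of equally sized feature vectors."""
    if not features:
        raise ValueError("batch must not be empty")
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def ema_update(prototype: Sequence[float],
               batch_features: Sequence[Sequence[float]],
               momentum: float = 0.9) -> List[float]:
    """Blend the running prototype with the mean of the current batch.

    A momentum close to 1 keeps more of the history; 0.9 is a hypothetical
    default chosen only for this sketch.
    """
    mean = batch_mean(batch_features)
    return [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, mean)]
```

Because each update rescales the previous prototype by the momentum, older batches contribute with geometrically decaying weight, which is what lets a single vector summarize the whole training history without storing it.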
Temporal In-Context Fine-Tuning with Temporal Reasoning for Versatile Control of Video Diffusion Models
Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/
comment: project page: https://kinam0252.github.io/TIC-FT/
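The temporal layout described in the TIC-FT abstract (condition frames, then buffer frames with progressively increasing noise, then target frames) can be sketched as a per-frame noise-level schedule. The linear ramp, the choice of keeping condition frames clean at level 0.0, and the function name are assumptions for illustration; the abstract does not specify the exact schedule.

```python
from typing import List


def frame_noise_levels(n_cond: int, n_buffer: int, n_target: int,
                       target_level: float) -> List[float]:
    """Noise level per frame along the concatenated temporal axis.

    Assumptions for this sketch: condition frames stay clean (0.0), buffer
    frames ramp linearly toward the target level, and all target frames
    share the same level.
    """
    if min(n_cond, n_buffer, n_target) < 0:
        raise ValueError("frame counts must be non-negative")
    levels = [0.0] * n_cond
    for i in range(1, n_buffer + 1):
        levels.append(target_level * i / (n_buffer + 1))
    levels.extend([target_level] * n_target)
    return levels
```

For example, two condition frames, three buffer frames, and two target frames at level 1.0 give a schedule that rises in even steps between the clean and fully noised segments, avoiding an abrupt jump at the boundary.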
BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.
Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models
We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
The Finer the Better: Towards Granular-aware Open-set Domain Generalization
Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still face a dilemma between the structural risk of known classes and the open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
comment: 9 pages, 3 figures, AAAI 2026
Fine-grained Defocus Blur Control for Generative Image Models
Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata, or EXIF data, which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backwards through this whole process, allowing us to learn without explicit supervision to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise interactive user control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
comment: Project link: https://www.ayshrv.com/defocus-blur-gen
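The pipeline in the defocus-blur abstract depends on how aperture, focus distance, and scene depth determine blur size. As background, the textbook thin-lens circle-of-confusion formula is sketched below; it is not the specific differentiable lens blur model used in the paper, which the abstract does not detail. The function name is illustrative, and all lengths are assumed to be in the same unit (meters here).

```python
def circle_of_confusion(focal_length: float, f_number: float,
                        focus_distance: float, subject_distance: float) -> float:
    """Blur-circle diameter on the sensor under the thin-lens model.

    c = (f / N) * f * |d - d_f| / (d * (d_f - f)), where f is the focal
    length, N the f-number, d_f the focus distance, and d the subject
    distance. All lengths share one unit.
    """
    if f_number <= 0 or subject_distance <= 0:
        raise ValueError("f_number and subject_distance must be positive")
    if focus_distance <= focal_length:
        raise ValueError("focus_distance must exceed focal_length")
    aperture_diameter = focal_length / f_number
    return (aperture_diameter * focal_length
            * abs(subject_distance - focus_distance)
            / (subject_distance * (focus_distance - focal_length)))
```

The formula makes the controllable quantities explicit: blur vanishes at the focus distance, grows with distance from it, and scales inversely with the f-number, so stopping down from f/2 to f/8 shrinks the blur circle by a factor of four.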
UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis
The concept of waterbody style transfer remains largely unexplored in the underwater imaging and vision literature. Traditional image style transfer (STx) methods primarily focus on artistic and photorealistic blending, often failing to preserve object and scene geometry in images captured in high-scattering mediums such as underwater. The wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts further complicate learning underwater image STx from unpaired data. This paper introduces UStyle, the first data-driven learning framework for transferring waterbody styles across underwater images without requiring prior reference images or scene information. We propose a novel depth-aware whitening and coloring transform (DA-WCT) mechanism that integrates physics-based waterbody synthesis to ensure perceptually consistent stylization while preserving scene structure. To enhance style transfer quality, we incorporate carefully designed loss functions that guide UStyle to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content in VGG and CLIP (contrastive language-image pretraining) feature spaces. By addressing domain-specific challenges, UStyle provides a robust framework for no-reference underwater image STx, surpassing state-of-the-art (SOTA) methods that rely solely on end-to-end reconstruction loss. Furthermore, we introduce the UF7D dataset, a curated collection of high-resolution underwater images spanning seven distinct waterbody styles, establishing a benchmark to support future research in underwater image STx. The UStyle inference pipeline and UF7D dataset are released at: https://github.com/uf-robopi/UStyle.
Annotation-Free Reinforcement Learning Query Rewriting via Verifiable Search Reward
Optimizing queries for Retrieval-Augmented Generation (RAG) systems poses a significant challenge, particularly across diverse modal indices. We introduce RL-QR, a novel annotation-free reinforcement learning framework for query rewriting that eliminates the need for costly human-annotated data. By leveraging verifiable search rewards derived from index-aligned synthetic queries, RL-QR overcomes human-annotation dependencies, extending its applicability to various modalities and index domains. Experimental results demonstrate the framework's robustness, achieving substantial retrieval performance gains of up to 3.9x on lexical retrievers and 3.5x on semantic retrievers on the MTEB VIDORE V2 benchmark for unstructured visual documents, along with consistent 5% to 10% improvements on MS MARCO v2.1 and internal industrial datasets.
Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild. We release both the dataset and code at https://github.com/xiaofei-guan/MaGRoad.
comment: v2: Corrected the abstract to accurately reflect the paper content. Updated the project link to the correct repository. No changes to the main text
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substantial computational cost hinders practical application. This limitation motivates our development of a lightweight captioning model. Our investigation begins by replacing the large-scale language component in MLLMs with a compact 125M-parameter model. Surprisingly, this compact model, despite a 93x reduction in size, achieves comparable performance to MLLMs, suggesting that factual image captioning does not significantly require the complex reasoning abilities of LLMs. Despite this promising result, our lightweight model still lacks reliability. To address this, we draw inspiration from the human visual process: perceiving a global and coarse understanding of the scene before attending to finer details. Accordingly, we propose a multimodal self-refinement framework that guides the model to utilize features from salient regions, identified by referencing the previous coarse caption, and to produce a refined description. Experimental results demonstrate the superiority of our model in both single-sentence and detailed captioning, extending even to long-range video QA tasks.
comment: Project page: https://sites.google.com/view/junha/mm-ser
Class-wise Balancing Data Replay for Federated Class-Incremental Learning NeurIPS'25
Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, its performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class-wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior tasks in a privacy-preserving manner, which then guide a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) to handle class imbalance across tasks, the task-aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model's overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verify that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
comment: NeurIPS'25 Accepted, Oral
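To illustrate why adjusting logit temperatures can temper overconfidence in majority classes, the sketch below applies a softmax with a separate temperature per class. How FedCBDR actually derives its class-level and instance-level temperatures from task dynamics is not specified in the abstract; the per-class division used here and the function name are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def temperature_softmax(logits: Sequence[float],
                        temperatures: Sequence[float]) -> List[float]:
    """Softmax over logits divided element-wise by per-class temperatures.

    Assumption for this sketch: each class has its own positive temperature,
    and a higher temperature damps that class's logit.
    """
    if len(logits) != len(temperatures):
        raise ValueError("logits and temperatures must have the same length")
    if any(t <= 0 for t in temperatures):
        raise ValueError("temperatures must be positive")
    scaled = [z / t for z, t in zip(logits, temperatures)]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With logits of 2.0 and 0.0, raising only the first class's temperature from 1.0 to 2.0 lowers its predicted probability, shifting mass toward the second class without changing the ranking.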
Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM
Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.
MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT's high quality and scalability. The code is included in the supplementary material.
MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata
Modern deep learning methods have achieved impressive results across tasks ranging from disease classification and continuous biomarker estimation to realistic medical image generation. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling NeurIPS 2025
Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL.
comment: Accepted at NeurIPS 2025
Noise Matters: Optimizing Matching Noise for Diffusion Classifiers
Although today's pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling Gaussian noise, DC uses the differences in denoising effects under different category conditions to perform classification. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different randomly sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we first explore the role of noise in DC and conclude that there are some "good noises" that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Based on both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep t, it optimizes one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that takes an image as input and outputs an image-specific noise offset. The sum of the optimized noise and the noise offset is used in DC to replace random noise. Extensive ablations on various datasets demonstrate the effectiveness of NoOp.
Conditional Text-to-Image Generation with Reference Guidance WACV 2026
Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using an additional image condition that provides visual guidance on the particular subjects the diffusion model should generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model's generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each plugin is trained with auxiliary networks and loss functions customized for applications such as English scene-text generation, multi-lingual scene-text generation, and logo-image generation. Our expert plugins, each containing only 28.55M trainable parameters, outperform existing methods on all tasks.
comment: WACV 2026
Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.52x improvement in sample efficiency and can solve challenging tasks from the ManiSkill3 benchmark that the baseline fails to learn, without modifying the underlying algorithm or hyperparameters.
comment: Project page: https://andrewcwlee.github.io/gaze-on-the-prize
Image and Video Processing
mViSE: A Visual Search Engine for Analyzing Multiplex IHC Brain Tissue Images
Whole-slide multiplex imaging of brain tissue generates massive information-dense images that are challenging to analyze and require custom software. We present an alternative query-driven programming-free strategy using a multiplex visual search engine (mViSE) that learns the multifaceted brain tissue chemoarchitecture, cytoarchitecture, and myeloarchitecture. Our divide-and-conquer strategy organizes the data into panels of related molecular markers and uses self-supervised learning to train a multiplex encoder for each panel with explicit visual confirmation of successful learning. Multiple panels can be combined to process visual queries for retrieving similar communities of individual cells or multicellular niches using information-theoretic methods. The retrievals can be used for diverse purposes, including tissue exploration, delineating brain regions and cortical cell layers, and profiling and comparing brain regions, all without computer programming. We validated mViSE's ability to retrieve single cells, proximal cell pairs, and tissue patches, and to delineate cortical layers, brain regions, and sub-regions. mViSE is provided as an open-source QuPath plug-in.
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similar performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
Particle Image Velocimetry Refinement via Consensus ADMM
Particle Image Velocimetry (PIV) is an imaging technique in experimental fluid dynamics that quantifies flow fields around bluff bodies by analyzing the displacement of neutrally buoyant tracer particles immersed in the fluid. Traditional PIV approaches typically depend on tuning parameters specific to the imaging setup, making the performance sensitive to variations in illumination, flow conditions, and seeding density. On the other hand, even state-of-the-art machine learning methods for flow quantification are fragile outside their training set. In our experiments, we observed that flow quantification would improve if different tunings (or algorithms) were applied to different regions of the same image pair. In this work, we parallelize the instantaneous flow quantification with multiple algorithms and adopt a consensus framework based on the alternating direction method of multipliers, seamlessly incorporating priors such as smoothness and incompressibility. We perform several numerical experiments to demonstrate the benefits of this approach. For instance, we achieve a decrease in end-point-error of up to 20% of a dense-inverse-search estimator at an inference rate of 60Hz, and we show how this performance boost can be increased further with outlier rejection. Our method is implemented in JAX, effectively exploiting hardware acceleration, and integrated in Flow Gym, enabling (i) reproducible comparisons with the state-of-the-art, (ii) testing different base algorithms, (iii) straightforward deployment for active fluids control applications.
comment: Code: https://github.com/antonioterpin/flowgym
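As a rough illustration of the consensus idea in the Particle Image Velocimetry abstract above, the sketch below runs consensus ADMM on a toy scalar problem: several hypothetical estimators each propose a value for one flow component, and the iterations drive them to a single agreed value. The function name `consensus_admm`, the quadratic per-estimator costs, the confidence weights, and the penalty `rho` are all assumptions made for illustration. The paper's method works on dense flow fields with smoothness and incompressibility priors and is implemented in JAX, none of which is reproduced here.

```python
def consensus_admm(estimates, weights, rho=1.0, iters=300):
    """Toy consensus ADMM for  min sum_i 0.5*w_i*(x_i - a_i)^2  s.t. x_i = z.

    estimates: proposals a_i from hypothetical base estimators
    weights:   assumed confidence w_i > 0 for each estimator
    Returns the consensus value z.
    """
    n = len(estimates)
    z = 0.0
    u = [0.0] * n  # scaled dual variables, one per estimator
    for _ in range(iters):
        # Local step: each estimator minimizes its own cost plus the penalty.
        x = [(w * a + rho * (z - ui)) / (w + rho)
             for a, w, ui in zip(estimates, weights, u)]
        # Consensus step: average the local solutions plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each estimator's disagreement with z.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z


z = consensus_admm([1.0, 2.0, 4.0], [1.0, 1.0, 2.0])
```

For quadratic costs the fixed point is the confidence-weighted mean of the proposals, so `z` should approach (1 + 2 + 8) / 4 = 2.75 here. Richer costs or priors would change only the local step.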
Embodied Image Compression
Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to Embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for Embodied agents, thereby accelerating the deployment of Embodied AI in the real world.
comment: 15 pages, 12 figures, 3 tables
Multimodal Learning for Scalable Representation of High-Dimensional Medical Data
Integrating artificial intelligence (AI) with healthcare data is rapidly transforming medical diagnostics and driving progress toward precision medicine. However, effectively leveraging multimodal data, particularly digital pathology whole slide images (WSIs) and genomic sequencing, remains a significant challenge due to the intrinsic heterogeneity of these modalities and the need for scalable and interpretable frameworks. Existing diagnostic models typically operate on unimodal data, overlooking critical cross-modal interactions that can yield richer clinical insights. We introduce MarbliX (Multimodal Association and Retrieval with Binary Latent Indexed matriX), a self-supervised framework that learns to embed WSIs and immunogenomic profiles into compact, scalable binary codes, termed "monogram." By optimizing a triplet contrastive objective across modalities, MarbliX captures high-resolution patient similarity in a unified latent space, enabling efficient retrieval of clinically relevant cases and facilitating case-based reasoning. In lung cancer, MarbliX achieves 85-89% across all evaluation metrics, outperforming histopathology (69-71%) and immunogenomics (73-76%). In kidney cancer, real-valued monograms yield the strongest performance (F1: 80-83%, Accuracy: 87-90%), with binary monograms slightly lower (F1: 78-82%).
MarsQE: Semantic-Informed Quality Enhancement for Compressed Martian Image
Lossy image compression is essential for Mars exploration missions, due to the limited bandwidth between Earth and Mars. However, the compression may introduce visual artifacts that complicate the geological analysis of the Martian surface. Existing quality enhancement approaches, primarily designed for Earth images, fall short for Martian images due to a lack of consideration for the unique Martian semantics. In response to this challenge, we conduct an in-depth analysis of Martian images, yielding two key insights based on semantics: the presence of texture similarities and the compact nature of texture representations in Martian images. Inspired by these findings, we introduce MarsQE, an innovative, semantic-informed, two-phase quality enhancement approach specifically designed for Martian images. The first phase involves the semantic-based matching of texture-similar reference images, and the second phase enhances image quality by transferring texture patterns from these reference images to the compressed image. We also develop a post-enhancement network to further reduce compression artifacts and achieve superior compression quality. Our extensive experiments demonstrate that MarsQE significantly outperforms existing approaches for Earth images, establishing a new benchmark for the quality enhancement on Martian images.
UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis
The concept of waterbody style transfer remains largely unexplored in the underwater imaging and vision literature. Traditional image style transfer (STx) methods primarily focus on artistic and photorealistic blending, often failing to preserve object and scene geometry in images captured in high-scattering mediums such as underwater. The wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts further complicate learning underwater image STx from unpaired data. This paper introduces UStyle, the first data-driven learning framework for transferring waterbody styles across underwater images without requiring prior reference images or scene information. We propose a novel depth-aware whitening and coloring transform (DA-WCT) mechanism that integrates physics-based waterbody synthesis to ensure perceptually consistent stylization while preserving scene structure. To enhance style transfer quality, we incorporate carefully designed loss functions that guide UStyle to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content in VGG and CLIP (contrastive language-image pretraining) feature spaces. By addressing domain-specific challenges, UStyle provides a robust framework for no-reference underwater image STx, surpassing state-of-the-art (SOTA) methods that rely solely on end-to-end reconstruction loss. Furthermore, we introduce the UF7D dataset, a curated collection of high-resolution underwater images spanning seven distinct waterbody styles, establishing a benchmark to support future research in underwater image STx. The UStyle inference pipeline and UF7D dataset are released at: https://github.com/uf-robopi/UStyle.
Multimedia
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similar performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
Extending a Parliamentary Corpus with MPs' Tweets: Automatic Annotation and Evaluation Using MultiParTweet LREC 2026
Social media serves as a critical medium in modern politics because it both reflects politicians' ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians' social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39 546 tweets, including 19 056 media items. Furthermore, we used nine text-based models and one vision-language model (VLM) to annotate MultiParTweet with emotion, sentiment, and topic labels. Moreover, the automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether the models can predict each other using the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more with human interpretation.
comment: Submitted to LREC 2026
HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
comment: 18 pages, 8 figures
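The HFS abstract above names Gumbel-Softmax as the device that makes frame selection differentiable. A minimal sketch of that relaxation alone, assuming per-frame relevance scores are already available, might look like the following. The function name `gumbel_softmax`, the temperature `tau`, and the example scores are hypothetical. The paper's query vectors, set-level objective, and student-teacher training are not reproduced.

```python
import math
import random


def gumbel_softmax(scores, tau=0.5, rng=random):
    """Return a relaxed one-hot sample over frames from per-frame scores."""
    noisy = []
    for s in scores:
        # Clamp the uniform draw away from 0 so both logarithms stay finite.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((s + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled scores.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical relevance scores for four candidate frames.
frame_weights = gumbel_softmax([2.0, 0.5, 0.1, 1.5], tau=0.5,
                               rng=random.Random(0))
```

Lower values of `tau` push the weights toward a hard one-hot choice, while higher values flatten them toward uniform. In a trainable selector the same computation would run inside an automatic-differentiation framework so that gradients reach the scores.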
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address these challenges, we propose Video Echoed in Music (VeM), a latent music diffusion model that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining temporal coherence through position and duration encoding. For rhythmic precision, the frame-level transition-beat aligner and adapter (TB-As) dynamically synchronize visual scene transitions with music beats. We further contribute a novel video-music paired dataset sourced from e-commerce advertisements and video-sharing platforms, which imposes stricter transition-beat synchronization requirements. Meanwhile, we introduce novel metrics tailored to the task. Experimental results demonstrate the superiority of VeM, particularly in semantic relevance and rhythmic precision.
Recent Advances in Discrete Speech Tokens: A Review
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
comment: 26 pages, 8 figures, 3 tables. Accepted to IEEE TPAMI
Computation and Language
SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support
Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.
comment: Code available at https://github.com/Harry20030331/SumForU
From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines
While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.
comment: 6 pages, 1 figure
Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks
Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
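To get a feel for the bound stated in the speculative decoding abstract above, the sketch below evaluates only its leading term, (mu + mu2) * log(P) / mu^2, for a few verifier capacities. The function name and the numeric values are hypothetical, the O(1) remainder is dropped, and the natural logarithm is an assumption, since the abstract does not state the base.

```python
import math


def expected_tokens_leading_term(mu, mu2, capacity):
    """Leading term (mu + mu2) * log(P) / mu^2 of the stated bound.

    mu:       expected entropy of the verifier's output distribution
    mu2:      expected second log-moment
    capacity: verifier capacity P
    The O(1) remainder is omitted and the natural log is an assumption.
    """
    if mu <= 0 or capacity < 1:
        raise ValueError("need mu > 0 and capacity >= 1")
    return (mu + mu2) * math.log(capacity) / mu ** 2


# Hypothetical values, chosen only to show the logarithmic growth in P.
for capacity in (8, 64, 512):
    print(capacity, round(expected_tokens_leading_term(2.0, 5.0, capacity), 3))
```

Because the term grows with log(P), squaring the capacity (8 to 64) only doubles it, which is the sense in which extra verification parallelism yields diminishing returns under this bound.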
Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling
Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
comment: This is a preprint of a manuscript submitted to Digital Scholarship in the Humanities (Oxford University Press). The paper is currently under peer review
Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols
Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline -- both the retriever and the generator -- as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unknown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context supports the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations -- without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.
comment: 34 pages, 19 figures
Visualizing token importance for black-box language models
We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches for LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question -- can we understand how the outputs of black-box LLMs depend on each input token? There is a critical need to have such tools in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem, as LLMs are stochastic functions (i.e., two outputs will differ by chance), while computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight model-agnostic procedure to evaluate the sensitivity of the output of a language model for each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of LLMs' reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.
Extending a Parliamentary Corpus with MPs' Tweets: Automatic Annotation and Evaluation Using MultiParTweet LREC 2026
Social media serves as a critical medium in modern politics because it both reflects politicians' ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians' social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39 546 tweets, including 19 056 media items. Furthermore, we enriched the annotation with nine text-based models and one vision-language model (VLM) to annotate MultiParTweet with emotion, sentiment, and topic annotations. Moreover, the automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether the models can predict each other using the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more with human interpretation.
comment: Submitted to LREC 2026
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.
HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
comment: 18 pages, 8 figures
Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs
Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.
Building Patient Journeys in Hebrew: A Language Model for Clinical Timeline Extraction IJCAI 2025
We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets -- one from internal medicine and emergency departments, and another from oncology -- annotated for event temporal relations. Our results show that our model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.
comment: In Proceedings of the Workshop on Large Language Models and Generative AI for Health Informatics 2025, IJCAI 2025, Montreal, Canada
Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning
Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1%), showing that it is a strong training-free alternative for complex reasoning.
Rethinking Expert Trajectory Utilization in LLM Post-training
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
comment: 24 pages, 5 figures, under review
CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare
Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
comment: 49 pages, 31 figures
Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models
Reliable in silico molecular toxicity prediction is a cornerstone of modern drug discovery, offering a scalable alternative to experimental screening. However, the black-box nature of state-of-the-art models remains a significant barrier to adoption, as high-stakes safety decisions demand verifiable structural insights alongside predictive performance. To address this, we propose a novel multi-task learning (MTL) framework designed to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint. The resulting framework is trained end-to-end and is readily adaptable to various transformer-based backbones. Evaluated on the ClinTox, SIDER, and Tox21 benchmark datasets, our approach consistently outperforms both single-task and standard MTL baselines. Crucially, the sparse attention weights provide chemically intuitive visualizations that reveal the specific fragments influencing predictions, thereby enhancing insight into the model's decision-making process.
comment: 6 pages, 4 figures
Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
Vision-Language Models (VLMs) are able to process increasingly long videos. Yet, important visual information is easily lost within the long context and missed by VLMs. It is also important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis
We investigated the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection, under controlled training conditions. We observed that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.
comment: To appear at IEEE Big Data 2025
Mining Legal Arguments to Study Judicial Formalism
Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts' decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs for the Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6% macro-F1), classify traditional types of legal argument (77.5% macro-F1), and classify decisions as formalistic/non-formalistic (83.2% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable judicial philosophy classification and shows the potential of legal argument mining for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at https://github.com/trusthlt/madon.
comment: pre-print under review
qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs AAAI 2026
The deployment of large language models for specialized tasks often requires domain-specific parameter-efficient finetuning through Low-Rank Adaptation (LoRA) modules. However, effectively fusing these adapters to handle complex, multi-domain composite queries remains a critical challenge. Existing LoRA fusion approaches either use static weights, which assign equal relevance to each participating LoRA, or require data-intensive supervised training for every possible LoRA combination to obtain respective optimal fusion weights. We propose qa-FLoRA, a novel query-adaptive data-and-training-free method for LoRA fusion that dynamically computes layer-level fusion weights by measuring distributional divergence between the base model and respective adapters. Our approach eliminates the need for composite training data or domain-representative samples, making it readily applicable to existing adapter collections. Extensive experiments across nine multilingual composite tasks spanning mathematics, coding, and medical domains, show that qa-FLoRA outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and the training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing the gap with supervised baselines. Further, layer-level analysis of our fusion weights reveals interpretable fusion patterns, demonstrating the effectiveness of our approach for robust multi-domain adaptation.
comment: Accepted at AAAI 2026 (Main Technical Track)
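As a rough illustration of divergence-based fusion weighting, the sketch below compares each adapter's output distribution at one layer with the base model's and normalizes the divergences into weights. The abstract does not say which divergence or normalization qa-FLoRA uses, so the choice of KL divergence, the sum normalization, and the names `kl_divergence` and `fusion_weights` are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

def fusion_weights(base_dist, adapter_dists):
    """Hypothetical per-layer weights: adapters that shift the base distribution
    more for this query receive a larger share (assumed rule, not the paper's)."""
    divergences = [kl_divergence(d, base_dist) for d in adapter_dists]
    total = sum(divergences)
    if total == 0:
        # No adapter changes the distribution: fall back to uniform weights.
        return [1.0 / len(adapter_dists)] * len(adapter_dists)
    return [d / total for d in divergences]

base = [0.5, 0.5]
adapters = [[0.5, 0.5], [0.9, 0.1]]  # the first adapter leaves the output unchanged
weights = fusion_weights(base, adapters)
```

Under this assumed rule, an adapter that does not alter the base distribution for the current query receives zero weight, which matches the intuition that it is irrelevant to that query.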
Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture
Large Language Model agents face fundamental challenges in adapting to novel tasks due to limitations in tool availability and experience reuse. Existing approaches either rely on predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to inefficient exploration and suboptimal performance. We introduce SMITH (Shared Memory Integrated Tool Hub), a unified cognitive architecture that seamlessly integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization. SMITH organizes agent memory into procedural, semantic, and episodic components, enabling systematic capability expansion while preserving successful execution patterns. Our approach formalizes tool creation as iterative code generation within controlled sandbox environments and experience sharing through episodic memory retrieval with semantic similarity matching. We further propose a curriculum learning strategy based on agent-ensemble difficulty re-estimation. Extensive experiments on the GAIA benchmark demonstrate SMITH's effectiveness, achieving 81.8% Pass@1 accuracy and outperforming state-of-the-art baselines including Alita (75.2%) and Memento (70.9%). Our work establishes a foundation for building truly adaptive agents that continuously evolve their capabilities through principled integration of tool creation and experience accumulation.
LegalRikai: Open Benchmark -- A Benchmark for Complex Japanese Corporate Legal Tasks
This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, while assessing structural consistency remains a challenge. These results demonstrate the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain.
CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise
Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non-causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving a 2.6-point improvement in Attributable Rate, a 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API-level profiling further shows that CIP accelerates contextual understanding and reduces end-to-end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models.
AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%, making it a practical solution for efficient and adaptive LLM inference.
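The two quantities named in the abstract, token entropy and Jensen-Shannon distance, can be computed as in the sketch below. The abstract does not give the threshold update rule, so `should_stop_drafting` and `accept_token` and the way they compare against thresholds are hypothetical stand-ins rather than AdaSD's actual criteria.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence.
    Base-2 logarithms keep the value in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

def should_stop_drafting(draft_probs, entropy_threshold):
    """Hypothetical rule: stop drafting when the draft model is too uncertain."""
    return entropy(draft_probs) > entropy_threshold

def accept_token(draft_probs, target_probs, js_threshold):
    """Hypothetical rule: accept when draft and target distributions are close."""
    return js_distance(draft_probs, target_probs) <= js_threshold

identical = js_distance([0.7, 0.3], [0.7, 0.3])
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])
```

In an adaptive scheme, the two thresholds would be updated from the entropy and distance values observed during decoding; how AdaSD does this is not specified in the abstract.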
When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents
Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging -- annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain compared to the vanilla Qwen3-1.7B base model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.
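The group-relative step that gives GRPO its name can be sketched as follows: several responses are sampled for the same prompt, each is scored, and each reward is standardized against its own group. The reward weights in `combined_reward` are hypothetical, since the abstract only says rewards are designed around tool accuracy and answer correctness, and the use of the population standard deviation is an implementation choice that varies between codebases.

```python
import statistics

def combined_reward(tool_correct, answer_correct, w_tool=0.5, w_answer=0.5):
    """Hypothetical reward: weighted sum of tool accuracy and answer correctness."""
    return w_tool * float(tool_correct) + w_answer * float(answer_correct)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses for one prompt: (tool call correct?, final answer correct?)
group = [(True, True), (True, False), (False, False), (True, True)]
rewards = [combined_reward(t, a) for t, a in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive positive advantages and are reinforced, while a group with identical rewards yields zero advantages and therefore no learning signal.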
Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach
Systematic reviews are a key component of evidence-based medicine, playing a critical role in synthesizing existing research evidence and guiding clinical decisions. However, with the rapid growth of research publications, conducting systematic reviews has become increasingly burdensome, with title and abstract screening being one of the most time-consuming and resource-intensive steps. To mitigate this issue, we designed a two-stage dynamic few-shot learning (DFSL) approach aimed at improving the efficiency and performance of large language models (LLMs) in the title and abstract screening task. Specifically, this approach first uses a low-cost LLM for initial screening, then re-evaluates low-confidence instances using a high-performance LLM, thereby enhancing screening performance while controlling computational costs. We evaluated this approach across 10 systematic reviews, and the results demonstrate its strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate the systematic review process in practical applications.
comment: 22 pages, 3 figures
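The two-stage routing described above can be sketched as a confidence-gated cascade. The model callables, the 0.8 threshold, and the function name `two_stage_screen` are placeholders invented for illustration, and the dynamic selection of few-shot examples is omitted because the abstract does not describe it.

```python
from typing import Callable, Tuple

# A screener returns (include?, confidence in [0, 1]) for one title/abstract record.
Screener = Callable[[str], Tuple[bool, float]]

def two_stage_screen(record: str, cheap_model: Screener, strong_model: Screener,
                     confidence_threshold: float = 0.8) -> Tuple[bool, str]:
    """Use the cheap model first; escalate only low-confidence records."""
    include, confidence = cheap_model(record)
    if confidence >= confidence_threshold:
        return include, "cheap"
    include, _ = strong_model(record)
    return include, "strong"

# Stub screeners standing in for real LLM calls (hypothetical behavior).
def cheap_stub(record: str) -> Tuple[bool, float]:
    return ("randomized" in record, 0.95 if "trial" in record else 0.4)

def strong_stub(record: str) -> Tuple[bool, float]:
    return ("cohort" in record or "randomized" in record, 0.9)

easy = two_stage_screen("A randomized controlled trial of drug X", cheap_stub, strong_stub)
hard = two_stage_screen("A cohort study of drug X", cheap_stub, strong_stub)
```

Cost is controlled by the threshold: raising it sends more records to the stronger model, while lowering it keeps more decisions with the cheaper one.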
Multi-Intent Spoken Language Understanding: Methods, Trends, and Challenges
Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU.
Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference
We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.
comment: 6 pages, 3 tables, 1 figure
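To make the idea of sublinear freeze scheduling concrete, the sketch below lets the freeze duration grow with the square root of the number of times a token has been flagged as low-importance. The square-root schedule, the base duration, and the names `freeze_duration` and `FreezeTracker` are assumptions; the abstract states only that the duration grows sublinearly.

```python
import math

def freeze_duration(detections: int, base_steps: int = 2) -> int:
    """Hypothetical schedule: freeze for base_steps * sqrt(detections) decoding steps."""
    if detections <= 0:
        return 0
    return math.ceil(base_steps * math.sqrt(detections))

class FreezeTracker:
    """Tracks how often each token position was flagged and until when it is frozen."""

    def __init__(self):
        self.detections = {}
        self.frozen_until = {}

    def flag_low_importance(self, position: int, step: int) -> None:
        count = self.detections.get(position, 0) + 1
        self.detections[position] = count
        self.frozen_until[position] = step + freeze_duration(count)

    def is_frozen(self, position: int, step: int) -> bool:
        return step < self.frozen_until.get(position, 0)

tracker = FreezeTracker()
tracker.flag_low_importance(position=7, step=10)  # first detection: frozen for 2 steps
```

Because the duration grows more slowly than the detection count, a token that is flagged repeatedly stays frozen longer each time but still returns to the active cache regularly, which is one way to avoid over-aggressive compression.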
FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration
Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: there are no principled mechanisms to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.
SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing
SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale, scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding including scholarly document processing.
comment: 12 pages, 2 figures, 3 tables
DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting
Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede's VSM13 international surveys to understand the cultural alignment of the following models: DeepSeek-V3, V3.1, GPT-4, GPT-4.1, GPT-4o, and GPT-5. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model's alignment to reflect a specific country, to align these LLMs with the United States and China. Our results show that DeepSeek-V3, V3.1, and OpenAI's GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
comment: Meta requires a more thorough internal review process to ensure paper quality and experiments, as well as compliance with the internal research publishing process
Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence
As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We find that our method yields verbalized confidences that correlate well with observed error rates, even when compared to strong baselines, some of which are more than twenty times slower at inference time. Additionally, we demonstrate that our method can be applied to black-box models that allow API-based fine-tuning, resulting in estimates of uncertainty that are both more effective and more efficient than any of our baselines.
MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings
A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.
comment: 19 pages (29 with appendix), 8 figures
Sorting the Babble in Babel: Assessing the Performance of Language Identification Algorithms on the OpenAlex Database
This project aims to optimize the linguistic indexing of the OpenAlex database by comparing the performance of various Python-based language identification procedures on different metadata corpora extracted from a manually-annotated article sample. The precision and recall performance of each algorithm, corpus, and language is first analyzed, followed by an assessment of processing speeds recorded for each algorithm and corpus type. These different performance measures are then simulated at the database level using probabilistic confusion matrices for each algorithm, corpus, and language, as well as a probabilistic modeling of relative article language frequencies for the whole OpenAlex database. Results show that procedure performance strongly depends on the importance given to each of the measures implemented: for contexts where precision is preferred, using the LangID algorithm on the greedy corpus gives the best results; however, for all cases where recall is considered at least slightly more important than precision or as soon as processing times are given any kind of consideration, the procedure that consists in the application of the FastSpell algorithm on the Titles corpus outperforms all other alternatives. Given the lack of truly multilingual large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic and comprehensive measurement and evaluation.
comment: 32 pages, 4 figures
ReCode: Unify Plan and Action for Universal Granularity Control
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
Empirical Analysis of the Effect of Context in the Task of Automated Essay Scoring in Transformer-Based Models
Automated Essay Scoring (AES) has risen to prominence in response to the growing demand for educational automation. Providing an objective and cost-effective solution, AES standardises the assessment of extended responses. Although substantial research has been conducted in this domain, recent investigations reveal that alternative deep-learning architectures outperform transformer-based models. Given the dominance of transformer architectures across various other tasks, this discrepancy motivates enriching transformer-based AES models with additional context. This study examines diverse contextual factors using the ASAP-AES dataset, analysing their impact on transformer-based model performance. Our most effective model, augmented with multiple contextual dimensions, achieves a mean Quadratic Weighted Kappa score of 0.823 across the entire essay dataset and 0.8697 when trained on individual essay sets. Although it clearly surpasses prior transformer-based models, this augmented approach underperforms the state-of-the-art deep learning model trained essay-set-wise by an average of only 3.83%, while exhibiting superior performance in three of the eight sets. Importantly, this enhancement is orthogonal to architecture-based advancements and readily adaptable to any AES model. Consequently, this contextual augmentation methodology presents a versatile technique for refining AES capabilities, contributing to the evolution of automated grading and evaluation in educational settings.
comment: MSc Dissertation
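Note on the metric: the abstract above reports Quadratic Weighted Kappa (QWK) scores. The following is a minimal sketch of the standard QWK computation for two lists of integer ratings; the function name and the toy ratings are illustrative assumptions and are not taken from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Standard QWK between two equally long lists of integer ratings."""
    n = max_rating - min_rating + 1          # number of rating categories (must be > 1)
    total = len(rater_a)

    # Observed agreement matrix: observed[i][j] counts (a = i, b = j) pairs.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)   # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / total   # agreement expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


# Toy ratings on a 1-4 scale (hypothetical data).
same = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
```

Identical ratings give a kappa of 1, and systematically reversed ratings give a negative value, which is why scores such as 0.823 indicate strong agreement with human graders.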
Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention AAAI2026
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities, but aligning their outputs with human preferences typically requires expensive supervised fine-tuning. Recent test-time methods leverage textual feedback to overcome this, but they often critique and revise a single candidate response, lacking a principled mechanism to systematically analyze, weigh, and synthesize the strengths of multiple promising candidates. Such a mechanism is crucial because different responses may excel in distinct aspects (e.g., clarity, factual accuracy, or tone), and combining their best elements may produce a far superior outcome. This paper proposes the Textual Self-Attention Network (TSAN), a new paradigm for test-time preference optimization that requires no parameter updates. TSAN emulates self-attention entirely in natural language to overcome this gap: it analyzes multiple candidates by formatting them into textual keys and values, weighs their relevance using an LLM-based attention module, and synthesizes their strengths into a new, preference-aligned response under the guidance of the learned textual attention. This entire process operates in a textual gradient space, enabling iterative and interpretable optimization. Empirical evaluations demonstrate that with just three test-time iterations on a base SFT model, TSAN outperforms supervised models like Llama-3.1-70B-Instruct and surpasses the current state-of-the-art test-time alignment method by effectively leveraging multiple candidate solutions.
comment: AAAI2026
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished by its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. In addition, it is capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding involves a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput to improve the discriminativeness of embeddings. M3-Embedding exhibits superior performance in our experiments, leading to new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised. We present a controlled evaluation framework that probes reasoning reliability through four stress tests: (1) rule deletion, removing redundant versus essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites based on equivalence laws (contraposition, double negation, implication-to-disjunction, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that composes 2--5 transformations. Across three representative model families -- BERT, Qwen2, and LLaMA-like models -- all models attain Acc$=1.0000$ on the base split and show no degradation under redundant rule deletion. In contrast, essential rule deletion yields a pronounced decrease to near-chance performance, and injecting explicit contradictions reduces accuracy to 0.0000. Under logic-preserving rewrites, accuracy is largely preserved for single-law transformations with only small degradations in a few cases, whereas multi-law stacking exposes model-dependent sensitivity: BERT matches the base condition, TinyLlama shows only marginal degradation, and Qwen2 exhibits a substantial drop. Overall, the results indicate that contemporary LLMs are generally stable under semantic-preserving reformulations, yet remain brittle to missing or inconsistent evidence and may degrade under composed logical transformations depending on the model family. The proposed framework provides a concise diagnostic tool for isolating these failure modes and for evaluating logical generalisation beyond surface-form variation.
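Note on the equivalence laws: the logic-preserving rewrites listed in the abstract above (contraposition, double negation, implication-to-disjunction, De Morgan, identity, and commutativity) can each be checked by exhaustive truth tables. The sketch below does this for two propositional variables; the dictionary layout and helper names are illustrative assumptions, not the paper's evaluation code.

```python
from itertools import product


def implies(p, q):
    """Material implication given by its truth table: false only when p is true and q is false."""
    return q if p else True


# Each law maps to a (left-hand side, right-hand side) pair over variables p and q.
LAWS = {
    "contraposition": (lambda p, q: implies(p, q), lambda p, q: implies(not q, not p)),
    "double_negation": (lambda p, q: p, lambda p, q: not (not p)),
    "implication_to_disjunction": (lambda p, q: implies(p, q), lambda p, q: (not p) or q),
    "de_morgan": (lambda p, q: not (p and q), lambda p, q: (not p) or (not q)),
    "identity": (lambda p, q: p and True, lambda p, q: p),
    "commutativity": (lambda p, q: p and q, lambda p, q: q and p),
}


def is_equivalent(lhs, rhs):
    """True when both formulas agree on every assignment of p and q."""
    return all(lhs(p, q) == rhs(p, q) for p, q in product([False, True], repeat=2))


results = {name: is_equivalent(lhs, rhs) for name, (lhs, rhs) in LAWS.items()}

# The converse is NOT equivalent to the original implication, so this check fails as expected.
converse_ok = is_equivalent(lambda p, q: implies(p, q), lambda p, q: implies(q, p))
```

Because every listed law holds on all assignments, a rewrite built from them changes only the surface form of a rule and never its truth conditions.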
MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure
Low-Rank Adaptation (LoRA) is a widely adopted technique for parameter-efficient fine-tuning, but its slow convergence has spurred the development of numerous variants. Nevertheless, existing methods often fail to improve performance, memory footprint, and computational efficiency simultaneously. To address this challenge, we revisit the causes of LoRA's slow convergence. Building on these insights, we propose Matrix Shard Sharing (MiSS), which updates shards of the original weight matrix using a single shared trainable matrix $\boldsymbol{D}$, initialized to zeros. To simultaneously ensure computational efficiency, low memory footprint, and scalable serving, we introduce MiSS$^e$. Both theoretical analysis and empirical results demonstrate that our method reduces optimization complexity without compromising performance, thereby achieving a more favorable trade-off among performance, memory, and efficiency. Furthermore, we conduct a comprehensive comparative analysis of various PEFT methods, evaluating their memory usage, initialization overhead, and computational efficiency. By mapping the Pareto frontier across these dimensions, we show that MiSS occupies a favorable position, effectively capturing the advantages of prior approaches.
MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, adoption of LMMs in real-world tasks is hindered by their poor performance in tasks that require a combination of VL capabilities, as well as in tasks that involve the grounding of complex text or visual instructions. To thoroughly investigate this gap and its underlying causes, we propose MOAT, a diverse benchmark with 1005 complex real-world vision questions that are straightforward for humans but challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 9 VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential for many real-world applications. We evaluated 17 proprietary and open source LMMs, finding that the best performing LMM (Gemini 2.5 Pro) achieved only 44% accuracy, far below what would be acceptable in real-world applications. To guide future model development, we analyze common trends in our results and discuss the underlying causes of poor performance, focusing on the impact of text-centric reasoning, which VL capabilities form bottlenecks in complex tasks, and the potential harmful effects of tiling. Code and data are available at https://cambrian-yzt.github.io/MOAT/.
comment: Project page: https://cambrian-yzt.github.io/MOAT
Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5
Encoder-decoder models such as FLAN-T5 are finetuned to follow instructions, but often fail when the instructions conflict with memorized continuations ingrained during training. To understand this behavior, we adapt DoLa to FLAN-T5 and examine how representations evolve in the decoder. Our findings show that T5's intermediate layers undergo rapid shifts driven by cross-attention to the encoder. When projected through the language modeling head, each depth presents highly volatile token preferences, leading to unreliable behavior with contrastive decoding. Motivated by this, we introduce a gradient-based activation-steering method that injects an "instruction-compliance" direction into mid-decoder layers, where the representation is both meaningful and still malleable. This intervention dramatically improves MemoTrap performance (52% to 99.7%), demonstrating that mechanistic steering can succeed where contrastive decoding fails in Seq2Seq architectures.
RECAP: REwriting Conversations for Intent Understanding in Agentic Planning
Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines, in terms of plan preference. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agentic planning in open-domain dialogue systems.
Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond
Little research explores the correlation between the expressive ability and generalization ability of low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat Minima LoRA (FMLoRA) and its efficient version, EFMLoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFMLoRA achieves optimization efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFMLoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, there are performance improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, which is overlooked by previous methods.
Large Continual Instruction Assistant
Continual Instruction Tuning (CIT) is adopted to continually instruct Large Models to follow human intent, dataset by dataset. It is observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. In contrast, the Exponential Moving Average (EMA) can trace previous parameters, which helps reduce forgetting. Nonetheless, its constant balance weight fails to cope with the ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update, we propose an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stable-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters for the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.
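Note on the EMA update: the abstract above builds on an Exponential Moving Average of parameters controlled by a balance weight. The sketch below shows only the textbook EMA step with a hand-picked weight; the name `beta` and the toy parameter lists are assumptions, and the paper's rule for deriving the weight from gradients and learned parameters is not reproduced here.

```python
def ema_update(ema_params, new_params, beta):
    """One textbook EMA step: beta weights the old average, (1 - beta) the new parameters."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * old + (1.0 - beta) * new for old, new in zip(ema_params, new_params)]


# Toy parameters (hypothetical values).
previous = [1.0, 2.0]
freshly_trained = [3.0, 4.0]

balanced = ema_update(previous, freshly_trained, beta=0.5)   # midpoint of old and new
stable = ema_update(previous, freshly_trained, beta=1.0)     # keeps the old parameters
plastic = ema_update(previous, freshly_trained, beta=0.0)    # adopts the new parameters
```

The two extremes illustrate the trade-off named in the abstract: a weight near 1 favours stability (less forgetting), while a weight near 0 favours plasticity (faster adaptation to the new dataset).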
Textual Data Bias Detection and Mitigation -- An Extensible Pipeline with Experimental Evaluation
Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation
Despite progress in melody-to-lyric generation, a substantial singability gap remains between machine-generated lyrics and those written by human lyricists. In this work, we aim to narrow this gap by jointly learning both wording and formatting for melody-to-lyric generation. After general-domain pretraining, our model acquires length awareness through a self-supervised stage trained on a large text-only lyric corpus. During supervised melody-to-lyric training, we introduce multiple auxiliary supervision objectives informed by musicological findings on melody-lyric relationships, encouraging the model to capture fine-grained prosodic and structural patterns. Compared with naïve fine-tuning, our approach improves adherence to line-count and syllable-count requirements by 3.8% and 21.4% absolute, respectively, without degrading text quality. In human evaluation, it achieves 42.2% and 74.2% relative gains in overall quality over two task-specific baselines, underscoring the importance of formatting-aware training for generating singable lyrics.
comment: An extension of our previous work arXiv:2305.16816 [cs.CL]
HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance
Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), has emerged as a promising approach to fine-tuning large language models (LLMs) while reducing computational and memory overhead. However, LoRA assumes a uniform rank r for each incremental matrix, not accounting for the varying significance of weight matrices across different modules and layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize updates and employs pruning of singular values to introduce dynamic rank allocation, thereby enhancing adaptability. However, during training, it often suffers from slow convergence and high computational overhead. To address these issues, we propose HyperAdaLoRA, a novel framework that accelerates the convergence of AdaLoRA by leveraging a hypernetwork. Instead of directly optimizing the components of the Singular Value Decomposition $(P, Λ, Q)$, HyperAdaLoRA employs a hypernetwork based on attention mechanisms to dynamically generate these parameters. By pruning the outputs of the hypernetwork that generates the singular values, dynamic rank allocation is achieved. Comprehensive experiments on various datasets and models demonstrate that our method achieves faster convergence without sacrificing performance. Additionally, further extension experiments on other LoRA-based approaches validate the broad applicability of our method.
comment: 13 pages
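Note on SVD-style rank allocation: the abstract above describes an update parameterized as $(P, Λ, Q)$ whose effective rank is controlled by pruning singular values. The sketch below illustrates that idea on plain Python lists. Pruning by absolute magnitude is a simplifying assumption made here for illustration, and all names and values are hypothetical; neither the hypernetwork nor the actual pruning criterion of the paper is reproduced.

```python
def prune_singular_values(lam, keep):
    """Zero out all but the `keep` largest-magnitude entries of lam (assumed criterion)."""
    order = sorted(range(len(lam)), key=lambda i: abs(lam[i]), reverse=True)
    kept = set(order[:keep])
    return [value if i in kept else 0.0 for i, value in enumerate(lam)]


def low_rank_update(P, lam, Q):
    """Compute the weight update P * diag(lam) * Q with plain nested lists."""
    rows, rank, cols = len(P), len(lam), len(Q[0])
    return [
        [sum(P[i][k] * lam[k] * Q[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


# Toy factors (hypothetical values): a 2x3 P, three singular values, and a 3x2 Q.
P = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
Q = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
lam = [0.9, 0.05, 0.4]

pruned = prune_singular_values(lam, keep=2)   # drops the smallest entry
delta_w = low_rank_update(P, pruned, Q)       # update now has rank at most 2
```

Zeroing an entry of the diagonal removes the corresponding rank-one component from the update, which is how a per-matrix rank budget can differ across modules.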
Scalable Best-of-N Selection for Large Language Models via Self-Certainty NeurIPS 2025
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
comment: NeurIPS 2025
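Note on reward-free Best-of-N selection: the abstract above scores each sampled response from the model's own output distributions and keeps the most confident one. The sketch below uses one plausible confidence score, the average KL divergence from the uniform distribution to each per-token distribution. This scoring formula is an assumption made for illustration and may differ from the paper's exact definition; the function names and toy distributions are hypothetical.

```python
import math


def token_certainty(probs):
    """KL(uniform || probs) for one token distribution; 0 for uniform, larger when peaked."""
    vocab = len(probs)
    return -sum(math.log(vocab * p) for p in probs) / vocab


def response_certainty(token_dists):
    """Average token-level certainty over all generated positions of one response."""
    return sum(token_certainty(dist) for dist in token_dists) / len(token_dists)


def best_of_n(candidates):
    """Index of the candidate response with the highest certainty score."""
    return max(range(len(candidates)), key=lambda i: response_certainty(candidates[i]))


# Two toy responses over a 4-token vocabulary (hypothetical probabilities, all > 0).
hesitant = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]

chosen = best_of_n([hesitant, confident])
```

Because the score needs only the probabilities already produced during generation, selection adds no reward-model forward passes, which is the efficiency argument made in the abstract.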
The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. Addressing these concerns, this paper synthesizes recent studies and investigates the landscape of memorization, the factors influencing it, and methods for its detection and mitigation. We explore key drivers, including training data duplication, training dynamics, and fine-tuning procedures that influence data memorization. In addition, we examine methodologies such as prefix-based extraction, membership inference, and adversarial prompting, assessing their effectiveness in detecting and measuring memorized content. Beyond technical analysis, we also explore the broader implications of memorization, including the legal and ethical implications. Finally, we discuss mitigation strategies, including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the need to minimize harmful memorization with model utility. This paper provides a comprehensive overview of the current state of research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.
BAMBO: Construct Ability and Efficiency LLM Pareto Set via Bayesian Adaptive Multi-objective Block-wise Optimization
Constructing a Pareto set is pivotal for navigating the capability-efficiency trade-offs in Large Language Models (LLMs); however, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the "curse of dimensionality," rendering the search space computationally intractable. To resolve this dichotomy, we propose BAMBO (Bayesian Adaptive Multi-objective Block-wise Optimization), a novel framework that automatically constructs the LLM Pareto set. BAMBO renders the search tractable by introducing a Hybrid Optimal Block Partitioning strategy. Formulated as a 1D clustering problem, this strategy leverages a dynamic programming approach to optimally balance intra-block homogeneity and inter-block information distribution, thereby dramatically reducing dimensionality without sacrificing critical granularity. The entire process is automated within an evolutionary loop driven by the q-Expected Hypervolume Improvement (qEHVI) acquisition function. Experiments demonstrate that BAMBO discovers a superior and more comprehensive Pareto frontier than baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: https://github.com/xin8coder/BAMBO.
Annotation-Free Reinforcement Learning Query Rewriting via Verifiable Search Reward
Optimizing queries for Retrieval-Augmented Generation (RAG) systems poses a significant challenge, particularly across diverse modal indices. We introduce RL-QR, a novel annotation-free reinforcement learning framework for query rewriting that eliminates the need for costly human-annotated data. By leveraging verifiable search rewards derived from index-aligned synthetic queries, RL-QR overcomes human-annotation dependencies, extending its applicability to various modalities and index domains. Experimental results demonstrate the framework's robustness, achieving substantial retrieval performance gains of up to 3.9× on lexical retrievers and 3.5× on semantic retrievers on the MTEB VIDORE V2 benchmark for unstructured visual documents, along with consistent 5% to 10% improvements on MS MARCO v2.1 and internal industrial datasets.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Large Reasoning Models (LRMs) have expanded the mathematical reasoning frontier through Chain-of-Thought (CoT) techniques and Reinforcement Learning with Verifiable Rewards (RLVR), capable of solving AIME-level problems. However, the performance of LRMs is heavily dependent on the extended reasoning context length. For solving ultra-hard problems like those in the International Mathematical Olympiad (IMO), the required reasoning complexity surpasses the space that an LRM can explore in a single round. Previous works attempt to extend the reasoning context of LRMs but remain prompt-based and built upon proprietary models, lacking systematic structures and training pipelines. Therefore, this paper introduces Intern-S1-MO, a long-horizon math agent that conducts multi-round hierarchical reasoning, composed of an LRM-based multi-agent system including reasoning, summary, and verification. By maintaining a compact memory in the form of lemmas, Intern-S1-MO can more freely explore the lemma-rich reasoning spaces in multiple reasoning stages, thereby breaking through the context constraints for IMO-level math problems. Furthermore, we propose OREAL-H, an RL framework for training the LRM using the online explored trajectories to simultaneously bootstrap the reasoning ability of LRM and elevate the overall performance of Intern-S1-MO. Experiments show that Intern-S1-MO can obtain 26 out of 35 points on the non-geometry problems of IMO2025, matching the performance of silver medalists. It also surpasses the current advanced LRMs on inference benchmarks such as HMMT2025, AIME2025, and CNMO2025. In addition, our agent officially participates in CMO2025 and achieves a score of 102/126 under the judgment of human experts, reaching the gold medal level.
Large Language Model Agent for Modular Task Execution in Drug Discovery
We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, literature-grounded question answering via retrieval-augmented generation, molecular generation, multi-property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. The agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 75 properties, including ADMET-related and general physicochemical descriptors, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (a misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements up to 460% and reductions in expected calibration error (ECE) of up to 90%. Despite general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
comment: Published in Transactions on Machine Learning Research (TMLR)
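The abstract above reports calibration in terms of ECE. As a reminder of what that metric measures, here is a minimal sketch of the common equal-width-bin estimator of expected calibration error. The bin count, function name, and example data are assumptions for illustration; the paper may use a different binning scheme.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: the sample-weighted average, over bins, of
    |accuracy in bin - mean confidence in bin|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical model that always reports 0.9 but is right 3 times in 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

An overconfident model shows up as bins whose mean confidence exceeds their accuracy; a perfectly calibrated one yields an ECE of zero.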
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
comment: Code: https://github.com/ontocord/mixturevitae
SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation EACL
We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized \texttt{Dialog} representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.
comment: Pre-print submitted to EACL System Demonstration (under review)
Grammar-Aligned Decoding NeurIPS 2024
Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup. Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint. Specifically, in grammar-constrained decoding (GCD), the LLM's output must follow a given grammar. In this paper, we demonstrate that GCD techniques (and in general constrained decoding techniques) can distort the LLM's distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM, and so ultimately are low-quality. We call the problem of aligning sampling with a grammar constraint, grammar-aligned decoding (GAD), and propose adaptive sampling with approximate expected futures (ASAp), a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM's distribution conditioned on the given grammar constraint. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. Our evaluation on code generation and structured NLP tasks shows how ASAp often produces outputs with higher likelihood (according to the LLM's distribution) than existing GCD techniques, while still enforcing the desired grammatical constraints.
comment: Accepted to NeurIPS 2024
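The distortion described in the abstract above can be seen on a toy example. The sketch below compares two exactly computed distributions: the language model's distribution conditioned on the grammar, and the distribution induced by greedy per-step masking and renormalization as in grammar-constrained decoding. The two-token model `TOY_LM` and the three-string `GRAMMAR` are invented for illustration, and the code does not implement ASAp.

```python
from typing import Dict

# Hypothetical toy LM: next-token probabilities keyed by the prefix so far.
TOY_LM: Dict[str, Dict[str, float]] = {
    "": {"a": 0.9, "b": 0.1},
    "a": {"a": 0.01, "b": 0.99},
    "b": {"a": 0.5, "b": 0.5},
}

# Hypothetical constraint: every length-2 string except "ab" is grammatical.
GRAMMAR = {"aa", "ba", "bb"}


def lm_prob(sequence: str) -> float:
    """Unconstrained probability of a full sequence under the toy LM."""
    p = 1.0
    for i, token in enumerate(sequence):
        p *= TOY_LM[sequence[:i]][token]
    return p


def conditional_distribution() -> Dict[str, float]:
    """The LM's distribution conditioned on the output being grammatical."""
    mass = {s: lm_prob(s) for s in GRAMMAR}
    z = sum(mass.values())
    return {s: m / z for s, m in mass.items()}


def gcd_distribution() -> Dict[str, float]:
    """Distribution induced by masking, at each step, tokens that cannot
    lead to any grammatical string and renormalising the remainder."""
    result: Dict[str, float] = {}

    def expand(prefix: str, p: float) -> None:
        if len(prefix) == 2:
            result[prefix] = p
            return
        allowed = {
            tok: q
            for tok, q in TOY_LM[prefix].items()
            if any(g.startswith(prefix + tok) for g in GRAMMAR)
        }
        z = sum(allowed.values())
        for tok, q in allowed.items():
            expand(prefix + tok, p * q / z)

    expand("", 1.0)
    return result


if __name__ == "__main__":
    print("conditioned on grammar:", conditional_distribution())
    print("greedy masking (GCD):  ", gcd_distribution())
```

In this toy setup the model strongly prefers "a" as the first token only because it expects "b" to follow. Masking "b" at the second step forces "aa", so greedy masking gives "aa" most of the probability mass even though it is the least likely grammatical string under the model's own conditional distribution.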
Human-Computer Interaction
Toward a Decision Support System for Energy-Efficient Ferry Operation on Lake Constance based on Optimal Control
The maritime sector is undergoing a disruptive technological change driven by three main factors: autonomy, decarbonization, and digital transformation. Addressing these factors necessitates a reassessment of inland vessel operations. This paper presents the design and development of a decision support system for ferry operations based on a shrinking-horizon optimal control framework. The problem formulation incorporates a mathematical model of the ferry's dynamics and environmental disturbances, specifically water currents and wind, which can significantly influence the dynamics. Real-world data and illustrative scenarios demonstrate the potential of the proposed system to effectively support ferry crews by providing real-time guidance. This enables enhanced operational efficiency while maintaining predefined maneuver durations. The findings suggest that optimal control applications hold substantial promise for advancing future ferry operations on inland waters. A video of the real-world ferry MS Insel Mainau operating on Lake Constance is available at: https://youtu.be/i1MjCdbEQyE
comment: 6 pages, 8 figures
The Influence of Human-like Appearance on Expected Robot Explanations
A robot's appearance is known to influence users' mental models and human-robot interaction, but its influence on the explanations people expect from robots has not been studied. In this study, we investigate whether and to what extent the human-like appearance of robots elicits anthropomorphism, which is conceptualised as an attribution of mental capacities, and how the level of anthropomorphism is revealed in explanations that people expect to receive. We designed a between-subject study comprising conditions with visual stimuli of three domestic service robots with varying human-like appearance, and we prompted respondents to provide explanations they would expect to receive from the robot for the same robot actions. We found that most explanations were anthropomorphic across all conditions. However, there is a positive correlation between the anthropomorphic explanations and human-like appearance. We also report on more nuanced trends observed in non-anthropomorphic explanations and trends in robot descriptions.
From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines
While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.
comment: 6 pages, 1 figure
Natural Language Interaction for Editing Visual Knowledge Graphs
Knowledge graphs are often visualized using node-link diagrams that reveal relationships and structure. In many applications using graphs, it is desirable to allow users to edit graphs to ensure data accuracy or provide updates. Commonly in graph visualization, users can interact directly with the visual elements by clicking and typing updates to specific items through traditional interaction methods in the graphical user interface. However, it can become tedious to make many updates due to the need to individually select and change numerous items in a graph. Our research investigates natural language input as an alternative method for editing network graphs. We present a user study comparing GUI graph editing with two natural language alternatives to contribute novel empirical data on the trade-offs of the different interaction methods. The findings show natural language methods to be significantly more effective than traditional GUI interaction.
From Verification Burden to Trusted Collaboration: Design Goals for LLM-Assisted Literature Reviews
Large Language Models (LLMs) are increasingly embedded in academic writing practices. Although numerous studies have explored how researchers employ these tools for scientific writing, their concrete implementation, limitations, and design challenges within the literature review process remain underexplored. In this paper, we report a user study with researchers across multiple disciplines to characterize current practices, benefits, and pain points in using LLMs to investigate related work. We identified three recurring gaps: (i) lack of trust in outputs, (ii) persistent verification burden, and (iii) the need for multiple tools. This motivates our proposal of six design goals and a high-level framework that operationalizes them through improved visualization of related papers, verification at every step, and human-feedback alignment with generation-guided explanations. Overall, by grounding our work in the practical, day-to-day needs of researchers, we designed a framework that addresses these limitations and models real-world LLM-assisted writing, advancing trust through verifiable actions and fostering practical collaboration between researchers and AI systems.
Say it or AI it: Evaluating Hands-Free Text Correction in Virtual Reality
Text entry in Virtual Reality (VR) is challenging, even when accounting for the use of controllers. Prior work has tackled this challenge head-on, improving the efficiency of input methods. These techniques have the advantage of allowing for relatively straightforward text correction. However, text correction without the use of controllers is a topic that has not received the same amount of attention, even though it can be desirable in several scenarios, and can even be the source of frustration. Large language models have been adopted and evaluated as a corrective methodology, given their strong predictive power. Nevertheless, their predictions are not always correct, which can lead to lower usability. In this paper, we investigate whether, for hands-free text correction in VR, the use of AI could surpass voice input in terms of usability and efficiency. We observed better usability for AI text correction when compared to voice input.
Mirror Skin: In Situ Visualization of Robot Touch Intent on Robotic Skin
Effective communication of robotic touch intent is a key factor in promoting safe and predictable physical human-robot interaction (pHRI). While intent communication has been widely studied, existing approaches lack the spatial specificity and semantic depth necessary to convey robot touch actions. We present Mirror Skin, a cephalopod-inspired concept that utilizes high-resolution, mirror-like visual feedback on robotic skin. By mapping in-situ visual representations of a human's body parts onto the corresponding robot's touch region, Mirror Skin communicates who shall initiate touch, where it will occur, and when it is imminent. To inform the design of Mirror Skin, we conducted a structured design exploration with experts in virtual reality (VR), iteratively refining six key dimensions. A subsequent controlled user study demonstrated that Mirror Skin significantly enhances accuracy and reduces response times for interpreting touch intent. These findings highlight the potential of visual feedback on robotic skin to communicate human-robot touch interactions.
Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining
Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated against a zero-shot VLM, in terms of per-slot accuracy, across multiple scenarios of incorrect G-code and HMI errors. Few-shot prompting improved the VLM's overall detection of HMI errors and of discrepancies with the G-code, supporting more comprehensive debugging. The proposed framework was therefore shown to be suitable for verifying the manually generated G-code that is typically developed in CNC training.
AI Autonomy or Human Dependency? Defining the Boundary in Responsible AI with the $α$-Coefficient
The integrity of contemporary AI systems is undermined by a critical design flaw: the misappropriation of Human-in-the-Loop (HITL) models to mask systems that are fundamentally reliant on human labor. We term this structural reliance Human-Instead-of-AI (HISOAI). HISOAI systems represent an ethical failure and an unsustainable economic dependency, where human workers function as hidden operational fallbacks rather than strategic collaborators. To rectify this, we propose the AI-First, Human-Empowered (AFHE) paradigm. AFHE mandates a technological design where the AI component must achieve a minimum, quantifiable level of functional independence prior to deployment. This standard is formalized through the AI Autonomy Coefficient (alpha), a metric that determines the proportion of tasks that the AI successfully processes without mandatory human substitution. We introduce the AFHE Deployment Algorithm, an algorithmic gate that requires the system to meet a specified alpha threshold across both offline and shadow testing. By enforcing this structural separation, the AFHE framework redefines the human's role to focus exclusively on high-value tasks, including ethical oversight, boundary pushing, and strategic model tuning, thereby ensuring true system transparency and operational independence. This work advocates for a critical shift toward metric-driven, structurally sound AI architecture, moving the industry beyond deceptive human dependency toward verifiable autonomy.
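The abstract above defines the AI Autonomy Coefficient as the proportion of tasks the AI processes successfully without mandatory human substitution, and describes a gate that requires a threshold to be met in both offline and shadow testing. A minimal sketch of that arithmetic follows. The function names, the boolean task encoding, and the 0.8 threshold are assumptions for illustration, not the paper's AFHE Deployment Algorithm.

```python
from typing import Sequence


def autonomy_coefficient(handled_by_ai: Sequence[bool]) -> float:
    """Fraction of tasks completed by the AI without human substitution.

    Each entry is True if the AI handled that task on its own.
    """
    if not handled_by_ai:
        raise ValueError("need at least one task outcome")
    return sum(1 for ok in handled_by_ai if ok) / len(handled_by_ai)


def passes_deployment_gate(
    offline: Sequence[bool],
    shadow: Sequence[bool],
    alpha_min: float,
) -> bool:
    """Allow deployment only if alpha meets the threshold in BOTH phases."""
    return (
        autonomy_coefficient(offline) >= alpha_min
        and autonomy_coefficient(shadow) >= alpha_min
    )


if __name__ == "__main__":
    # Hypothetical outcomes: 9 of 10 offline tasks and 7 of 10 shadow tasks
    # were handled without a human stepping in.
    offline_runs = [True] * 9 + [False]
    shadow_runs = [True] * 7 + [False] * 3
    print(autonomy_coefficient(offline_runs))
    print(autonomy_coefficient(shadow_runs))
    print(passes_deployment_gate(offline_runs, shadow_runs, alpha_min=0.8))
```

Requiring the threshold in both phases means a system that looks autonomous offline but leans on human fallbacks under live shadow traffic is still blocked.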
Words to Describe What I'm Feeling: Exploring the Potential of AI Agents for High Subjectivity Decisions in Advance Care Planning
Serious illness can deprive patients of the capacity to speak for themselves. As populations age and caregiver networks shrink, the need for reliable support in Advance Care Planning (ACP) grows. To probe this fraught design space of using proxy agents for high-risk, high-subjectivity decisions, we built an experience prototype and asked 15 participants in 4 workshops to train it to be their personal proxy in ACP decisions. We analysed their coping strategies and feature requests and mapped the results onto axes of agent autonomy and human control. Our findings argue for a potential new role of AI in ACP where agents act as personal advocates for individuals, building mutual intelligibility over time. We conclude with design recommendations to balance the risks and benefits of such an agent.
comment: 31 pages, 10 figures
Breast-Rehab: A Postoperative Breast Cancer Rehabilitation Training Assessment System Based on Human Action Recognition
Postoperative upper limb dysfunction is prevalent among breast cancer survivors, yet their adherence to at-home rehabilitation exercises is low amidst limited nursing resources. The hardware overhead of commonly adopted VR-based mHealth solutions further hinders their widespread clinical application. Therefore, we developed Breast-Rehab, a novel, low-cost mHealth system to provide patients with out-of-hospital upper limb rehabilitation management. Breast-Rehab integrates a bespoke human action recognition algorithm with a retrieval-augmented generation (RAG) framework. By fusing visual and 3D skeletal data, our model accurately segments exercise videos recorded in uncontrolled home environments, outperforming standard models. These segmented clips, combined with a domain-specific knowledge base, guide a multi-modal large language model to generate clinically relevant assessment reports. This approach significantly reduces computational overhead and mitigates model hallucinations. We implemented the system as a WeChat Mini Program and a nurse-facing dashboard. A preliminary clinical study validated the system's feasibility and user acceptance, with patients achieving an average exercise frequency of 0.59 sessions/day over a two-week period. This work thus presents a complete, validated pipeline for AI-driven, at-home rehabilitation monitoring.
comment: Accepted by CHIP 2025
From "Thumbs Up" to "10 out of 10": Reconsidering Scalar Feedback in Interactive Reinforcement Learning
Learning from human feedback is an effective way to improve robotic learning in exploration-heavy tasks. Compared to the wide application of binary human feedback, scalar human feedback has been used less because it is believed to be noisy and unstable. In this paper, we compare scalar and binary feedback, and demonstrate that scalar feedback benefits learning when properly handled. We collected binary or scalar feedback respectively from two groups of crowdworkers on a robot task. We found that when considering how consistently a participant labeled the same data, scalar feedback led to less consistency than binary feedback; however, the difference vanishes if small mismatches are allowed. Additionally, scalar and binary feedback show no significant differences in their correlations with key Reinforcement Learning targets. We then introduce Stabilizing TEacher Assessment DYnamics (STEADY) to improve learning from scalar feedback. Based on the idea that scalar feedback is multi-distributional, STEADY reconstructs underlying positive and negative feedback distributions and rescales scalar feedback based on feedback statistics. We show that models trained with scalar feedback + STEADY outperform baselines, including binary feedback and raw scalar feedback, in a robot reaching task with non-expert human feedback. Our results show that both binary feedback and scalar feedback are dynamic, and scalar feedback is a promising signal for use in interactive Reinforcement Learning.
HEDN: A Hard-Easy Dual Network with Source Reliability Assessment for Cross-Subject EEG Emotion Recognition
Cross-subject electroencephalography (EEG) emotion recognition remains a major challenge in brain-computer interfaces (BCIs) due to substantial inter-subject variability. Multi-Source Domain Adaptation (MSDA) offers a potential solution, but existing MSDA frameworks typically assume equal source quality, leading to negative transfer from low-reliability domains and prohibitive computational overhead due to multi-branch model designs. To address these limitations, we propose the Hard-Easy Dual Network (HEDN), a lightweight reliability-aware MSDA framework. HEDN introduces a novel Source Reliability Assessment (SRA) mechanism that dynamically evaluates the structural integrity of each source domain during training. Based on this assessment, sources are routed to two specialized branches: an Easy Network that exploits high-quality sources to construct fine-grained, structure-aware prototypes for reliable pseudo-label generation, and a Hard Network that utilizes adversarial training to refine and align low-quality sources. Furthermore, a cross-network consistency loss aligns predictions between branches to preserve semantic coherence. Extensive experiments conducted on SEED, SEED-IV, and DEAP datasets demonstrate that HEDN achieves state-of-the-art performance across both cross-subject and cross-dataset evaluation protocols while reducing adaptation complexity.
comment: This work has been submitted to the IEEE for possible publication
Fast Multi-Party Open-Ended Conversation with a Social Robot
Multi-party open-ended conversation remains a major challenge in human-robot interaction, particularly when robots must recognise speakers, allocate turns, and respond coherently under overlapping or rapidly shifting dialogue. This paper presents a multi-party conversational system that combines multimodal perception (voice direction of arrival, speaker diarisation, face recognition) with a large language model for response generation. Implemented on the Furhat robot, the system was evaluated with 30 participants across two scenarios: (i) parallel, separate conversations and (ii) shared group discussion. Results show that the system maintains coherent and engaging conversations, achieving high addressee accuracy in parallel settings (92.6%) and strong face recognition reliability (80-94%). Participants reported clear social presence and positive engagement, although technical barriers such as audio-based speaker recognition errors and response latency affected the fluidity of group interactions. The results highlight both the promise and limitations of LLM-based multi-party interaction and outline concrete directions for improving multimodal cue integration and responsiveness in future social robots.
comment: 15 pages, 5 figures, 4 tables; 2 appendices
Prototyping and Evaluating a Real-time Neuro-Adaptive Virtual Reality Flight Training System
Real-time adjustments to task difficulty during flight training are crucial for optimizing performance and managing pilot workload. This study evaluated the functionality of a pre-trained brain-computer interface (BCI) that adapts training difficulty based on real-time estimations of workload from brain signals. Specifically, an EEG-based neuro-adaptive training system was developed and tested in Virtual Reality (VR) flight simulations with military student pilots. The neuro-adaptive system was compared to a fixed sequence that progressively increased in difficulty, in terms of self-reported user engagement, workload, and simulator sickness (subjective measures), as well as flight performance (objective metric). Additionally, we explored the relationships between subjective workload and flight performance in the VR simulator for each condition. The experiments concluded with semi-structured interviews to elicit the pilots' experience with the neuro-adaptive prototype. Results revealed no significant differences between the adaptive and fixed sequence conditions in subjective measures or flight performance. In both conditions, flight performance decreased as subjective workload increased. The semi-structured interviews indicated that, upon briefing, the pilots preferred the neuro-adaptive VR training system over the system with a fixed sequence, although individual differences were observed in the perception of difficulty and the order of changes in difficulty. Even though this study shows performance does not change, BCI-based flight training systems hold the potential to provide a more personalized and varied training experience.
Social Mediation through Robots -- A Scoping Review on Improving Group Interactions through Directed Robot Action using an Extended Group Process Model
Group processes refer to the dynamics that occur within a group and are critical for understanding how groups function. With robots being increasingly placed within small groups, improving these processes has emerged as an important application of social robotics. Social Mediation Robots elicit behavioral change within groups by deliberately influencing the processes of groups. While research in this field has demonstrated that robots can effectively affect interpersonal dynamics, there is a notable gap in integrating these insights to develop coherent understanding and theory. We present a scoping review of literature targeting changes in social interactions between multiple humans through intentional action from robotic agents. To guide our review, we adapt the classical Input-Process-Output (I-P-O) model into what we call the "Mediation I-P-O model". We evaluated 1633 publications, which yielded 89 distinct social mediation concepts. We construct 11 mediation approaches robots can use to shape processes in small groups and teams. This work strives to produce generalizable insights and evaluate the extent to which the potential of social mediation through robots has been realized thus far. We hope that the proposed framework encourages a holistic approach to the study of social mediation and provides a foundation to standardize future reporting in the domain.
comment: Early version of the published journal paper: Weisswange, T. H., Javed, H., Dietrich, M., Jung, M. F., & Jamali, N. (2025). Design Implications for Robots that Facilitate Groups -- A Scoping Review on Improving Group Interactions through Directed Robot Action. ACM Transactions on Human-Robot Interaction
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces
Electroencephalography (EEG) is a fundamental modality for cognitive state monitoring in brain-computer interfaces (BCIs). However, it is highly susceptible to intrinsic signal errors and human-induced labeling errors, which lead to label noise and ultimately degrade model performance. To enhance EEG learning, multimodal knowledge distillation (KD) has been explored to transfer knowledge from visual models with rich representations to EEG-based models. Nevertheless, KD faces two key challenges: modality gap and soft label misalignment. The former arises from the heterogeneous nature of EEG and visual feature spaces, while the latter stems from label inconsistencies that create discrepancies between ground truth labels and distillation targets. This paper addresses semantic uncertainty caused by ambiguous features and weakly defined labels. We propose a novel cross-modal knowledge distillation framework that mitigates both modality and label inconsistencies. It aligns feature semantics through a prototype-based similarity module and introduces a task-specific distillation head to resolve label-induced inconsistency in supervision. Experimental results demonstrate that our approach improves EEG-based emotion regression and classification performance, outperforming both unimodal and multimodal baselines on a public multimodal dataset. These findings highlight the potential of our framework for BCI applications.
comment: Accepted to SMC 2025
Reject or Not?: A Benchmark for Voice Assistant Query Rejection in Smart Home Scenario and an Improved Method Based on LLMs
In smart-home voice assistant scenarios, deciding whether to accept or reject a user query is the first step before any downstream processing. To address the limited query-rejection capability of current voice assistants, this paper presents the first Chinese-oriented open-source benchmark and evaluation suite for smart homes, together with a personalized query-rejection method based on large language models. On the data side, we construct the first multimodal query-rejection dataset tailored for domestic scenarios, containing 11,913 manually labeled text-speech pairs that systematically cover twelve typical dialogue types (e.g., chit-chat, non-human sounds, valid commands, ambiguous references, device-irrelevant requests). Fine-grained labels, conversational context and multi-turn information are provided to support both zero-shot and fine-tuning evaluations across language and multimodal large models. On the method side, we propose a three-tier collaborative architecture: first, a Qwen-2.5-3B adapter fine-tuned to model family-agnostic semantic boundaries; second, a dynamic household-level historical dialogue module to capture personalized habits; third, a household-specific RAG knowledge base that explicitly memorizes and revises past false-rejection cases. Experiments show that the proposed approach significantly outperforms zero-shot and fine-tuned general LLMs on the constructed dataset, with pronounced gains in rejection accuracy for family-specific expressions and complex multi-turn scenarios. This work provides a reproducible data foundation, evaluation standard and extensible technical framework for reliability research in smart-home voice interaction.
Do You "Trust" This Visualization? An Inventory to Measure Trust in Visualizations
Trust plays a critical role in visual data communication and decision-making, yet existing visualization research employs varied trust measures, making it challenging to compare and synthesize findings across studies. In this work, we first took a bottom-up, data-driven approach to understand what visualization readers mean when they say they "trust" a visualization. We compiled and adapted a broad set of trust-related statements from existing inventories and collected responses to visualizations with varying degrees of trustworthiness. Through exploratory factor analysis, we derived an operational definition of trust in visualizations. Our findings indicate that people perceive a trustworthy visualization as one that presents credible information and is comprehensible and usable. Building on this insight, we developed an eight-item inventory: four core items measuring trust in visualizations and four optional items controlling for individual differences in baseline trust tendency. We established the inventory's internal consistency reliability using McDonald's omega, confirmed its content validity by demonstrating alignment with theoretically-grounded trust dimensions, and validated its criterion validity through two trust games with real-world stakes. Finally, we illustrate how this standardized inventory can be applied across diverse visualization research contexts. Utilizing our inventory, future research can examine how design choices, tasks, and domains influence trust, and how to foster appropriate trusting behavior in human-data interactions.
comment: Accepted to IEEE Transactions on Visualization and Computer Graphics
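The trust-inventory abstract above reports internal consistency via McDonald's omega. As a minimal sketch of how that coefficient is computed, the snippet below assumes a single-factor model with standardized loadings, so each item's unique variance is 1 minus its squared loading. The function name and the loadings are hypothetical and are not values from the paper.

```python
def mcdonald_omega(loadings):
    """Omega total for a one-factor model with standardized loadings.

    Assumption: each item's unique variance is 1 - loading**2.
    """
    common = sum(loadings) ** 2                  # variance explained by the factor
    unique = sum(1.0 - l * l for l in loadings)  # summed item-specific variance
    return common / (common + unique)


# Hypothetical loadings for four core items (illustrative only).
example_loadings = [0.8, 0.7, 0.75, 0.85]
omega = mcdonald_omega(example_loadings)
```

Higher loadings push omega toward 1; with these illustrative loadings the value lands in the mid 0.8 range.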
Computer Vision and Pattern Recognition
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
comment: Project page: https://hf.co/spaces/prs-eth/stereospace_web
WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
comment: Preprint; 80 pages, 37 figures, 29 tables; Project Page at https://worldbench.github.io/worldlens
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.
comment: Project page: https://idea-research.github.io/SceneMaker/
Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
comment: Project Page: https://www.cs.virginia.edu/~tsx4zn/stereowalk/
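The StereoWalker abstract above argues that stereo inputs resolve the depth-scale ambiguity. The textbook reason is rectified-stereo triangulation: with a known baseline, disparity maps directly to metric depth. The sketch below shows only that standard relation and is not StereoWalker's pipeline; the function name and the camera numbers are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Metric depth for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 12 cm baseline, 21 px disparity.
depth_m = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=21.0)
```

With these illustrative numbers the point lies roughly 4 m away. A monocular camera has no baseline term, which is why its depth is recoverable only up to an unknown scale.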
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
comment: Project page: https://snap-research.github.io/omni-attribute
Bidirectional Normalizing Flow: From Data to Noise and Back
Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow (BiFlow), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation ("1-NFE") methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.
comment: Tech report
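For context on the BiFlow abstract above, the snippet below sketches the standard setting it departs from: a flow whose reverse process is the exact analytic inverse of the forward one. It uses a one-dimensional affine flow with a standard normal base distribution. This is a textbook illustration, not BiFlow itself, and all names and parameter values are hypothetical.

```python
import math


def forward(x, mu, sigma):
    """Data -> noise, returning z and log|dz/dx| for the change of variables."""
    z = (x - mu) / sigma
    log_det = -math.log(sigma)
    return z, log_det


def inverse(z, mu, sigma):
    """Noise -> data: the exact analytic inverse of forward()."""
    return mu + sigma * z


def log_likelihood(x, mu, sigma):
    """log p(x) = log N(z; 0, 1) + log|dz/dx|."""
    z, log_det = forward(x, mu, sigma)
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
    return log_base + log_det


# Round trip with hypothetical parameters: data -> noise -> data.
z, _ = forward(3.0, mu=1.0, sigma=2.0)
x_back = inverse(z, mu=1.0, sigma=2.0)
```

In a learned-reverse setting such as the one the abstract describes, `inverse` would be replaced by a separately trained model that only approximates this mapping.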
Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
comment: Project Page: https://sichengmo.github.io/GroupDiff/
E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
comment: Project website: https://qitaozhao.github.io/E-RayZer
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
comment: Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1
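The text-to-3D RL abstract above studies GRPO variants. As a minimal sketch of the group-relative idea behind GRPO, the snippet below normalizes the rewards of several samples drawn for the same prompt by the group mean and standard deviation. The paper's variants, token-level weighting, and Hi-GRPO reward ensembles are not reproduced here. The use of the population standard deviation, the `eps` constant, and the reward values are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sample relative to its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four 3D generations from one text prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples scoring above the group mean receive positive advantages and are reinforced; no separate value network is needed, which is the usual motivation for this normalization.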
ClusIR: Towards Cluster-Guided All-in-One Image Restoration
All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.
Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving
We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.
comment: Project Page: https://jiawei-yang.github.io/Flex/
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/
comment: IEEE TPAMI, Project Page: https://henghuiding.com/MeViS/
AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT
comment: Project page: https://snap-research.github.io/Video-AlcheMinT
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite having only 1.6B parameters.
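The selective decoding mentioned in the VL-JEPA abstract above can be pictured with a small sketch. The abstract does not state the selection rule, so the snippet assumes one plausible choice: invoke the text decoder only when the predicted embedding drifts away from the last decoded one, as measured by cosine similarity. The threshold, the function names, and the `decode_fn` callback are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def selective_decode(embeddings, decode_fn, threshold=0.9):
    """Decode only when the embedding changed enough (assumed rule)."""
    outputs, last_decoded, calls = [], None, 0
    for emb in embeddings:
        if last_decoded is None or cosine_similarity(emb, last_decoded) < threshold:
            outputs.append(decode_fn(emb))  # decoder call
            last_decoded = emb
            calls += 1
        else:
            outputs.append(outputs[-1])     # reuse the previous text
    return outputs, calls
```

On a stream where consecutive embeddings are nearly identical, `calls` stays well below the number of frames, which is the kind of saving the abstract reports relative to uniform decoding.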
Mull-Tokens: Modality-Agnostic Latent Thinking
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
comment: Project webpage: https://arijitray.com/multimodal_thinking/
OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis
Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/
comment: Project page: https://snap-research.github.io/OmniView/
GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.
comment: IEEE/CVF Winter Conference on Applications of Computer Vision 2026
Stronger Normalization-Free Transformers
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
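The Derf function defined in the abstract above is simple enough to sketch directly. The snippet evaluates erf(alpha * x + s) for scalars and checks the "rescaled Gaussian CDF" description through the identity erf(x) = 2 * Phi(sqrt(2) * x) - 1, where Phi is the standard normal CDF. Treating alpha and s as plain floats is an assumption made for illustration; in a network they would presumably be learnable parameters, and any further scaling the paper may apply is not shown.

```python
import math
from statistics import NormalDist


def derf(x, alpha=1.0, s=0.0):
    """Point-wise Derf(x) = erf(alpha * x + s), as given in the abstract."""
    return math.erf(alpha * x + s)


def erf_via_gaussian_cdf(x):
    """erf expressed through the standard normal CDF: 2 * Phi(sqrt(2) * x) - 1."""
    return 2.0 * NormalDist().cdf(math.sqrt(2.0) * x) - 1.0


# Like tanh, the output saturates smoothly in (-1, 1) for large |x|.
samples = [derf(x, alpha=0.5, s=0.1) for x in (-10.0, -1.0, 0.0, 1.0, 10.0)]
```

The bounded, monotonic shape is what lets such a function squash extreme activations, the role the abstract attributes to DyT and Derf in place of a normalization layer.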
Any4D: Unified Feed-Forward Metric 4D Reconstruction
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
comment: Project Website: https://any-4d.github.io/
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
comment: Code is available at https://github.com/Wolfv0/FoundationMotion/tree/main
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
comment: Project page: https://intchous.github.io/DuetSVG-site
PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction
Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.
comment: 15 pages, 7 figures
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
comment: Project page: https://animotionlab.github.io/MoCapAnything/
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation
Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
comment: Project page: https://animotionlab.github.io/SWIT4D/
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/
comment: Project page: https://windvchen.github.io/PoseGAM/
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
Self-Ensemble Post Learning for Noisy Domain Generalization
While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers, i.e., spurious feature enlargement, leading to a degradation in the performance of existing algorithms. This paper, starting from domain generalization, explores how to make existing methods work again when they encounter label noise. We find that the latent features inside the model have certain discriminative capabilities, and different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify the features that can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained through the integration of outputs from these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach. Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.
comment: 18 pages
Extrapolation of Periodic Functions Using Binary Encoding of Continuous Numerical Values
We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
comment: Submitted to JMLR, under review
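The abstract does not spell out how NB2E is computed, so the following is only a guess at the general idea: normalize a value into [0, 1) and keep the leading bits of its binary fraction. The function name `nb2e_encode`, the `x_min`/`x_max` normalization bounds, and the bit count are all hypothetical choices made for illustration.

```python
def nb2e_encode(x, x_min, x_max, n_bits=16):
    """Encode x as the first n_bits of its normalized binary fraction.

    Assumption: normalization maps [x_min, x_max) onto [0, 1); the
    paper's actual NB2E definition may differ.
    """
    if not (x_min <= x < x_max):
        raise ValueError("x must lie in [x_min, x_max)")
    frac = (x - x_min) / (x_max - x_min)
    bits = []
    for _ in range(n_bits):
        frac *= 2.0
        bit = int(frac)  # 0 or 1, the next binary digit
        bits.append(bit)
        frac -= bit
    return bits


if __name__ == "__main__":
    print(nb2e_encode(0.125, 0.0, 1.0, n_bits=4))
    print(nb2e_encode(0.625, 0.0, 1.0, n_bits=4))
```

Under this assumed encoding, each bit is a square wave in `x`, with the period halving from one bit to the next, so inputs that differ by half the range share every bit except the first. That kind of position-independent structure is one plausible reading of the "bit-phase representations" mentioned above.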
Graph Laplacian Transformer with Progressive Sampling for Prostate Cancer Grading
Prostate cancer grading from whole-slide images (WSIs) remains a challenging task due to the large-scale nature of WSIs, the presence of heterogeneous tissue structures, and the difficulty of selecting diagnostically relevant regions. Existing approaches often rely on random or static patch selection, leading to the inclusion of redundant or non-informative regions that degrade performance. To address this, we propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency. The IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring, ensuring only the most relevant tissue regions are preserved. The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints and refining feature representations via a learnable filtering mechanism that enhances discriminative histological structures. Additionally, a convex aggregation mechanism dynamically adjusts patch importance to generate a robust WSI-level representation. Extensive experiments on five public and one private dataset demonstrate that our model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.
comment: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2025
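To make the phrase "graph Laplacian constraints" concrete, here is a small sketch of the standard unnormalized Laplacian L = D - A on a patch graph, together with the smoothness penalty x^T L x. How GLAT actually builds its graph or weights this term is not stated in the abstract; the adjacency matrix and function names below are illustrative assumptions.

```python
def graph_laplacian(adj):
    """Return L = D - A for a symmetric adjacency matrix without self-loops."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness(lap, x):
    """Compute x^T L x, which equals the sum of (x_i - x_j)^2 over edges."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Hypothetical chain of three neighbouring patches: 0 - 1 - 2.
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    lap = graph_laplacian(adj)
    print(lap)
    print(smoothness(lap, [1.0, 2.0, 4.0]))
```

The penalty is zero when neighbouring patches carry identical values and grows as adjacent patches disagree, which is the usual sense in which a Laplacian term encourages spatial consistency.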
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis yields two observations: (i) a majority of SAE neurons exhibit low interpretability, low steerability, or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE), a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.
What matters for Representation Alignment: Global Information or Spatial Structure?
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, and JiT). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
comment: Project page: https://end2end-diffusion.github.io/irepa
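The abstract describes spatial structure as the pairwise cosine similarity between patch tokens. The sketch below computes that similarity matrix for a handful of toy token vectors; it illustrates only the measurement, not iREPA's convolution projection or spatial normalization layer, and the function name is a hypothetical one.

```python
import math


def cosine_similarity_matrix(tokens):
    """Return the matrix of pairwise cosine similarities between token vectors.

    Assumes every token is a non-zero vector of the same length.
    """
    norms = [math.sqrt(sum(v * v for v in t)) for t in tokens]
    n = len(tokens)
    return [
        [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three toy "patch tokens"; the first and third point the same way.
    patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    for row in cosine_similarity_matrix(patches):
        print(row)
```

Two encoders can reach similar classification accuracy while producing very different matrices of this kind, which is the distinction between global information and spatial structure that the abstract draws.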
Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography
Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology-based CAD assessment. To address these challenges, we develop an end-to-end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA-derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state-of-the-art performance, with an $R^2$ of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.
comment: 19 pages, 9 figures
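For readers who want to see how the three reported regression metrics relate, here is a minimal sketch. The abstract does not say how the RMSE is normalized; dividing by the range of the targets is an assumption made here (dividing by the mean or standard deviation are other common conventions), and the sample values are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def nrmse(y_true, y_pred):
    """RMSE divided by the target range (an assumed normalization)."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up pressures purely for illustration.
    truth = [1.0, 2.0, 3.0, 4.0]
    pred = [1.1, 1.9, 3.2, 3.7]
    print(rmse(truth, pred), nrmse(truth, pred), r2_score(truth, pred))
```

An R^2 reported as 64.42% corresponds to `r2_score` returning about 0.6442, meaning the model explains roughly two thirds of the variance in the simulated pressures.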
LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation
Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.
comment: Work in progress
IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation
Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.
comment: 25 pages, 16 figures
Video Depth Propagation
Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.
CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images
Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing against manual ground truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (huggingface.co/datasets/mcosarinsky/CheXmask-U), a large-scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at huggingface.co/spaces/matiasky/CheXmask-U and the source code at github.com/mcosarinsky/CheXmask-U.
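The predictive uncertainty described in this abstract comes from decoding several latent samples and looking at how much the outputs disagree. A minimal sketch of that last step is given below, assuming the disagreement is summarized as a per-node, per-coordinate standard deviation; the function name, the data layout, and the toy landmark sets standing in for decoder outputs are all hypothetical.

```python
import statistics


def per_node_uncertainty(samples):
    """Return (std_x, std_y) per landmark across stochastic predictions.

    samples: list of predictions, each a list of (x, y) landmark tuples
    in a fixed node order. Population standard deviation is an assumed
    choice of spread measure.
    """
    n_nodes = len(samples[0])
    result = []
    for k in range(n_nodes):
        xs = [s[k][0] for s in samples]
        ys = [s[k][1] for s in samples]
        result.append((statistics.pstdev(xs), statistics.pstdev(ys)))
    return result


if __name__ == "__main__":
    # Two toy predictions of a two-landmark contour.
    draws = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 0.0), (1.0, 1.0)]]
    print(per_node_uncertainty(draws))
```

A node whose position is stable across samples gets a value near zero, while a node that jumps around is flagged as uncertain, which matches the per-node estimates the released dataset is said to provide.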
Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning
Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
comment: 10 pages main text (3 figures, 3 tables), 31 pages in total
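The group-relative advantage at the heart of GRPO, named in the abstract above, can be illustrated with a short sketch. It is a generic illustration, not the authors' code. It assumes each prompt yields a group of sampled responses with scalar rewards, and that the advantage is the reward standardized within that group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation and a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled reports on one chest X-ray.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.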
Sharp Monocular View Synthesis in Less Than a Second
We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp
comment: Code and weights available at https://github.com/apple/ml-sharp
Optimal transport unlocks end-to-end learning for single-molecule localization
Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores -- fluorescent molecules stained onto the observed specimen -- over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
Evaluating Gemini Robotics Policies in a Veo World Simulator
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching
Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance; factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: Similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
XDen-1K: A Density Field Dataset of Real-World Objects
A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.
comment: 10 pages, 7 figures
NaviHydra: Controllable Navigation-guided End-to-end Autonomous Driving with Hydra-distillation
The complexity of autonomous driving scenarios requires robust models that can interpret high-level navigation commands and generate safe trajectories. While traditional rule-based systems can react to these commands, they often struggle in dynamic environments, and end-to-end methods face challenges in complying with explicit navigation commands. To address this, we present NaviHydra, a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. Our framework accepts high-level navigation commands as control signals, generating trajectories that align with specified intentions. We utilize a Bird's Eye View (BEV) based trajectory gathering method to enhance the trajectory feature extraction. Additionally, we introduce a novel navigation compliance metric to evaluate adherence to the intended route, improving controllability and navigation safety. To comprehensively assess our model's controllability, we design a test that evaluates its response to various navigation commands. Our method significantly outperforms baseline models, achieving state-of-the-art results in the NAVSIM benchmark, demonstrating its effectiveness in advancing autonomous driving.
TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.
K-Track: Kalman-Enhanced Tracking for Accelerating Deep Point Trackers on Edge Devices
Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers' accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.
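The keyframe-plus-Kalman idea described in the K-Track abstract above can be sketched as follows. This is a minimal textbook constant-velocity Kalman filter for one coordinate of one point, not the K-Track implementation: the state model, the noise values, the keyframe interval of 5, and the `deep_tracker` stand-in are all assumptions made for illustration.

```python
class ConstantVelocityKalman1D:
    """Generic position/velocity Kalman filter for a single coordinate."""

    def __init__(self, pos, pos_var=1.0, vel_var=10.0, q=0.01, r=0.25):
        self.p, self.v = pos, 0.0  # state: position and velocity
        # Covariance matrix [[a, b], [c, d]]
        self.a, self.b, self.c, self.d = pos_var, 0.0, 0.0, vel_var
        self.q, self.r = q, r      # process and measurement noise (assumed)

    def predict(self, dt=1.0):
        """Cheap step used on intermediate frames: x <- F x, P <- F P F^T + Q."""
        self.p += self.v * dt
        a, b, c, d = self.a, self.b, self.c, self.d
        self.a = a + dt * (b + c) + dt * dt * d + self.q
        self.b = b + dt * d
        self.c = c + dt * d
        self.d = d + self.q
        return self.p

    def update(self, z):
        """Correction with a position measurement z from the deep tracker."""
        s = self.a + self.r                  # innovation variance
        k0, k1 = self.a / s, self.c / s      # Kalman gain
        y = z - self.p                       # innovation
        self.p += k0 * y
        self.v += k1 * y
        a, b, c, d = self.a, self.b, self.c, self.d
        self.a, self.b = (1 - k0) * a, (1 - k0) * b
        self.c, self.d = c - k1 * a, d - k1 * b
        return self.p


def deep_tracker(frame):
    """Hypothetical stand-in for an expensive deep point tracker."""
    return 2.0 * frame  # the simulated point moves 2 px per frame


KEYFRAME_INTERVAL = 5  # assumed; run the deep tracker on every 5th frame
kf = ConstantVelocityKalman1D(pos=deep_tracker(0))
estimates, deep_calls = [kf.p], 1
for frame in range(1, 21):
    kf.predict()
    if frame % KEYFRAME_INTERVAL == 0:
        kf.update(deep_tracker(frame))
        deep_calls += 1
    estimates.append(kf.p)
```

In this toy run the deep tracker is called on 5 of 21 frames. The filter has no velocity estimate until the second keyframe, so the first few intermediate predictions lag; afterwards the predictions follow the simulated motion closely.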
DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.
Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces
We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
Robust Multi-Disease Retinal Classification via Xception-Based Transfer Learning and W-Net Vessel Segmentation
In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the "black-box" limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model's predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.
Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos
We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.
Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
comment: 6 pages, 1 figure
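The text-to-text matching formulation in the abstract above can be illustrated with a small sketch: each image is represented only by a caption, a text query is embedded in the same space, and images are ranked by cosine similarity. A bag-of-words vector stands in for a real text embedding model here, which is an assumption made purely to keep the example self-contained. The captions, image identifiers, and function names are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real system would use a text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, captions):
    """Return image ids sorted by caption similarity to the text query."""
    q = embed(query)
    scores = {img: cosine(q, embed(cap)) for img, cap in captions.items()}
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(queries, captions, k):
    """Fraction of queries whose target image appears in the top k."""
    hits = sum(1 for q, target in queries if target in rank(q, captions)[:k])
    return hits / len(queries)


# Hypothetical stand-ins for VLM-generated captions.
captions = {
    "img_airport": "an airport with several planes parked near a long runway",
    "img_harbor": "a harbor with many boats docked along the pier",
    "img_farm": "green farmland divided into rectangular fields",
}
queries = [
    ("planes parked at an airport runway", "img_airport"),
    ("boats docked in a harbor", "img_harbor"),
]
r1 = recall_at_k(queries, captions, k=1)
```

Because both the query and the database live in text space, nothing in this pipeline is trained; retrieval quality depends on the captions and on the chosen text encoder.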
Salient Object Detection in Complex Weather Conditions via Noise Indicators
Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set), three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.
Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.
Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.
Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks
Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, indicating that the prototypical training strategy is more effective in data-scarce settings where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
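The prototype-based classification described in the abstract above can be sketched in a few lines. This is a generic prototypical-network inference step, not the authors' model: it assumes an encoder has already mapped each sign clip to an embedding vector, and it uses Euclidean distance. The 2-D embeddings, the labels, and the function names are hypothetical.

```python
import math


def build_prototypes(support):
    """Average the support embeddings of each class into one prototype.

    support: dict mapping a class label to a list of embedding vectors.
    """
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos


def classify(query, protos):
    """Assign the query embedding to the nearest prototype (Euclidean)."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# Hypothetical 2-D embeddings for a 2-way, 2-shot episode.
support = {
    "hello": [[0.0, 0.0], [0.0, 2.0]],
    "thanks": [[4.0, 4.0], [6.0, 4.0]],
}
protos = build_prototypes(support)
pred_a = classify([1.0, 1.0], protos)
pred_b = classify([4.0, 3.0], protos)
```

Adding a new sign only requires a few support embeddings to form its prototype, which is why this style of classifier suits long-tailed vocabularies.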
Grounding Everything in Tokens for Multimodal Large Language Models
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
comment: 19 pages, 16 figures, 12 Tables
Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
Mode-Seeking for Inverse Problems with Diffusion Models
A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.
Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA
Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS (CD-FSS). TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks--including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray--demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS--the encoder's generalization to novel classes--TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
3D Blood Pulsation Maps
We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset's illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.
comment: 9 pages (without references), supplementals attached, waiting for publication. In order to access the dataset, see https://github.com/KISMED-TUDa/pulse3dface
Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network
Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.
comment: Accepted to IJCV
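As a rough illustration of the focus-volume idea in the abstract above, the following sketch applies a one-dimensional second difference along several directions at several dilation steps, sums the absolute responses, and takes the per-pixel argmax over the focal stack. The kernel form, the direction set, the dilation values, and all function names are assumptions made for illustration. The paper's actual DDL kernels, the GRU-based refinement, and the convex upsampling are not reproduced here.

```python
import numpy as np

# Hypothetical direction set: horizontal, vertical, and the two diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def shift(img, dy, dx):
    """Return an array whose pixel (i, j) is img[i + dy, j + dx], edges replicated."""
    h, w = img.shape
    pad = max(abs(dy), abs(dx))
    padded = np.pad(img, pad, mode="edge")
    return padded[pad + dy : pad + dy + h, pad + dx : pad + dx + w]


def directional_dilated_laplacian(img, dilation=1):
    """Sum of absolute second differences along each direction at one dilation."""
    img = img.astype(np.float64)
    response = np.zeros_like(img)
    for dy, dx in DIRECTIONS:
        fwd = shift(img, dy * dilation, dx * dilation)
        bwd = shift(img, -dy * dilation, -dx * dilation)
        response += np.abs(fwd + bwd - 2.0 * img)
    return response


def focus_volume(stack, dilations=(1, 2, 4)):
    """Map a focal stack of shape (N, H, W) to a focus measure of the same shape."""
    return np.stack([
        sum(directional_dilated_laplacian(frame, d) for d in dilations)
        for frame in stack
    ])


def depth_index(stack):
    """Naive one-step aggregation: index of the sharpest slice per pixel."""
    return np.argmax(focus_volume(stack), axis=0)
```

The argmax in `depth_index` is the kind of simple one-step aggregation the abstract describes as prone to artifacts and noise; it is shown only as a baseline for what the recurrent module is meant to replace.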
Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment
Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and a quality-conditioned mixture-of-experts (QCMoE) module to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content, rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with state-of-the-art methods, while successfully eliminating error propagation.
An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time
This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
comment: 11 pages, 5 figures
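To make the sequence-matching step in the abstract above concrete, here is a minimal sketch of the standard Levenshtein dynamic program over frame-level pose labels, with a backtrace that reports where the observed sequence deviates from the prescribed one. The abstract does not say how the authors modify the distance, so this is the unmodified textbook algorithm. The function name, the operation labels ("wrong", "extra", "missing"), and the pose names in the usage example are hypothetical.

```python
def align_poses(observed, prescribed):
    """Standard Levenshtein DP with a backtrace.

    Returns (distance, ops). Each op is (kind, i, j), where i indexes
    `observed` and j indexes `prescribed`.
    """
    n, m = len(observed), len(prescribed)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if observed[i - 1] == prescribed[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # extra observed pose
                dist[i][j - 1] + 1,         # missing prescribed pose
                dist[i - 1][j - 1] + cost,  # match or wrong pose
            )

    # Walk back from the end to recover one optimal edit script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if observed[i - 1] == prescribed[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("wrong", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("extra", i - 1, j))
            i -= 1
        else:
            ops.append(("missing", i, j - 1))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


# Usage: the patient skipped the squat between two standing poses.
distance, ops = align_poses(["stand", "stand"], ["stand", "squat", "stand"])
```

The backtrace is what turns the distance into a localization of inaccuracies: each reported operation points at the position in the prescribed exercise where the deviation occurred.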
Neural Collapse in Test-Time Adaptation
Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment during adaptation, which worsens under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets that blend geometric proximity with predictive confidence to mitigate the impact of unreliable pseudo-labels. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.
comment: 10 pages, 8 figures
TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning
Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.
comment: 8 pages, 4 figures, 4 tables
Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.
MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos
The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
Adaptive Dual-Weighted Gravitational Point Cloud Denoising Method
High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dual-weight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points. Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house FMCW LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.
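The coarse pre-filtering stage described above (removing clearly isolated, low-density points before fine scoring) can be approximated by a plain kNN-distance outlier filter. The sketch below is that generic technique, not the authors' method: it uses brute-force pairwise distances instead of an octree, and the function name, the median-based threshold, and the `k` and `ratio` defaults are assumptions chosen for illustration. The gravitational scoring function is not sketched because the abstract does not define it.

```python
import numpy as np


def knn_density_filter(points, k=8, ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is large.

    A point is kept when its mean kNN distance is at most `ratio` times the
    median mean kNN distance over the whole cloud (an assumed threshold rule).
    Brute force O(N^2) memory, so only suitable for small clouds or per-leaf use.
    """
    points = np.asarray(points, dtype=np.float64)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # ignore each point's distance to itself
    knn = np.sort(dist, axis=1)[:, :k]      # k smallest distances per point
    mean_knn = knn.mean(axis=1)             # large value means low local density
    keep = mean_knn <= ratio * np.median(mean_knn)
    return points[keep], keep


# Usage: a flat 5 x 5 grid of surface points plus one far-away noise point.
g = np.arange(5) * 0.1
grid = np.array([(x, y, 0.0) for x in g for y in g])
cloud = np.vstack([grid, [[10.0, 10.0, 10.0]]])
filtered, keep_mask = knn_density_filter(cloud, k=4)
```

In the pipeline the abstract describes, a filter of this kind would run independently inside each octree leaf, which is what makes the stage easy to parallelize.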
Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies
Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching
Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED datasets, with improved matching precision and lower epipolar error compared to related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
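The final matching step named in the abstract, direct matching through cosine similarity thresholding, can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' pipeline: the descriptors stand in for per-patch embeddings from the adapted backbone, the mutual-nearest-neighbour check is an added assumption that the abstract does not mention, and the function name and the 0.8 threshold are hypothetical.

```python
import numpy as np


def match_by_cosine(desc_a, desc_b, threshold=0.8):
    """Match rows of desc_a to rows of desc_b by thresholded cosine similarity.

    desc_a: (Na, D) and desc_b: (Nb, D) arrays of non-zero descriptors.
    Returns a list of (index_a, index_b, similarity) for pairs that are
    mutual nearest neighbours and whose similarity reaches the threshold.
    """
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                    # (Na, Nb) cosine similarities
    best_b = sim.argmax(axis=1)      # best candidate in B for each row of A
    best_a = sim.argmax(axis=0)      # best candidate in A for each row of B
    matches = []
    for i, j in enumerate(best_b):
        if best_a[j] == i and sim[i, j] >= threshold:
            matches.append((i, int(j), float(sim[i, j])))
    return matches


# Usage with toy 2D descriptors.
frame_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
frame_b = np.array([[0.0, 2.0], [3.0, 0.0]])
pairs = match_by_cosine(frame_a, frame_b)
```

Because the descriptors are normalized first, the threshold is scale-invariant, which is why a single fixed value can be applied across image pairs.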
RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds AAAI
Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.
comment: Accepted by AAAI
Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views
3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.
comment: 20 pages, 14 figures
Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos
Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task NeurIPS 2025
The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution on complex and reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and ensure harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
comment: Accepted by NeurIPS 2025 main track
mmCounter: Static People Counting in Dense Indoor Scenarios Using mmWave Radar
mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three square meter space, such that there is no side-by-side spacing and only a one-meter front-to-back distance.
comment: Accepted at the 22nd International Conference on Embedded Wireless Systems and Networks (EWSN 2025)
Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation
Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.
Topology-Agnostic Animal Motion Generation from Text Prompt
Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods - the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.
comment: 10 pages, 7 figures. Conference submission
CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution
Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate it into diffusion models via classifier-free guidance (CFG) and propose classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images
Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, acquiring high-quality pixel-level annotated datasets for filamentous structures remains a major challenge, as the dense distribution and geometric properties of filaments make manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structural similarity when generating synthetic images. Our experiments demonstrate the effectiveness of our approach, which outperforms existing models trained without synthetic data.
Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering
Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.
StainNet: A Special Staining Self-Supervised Vision Transformer for Computational Pathology
Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recent larger PFMs to further highlight its strengths. We have released the StainNet model weights at: https://huggingface.co/JWonderLand/StainNet.
comment: 15 pages, 6 figures
EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
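The single-pool reduction described above can be illustrated with a minimal sketch. It assumes every token already carries a scalar saliency score; how CS2 actually scores tokens is not stated in the abstract, so the function name `select_single_pool` and the toy scores are hypothetical. The point is only that one shared budget lets the split between audio and video fall out of the scores instead of being fixed in advance.

```python
from typing import List, Tuple


def select_single_pool(
    audio_scores: List[float],
    video_scores: List[float],
    keep_ratio: float,
) -> Tuple[List[int], List[int]]:
    """Keep the top-scoring tokens from one joint audio-visual pool.

    Hypothetical helper: the scores are assumed to come from some
    cross-modal scoring module that is not modelled here.
    """
    # Tag each token with its modality so both share a single ranking.
    pool = [("audio", i, s) for i, s in enumerate(audio_scores)]
    pool += [("video", i, s) for i, s in enumerate(video_scores)]

    # One budget for the whole pool rather than one per modality.
    budget = max(1, round(keep_ratio * len(pool)))
    kept = sorted(pool, key=lambda t: t[2], reverse=True)[:budget]

    # Return indices in their original (temporal) order per modality.
    audio_kept = sorted(i for m, i, _ in kept if m == "audio")
    video_kept = sorted(i for m, i, _ in kept if m == "video")
    return audio_kept, video_kept


# Toy example: video happens to be more informative in this clip.
audio_scores = [0.9, 0.1, 0.2, 0.05]
video_scores = [0.8, 0.7, 0.6, 0.3, 0.1, 0.05]
audio_kept, video_kept = select_single_pool(audio_scores, video_scores, keep_ratio=0.4)
```

With a 40% budget over ten tokens, this toy input keeps one audio token and three video tokens, whereas a fixed per-modality budget would have forced a predetermined split.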
Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset WACV 2026
We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.
comment: WACV 2026 camera ready
Design of a six wheel suspension and a three-axis linear actuation mechanism for a laser weeding robot
Mobile robots are increasingly utilized in agriculture to automate labor-intensive tasks such as weeding, sowing, harvesting and soil analysis. Recently, agricultural robots have been developed to detect and remove weeds using mechanical tools or precise herbicide sprays. Mechanical weeding is inefficient over large fields, and herbicides harm the soil ecosystem. Laser weeding with mobile robots has emerged as a sustainable alternative in precision farming. In this paper, we present an autonomous weeding robot that uses controlled exposure to a low energy laser beam for weed removal. The proposed robot is six-wheeled with a novel double four-bar suspension for higher stability. The laser is guided towards the detected weeds by a three-dimensional linear actuation mechanism. Field tests have demonstrated the robot's capability to navigate agricultural terrains effectively by overcoming obstacles up to 15 cm in height. At an optimal speed of 42.5 cm/s, the robot achieves a weed detection rate of 86.2% and operating time of 87 seconds per meter. The laser actuation mechanism maintains a minimal mean positional error of 1.54 mm, combined with a high hit rate of 97%, ensuring effective and accurate weed removal. This combination of speed, accuracy, and efficiency highlights the robot's potential for significantly enhancing precision farming practices.
comment: 15 Pages, 10 figures
ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation
Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.
DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation
Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
Efficient-VLN: A Training-Efficient Vision-Language Navigation Model
Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context AAAI 2025
Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work by investigating the effect of including the global context, which provides practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.
comment: Accepted in AAAI 2025 Workshop on Reproducible AI
Autoencoder-based Semi-Supervised Dimensionality Reduction and Clustering for Scientific Ensembles
Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches.
comment: Research Internship Project
Learning complete and explainable visual representations from itemized text supervision
Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable. Our code is available at https://github.com/MLNeurosurg/ItemizedCLIP.
Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/
Learning from a Generative Oracle: Domain Adaptation for Restoration
Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.
Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization
Foundation models (FMs) have demonstrated strong performance across diverse pathology tasks. While there are similarities in the pre-training objectives of FMs, there is still limited understanding of their complementarity, redundancy in embedding spaces, or biological interpretation of features. In this study, we propose an information-driven, intelligent fusion strategy for integrating multiple pathology FMs into a unified representation and systematically evaluate its performance for cancer grading and staging across three distinct diseases. Diagnostic H&E whole-slide images from kidney (519 slides), prostate (490 slides), and rectal (200 slides) cancers were dichotomized into low versus high grade or stage. Both tile-level FMs (Conch v1.5, MUSK, Virchow2, H-Optimus1, Prov-Gigapath) and slide-level FMs (TITAN, CHIEF, MADELEINE) were considered to train downstream classifiers. We then evaluated three FM fusion schemes at both tile and slide levels: majority-vote ensembling, naive feature concatenation, and intelligent fusion based on correlation-guided pruning of redundant features. Under patient-stratified cross-validation with hold-out testing, intelligent fusion of tile-level embeddings yielded consistent gains in classification performance across all three cancers compared with the best single FMs and naive fusion. Global similarity metrics revealed substantial alignment of FM embedding spaces, contrasted by lower local neighborhood agreement, indicating complementary fine-grained information across FMs. Attention maps showed that intelligent fusion yielded concentrated attention on tumor regions while reducing spurious focus on benign regions. Our findings suggest that intelligent, correlation-guided fusion of pathology FMs can yield compact, task-tailored representations that enhance both predictive performance and interpretability in downstream computational pathology tasks.
comment: 29 Pages, 10 figures
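As a rough illustration of what correlation-guided pruning of redundant features can look like, here is a minimal sketch. The abstract does not give the actual selection rule, so the greedy order, the Pearson criterion, the 0.95 threshold, and the names `pearson` and `prune_correlated` are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / math.sqrt(vx * vy)


def prune_correlated(columns: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Greedy pruning (assumed rule): keep a feature only if its absolute
    correlation with every already-kept feature is below `threshold`."""
    kept: List[int] = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


# Toy concatenation of features from hypothetical models, one column per
# feature over four samples. Column 2 is almost a scaled copy of column 0.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.0, 2.0],
    [2.1, 4.0, 6.1, 8.0],
    [1.0, -1.0, 1.0, -1.0],
]
kept = prune_correlated(features, threshold=0.95)
```

In this toy input the near-duplicate column is dropped and the other three are retained, which is the sense in which the fused representation ends up more compact than naive concatenation.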
VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending on encoder's hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.
comment: 8 pages
Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description
Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.
EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation
All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent WACV 2026
There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
comment: Accepted at WACV 2026 Conference
From Generated Human Videos to Physically Plausible Robot Trajectories
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
comment: For project website, see https://genmimic.github.io
FE-MCFormer: An interpretable fault diagnosis framework for rotating machinery under strong noise based on time-frequency fusion transformer
Many fault diagnosis methods for rotating machines are based on discriminative features extracted from signals collected from key components such as bearings. However, under complex operating conditions, periodic impulsive characteristics in the signal related to weak fault information are often obscured by noise interference. Consequently, existing approaches struggle to learn interpretable fault-related features in such scenarios. This paper proposes a novel transformer framework (FE-MCFormer) to extract interpretable time-frequency features, with the aim of improving the fault detection accuracy and interpretability of rotating machines under strong noise. First, a Fourier adaptive reconstruction embedding layer is introduced as a global information encoder in the model. Subsequently, a time-frequency fusion module is designed to further improve the model's robustness and interpretability. The effectiveness of FE-MCFormer in machine fault diagnosis is validated through three case studies.
Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
comment: 10 pages, 3 tables
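The two ingredients named above, inverse frequency weighting and an accuracy gap across demographic groups, can be made concrete with a small sketch. The abstract does not give the differentiable MaxAccGap formulation or the exact weighting scheme, so what follows is only the plain, non-differentiable max-minus-min accuracy gap and one common inverse-frequency normalization; the function names and toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each group by N / (G * n_g), so rarer groups weigh more.

    This normalization is an assumption; it gives every group the same
    total weight and keeps the average sample weight equal to 1.
    """
    counts = Counter(groups)
    n_samples, n_groups = len(groups), len(counts)
    return {g: n_samples / (n_groups * c) for g, c in counts.items()}


def max_accuracy_gap(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> float:
    """Hard (non-differentiable) gap between best and worst group accuracy."""
    correct: Counter = Counter()
    total: Counter = Counter()
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    accuracies = [correct[g] / total[g] for g in total]
    return max(accuracies) - min(accuracies)


# Toy data: group "B" is under-represented and classified less accurately.
groups = ["A", "A", "A", "A", "A", "A", "B", "B"]
y_true = [1, 0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

weights = inverse_frequency_weights(groups)
gap = max_accuracy_gap(y_true, y_pred, groups)
```

A training objective would need a smooth surrogate of this gap (for example, one built on predicted probabilities) before it could be optimized end to end; that step is deliberately left out here because the source does not specify it.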
Equivariant Test-Time Training with Operator Sketching for Imaging Inverse Problems
Equivariant Imaging (EI) regularization has become the de-facto technique for unsupervised training of deep imaging networks, without any need for ground-truth data. Observing that the EI-based unsupervised training paradigm currently has significant computational redundancy, leading to inefficiency in high-dimensional applications, we propose a sketched EI regularization that leverages randomized sketching techniques for acceleration. We apply our sketched EI regularization to develop an accelerated deep internal learning framework, which can be efficiently applied for test-time network adaptation. Additionally, for network adaptation tasks, we propose a parameter-efficient approach to accelerate both EI and Sketched-EI via optimizing only the normalization layers. Our numerical study on X-ray CT and multicoil magnetic resonance image reconstruction tasks demonstrates that our approach can achieve significant computational acceleration over the standard EI counterpart, especially in test-time training tasks.
comment: 20 pages
Counting with Confidence: Accurate Pest Monitoring in Water Traps
Accurate monitoring of pest populations and tracking of their dynamic changes are crucial for precision agriculture decision-making. A common limitation in existing vision-based automatic pest counting research is that models are typically evaluated on datasets with ground truth but deployed in real-world scenarios without assessing the reliability of counting results due to the lack of ground truth. To this end, this paper proposes a method for comprehensively evaluating pest counting confidence in the image, based on information related to counting results and external environmental conditions. First, a pest detection network is used for pest detection and counting, extracting counting result-related information. Then, the pest images undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment. The changes in image clarity caused by stirring during image acquisition are quantified by calculating the average gradient magnitude. Notably, we design a hypothesis-driven multi-factor sensitivity analysis method to select the optimal image quality assessment and image complexity assessment methods, and we propose an adaptive DBSCAN clustering algorithm for pest distribution uniformity assessment. Finally, the obtained information related to counting results and external environmental conditions is input into a regression model for prediction, resulting in the final pest counting confidence. To the best of our knowledge, this is the first study dedicated to comprehensively evaluating counting confidence in counting tasks, and quantifying the relationship between influencing factors and counting confidence through a model. Experimental results show our method reduces MSE by 31.7% and improves R2 by 15.2% on the pest counting confidence test set, compared to the baseline built primarily on information related to counting results.
comment: Accepted to AGRICONTROL 2025. © 2025 the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND. Published in IFAC-PapersOnLine, DOI: 10.1016/j.ifacol.2025.11.792
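The clarity cue named in the abstract above, the average gradient magnitude, can be illustrated with a minimal sketch. The abstract does not say which gradient operator is used, so the forward-difference scheme and the function name `average_gradient_magnitude` are assumptions made purely for illustration.

```python
import math


def average_gradient_magnitude(image):
    """Mean of sqrt(gx^2 + gy^2) over a grayscale image given as a list of rows.

    Forward differences are an assumption; the paper may use another operator.
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    count = 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            gx = image[r][c + 1] - image[r][c]  # horizontal difference
            gy = image[r + 1][c] - image[r][c]  # vertical difference
            total += math.sqrt(gx * gx + gy * gy)
            count += 1
    return total / count if count else 0.0


# Hypothetical toy images: a high-contrast pattern and a low-contrast one.
sharp = [[0, 255, 0], [255, 0, 255], [0, 255, 0]]
blurry = [[120, 128, 120], [128, 120, 128], [120, 128, 120]]
print(average_gradient_magnitude(sharp) > average_gradient_magnitude(blurry))
```

A larger value indicates stronger local contrast, which is why a score of this kind can serve as a proxy for how much stirring has blurred the trap image.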
LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models
Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.
comment: 20 pages
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art methods on several benchmarks. Project Page: https://alberto-rota.github.io/UnReflectAnything/
Optimization-Guided Diffusion for Interactive Scene Generation
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events which are essential for this task are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
Panoramic Out-of-Distribution Segmentation
Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications such as autonomous driving and augmented reality. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to pixel distortions and background clutter. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. In addition, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate the superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
comment: Code and datasets will be available at https://github.com/MengfeiD/PanOoS
ExAct: A Video-Language Benchmark for Expert Action Analysis
We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/
Dual Cluster Contrastive learning for Object Re-Identification
Recently, cluster contrastive learning has been proven effective for object ReID by computing the contrastive loss between the individual features and the cluster memory. However, in existing methods that use individual features to momentum-update the cluster memory, the memory fluctuates across training examples, especially for outlier samples. Unlike the individual-based updating mechanism, the centroid-based updating mechanism, which applies the mean feature of each cluster to update the cluster memory, can reduce the impact of individual samples. Therefore, we formulate the individual-based and centroid-based updating mechanisms in a unified cluster contrastive framework, named Dual Cluster Contrastive framework (DCC), which maintains two types of memory banks: individual and centroid cluster memory banks. Specifically, the individual cluster memory considers just one individual at a time and takes a single update step. The centroid cluster memory applies the mean feature of each cluster to update the corresponding cluster memory. During optimization, besides the vanilla contrastive loss of each memory, a cross-view consistency constraint is applied to exchange the benefits of the two memories and generate a discriminative description for object ReID. Note that DCC can be easily applied to supervised or unsupervised object ReID by using ground-truth labels or generated pseudo-labels, respectively. Extensive experiments on three benchmarks (Market-1501, MSMT17, and VeRi-776) under supervised and unsupervised object ReID demonstrate the superiority of the proposed DCC.
comment: Some errors were found in the method and experimental work that need further correction and improvement
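The difference between the two memory-update rules described in the DCC abstract can be sketched as follows. This is an illustrative toy, not the authors' implementation: the momentum value, the absence of feature normalization, and all function names are assumptions.

```python
def momentum_update(memory, feature, momentum=0.9):
    """Blend one feature vector into a cluster memory vector."""
    return [momentum * m + (1.0 - momentum) * f for m, f in zip(memory, feature)]


def mean_feature(features):
    """Element-wise mean of a list of equally sized feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def update_individual(memory, features, momentum=0.9):
    """Individual-based rule: one momentum step per sample, in order."""
    for feature in features:
        memory = momentum_update(memory, feature, momentum)
    return memory


def update_centroid(memory, features, momentum=0.9):
    """Centroid-based rule: a single momentum step using the cluster mean."""
    return momentum_update(memory, mean_feature(features), momentum)


# Hypothetical cluster whose last sample is an outlier.
memory = [1.0, 0.0]
cluster = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
individual = update_individual(memory, cluster)
centroid = update_centroid(memory, cluster)
print(individual, centroid)
```

In this toy case the outlier pulls the individually updated memory further from its starting point than the centroid-updated one, which matches the abstract's argument that averaging dampens the effect of single samples.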
Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation NeurIPS 2025
Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon the frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation on these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves the state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo.
comment: Accepted by NeurIPS 2025
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding WACV 2026
We address Embodied Reference Understanding, the task of predicting the object a person in the scene refers to through pointing gesture and language. This requires multimodal reasoning over text, visual pointing cues, and scene context, yet existing methods often fail to fully exploit visual disambiguation signals. We also observe that while the referent often aligns with the head-to-fingertip direction, in many cases it aligns more closely with the wrist-to-fingertip direction, making a single-line assumption overly limiting. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To fuse their complementary strengths, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble guided by CLIP features. We further incorporate an auxiliary object center prediction head to enhance referent localization. We validate our approach on YouRefIt, achieving 75.0 mAP at 0.25 IoU, alongside state-of-the-art CLIP and C_D scores, and demonstrate its generality on unseen CAESAR and ISL Pointing, showing robust performance across benchmarks.
comment: Accepted by WACV 2026
Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherently conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns but also boosts both generation and understanding performance.
comment: Project page: https://zhengdian1.github.io/AIA-project/ Code: https://github.com/zhengdian1/AIA
Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data
Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 40 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.
comment: Technical Report; Project Page: https://haoningwu3639.github.io/SpatialScore
Thinking with Images via Self-Calling Agent
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task to atomic subtasks and invokes its virtual replicas, i.e. parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
comment: Code is available at https://github.com/YWenxi/think-with-images-through-self-calling
Object-centric proto-symbolic behavioural reasoning from pixels
Autonomous intelligent agents must bridge computational challenges at disparate levels of abstraction, from the low-level spaces of sensory input and motor commands to the high-level domain of abstract reasoning and planning. A key question in designing such agents is how best to instantiate the representational space that will interface between these two levels -- ideally without requiring supervision in the form of expensive data annotations. These objectives can be efficiently achieved by representing the world in terms of objects (grounded in perception and action). In this work, we present a novel, brain-inspired, deep-learning architecture that learns from pixels to interpret, control, and reason about its environment, using object-centric representations. We show the utility of our approach through tasks in synthetic environments that require a combination of (high-level) logical reasoning and (low-level) continuous control. Results show that the agent can learn emergent conditional behavioural reasoning, such as $(A \to B) \land (\neg A \to C)$, as well as logical composition $(A \to B) \land (A \to C) \vdash A \to (B \land C)$ and XOR operations, and successfully controls its environment to satisfy objectives deduced from these logical rules. The agent can adapt online to unexpected changes in its environment and is robust to mild violations of its world model, thanks to dynamic internal desired goal generation. While the present results are limited to synthetic settings (2D and 3D activated versions of dSprites), which fall short of real-world levels of complexity, the proposed architecture shows how to manipulate grounded object representations, as a key inductive bias for unsupervised learning, to enable behavioral reasoning.
comment: Accepted for publication in Neural Networks journal
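The logical composition rule quoted in the abstract above, $(A \to B) \land (A \to C) \vdash A \to (B \land C)$, can be checked by brute force over all truth assignments. The sketch below only verifies the propositional rule itself; it says nothing about how the agent learns it, and the helper names are hypothetical.

```python
from itertools import product


def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q


def entails(premise, conclusion, n_vars):
    """True if the conclusion holds in every assignment that satisfies the premise."""
    return all(
        conclusion(*values)
        for values in product([False, True], repeat=n_vars)
        if premise(*values)
    )


def premise(a, b, c):
    return implies(a, b) and implies(a, c)


def conclusion(a, b, c):
    return implies(a, b and c)


print(entails(premise, conclusion, 3))
```

Dropping the second conjunct of the premise breaks the entailment, since the assignment A true, B true, C false then satisfies the premise but not the conclusion.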
Perception-Inspired Color Space Design for Photo White Balance Editing WACV 2026
White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene's true colors. Currently, sRGB-based WB editing for post-ISP WB correction is widely used to address color constancy failures in the ISP pipeline when the original camera RAW is unavailable. However, additive color models (e.g., sRGB) are inherently limited by fixed nonlinear transformations and entangled color channels, which often impede their generalization to complex lighting conditions. To address these challenges, we propose a novel framework for WB correction that leverages a perception-inspired Learnable HSI (LHSI) color space. Built upon a cylindrical color model that naturally separates luminance from chromatic components, our framework further introduces dedicated parameters to enhance this disentanglement and learnable mapping to adaptively refine the flexibility. Moreover, a new Mamba-based network is introduced, which is tailored to the characteristics of the proposed LHSI color space. Experimental results on benchmark datasets demonstrate the superiority of our method, highlighting the potential of perception-inspired color space design in computational photography. The source code is available at https://github.com/YangCheng58/WB_Color_Space.
comment: Accepted to WACV 2026
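For readers unfamiliar with the cylindrical model that the LHSI abstract builds on, the sketch below shows the standard textbook RGB-to-HSI conversion. It is not the learnable LHSI space proposed in the paper, whose dedicated parameters and learnable mapping are not described in the abstract; the function name is an assumption.

```python
import math


def rgb_to_hsi(r, g, b):
    """Convert RGB values in [0, 1] to (hue in degrees, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0:
        return 0.0, 0.0, 0.0
    saturation = 1.0 - min(r, g, b) / intensity
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if denominator == 0:
        hue = 0.0  # achromatic pixel: hue is undefined, 0 by convention
    else:
        ratio = max(-1.0, min(1.0, numerator / denominator))
        theta = math.degrees(math.acos(ratio))
        hue = 360.0 - theta if b > g else theta
    return hue, saturation, intensity


print(rgb_to_hsi(1.0, 0.0, 0.0))
print(rgb_to_hsi(0.5, 0.5, 0.5))
```

Intensity is carried in a single channel while hue and saturation hold the chromatic information, which is the separation the abstract contrasts with the entangled channels of sRGB.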
MokA: Multimodal Low-Rank Adaptation for MLLMs
In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts of effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information using modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text) and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.
Towards Open-World Human Action Segmentation Using Graph Convolutional Networks IROS25
Human-object interaction segmentation is a fundamental task of daily activity understanding, which plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. While most existing learning-based methods excel in closed-world action segmentation, they struggle to generalize to open-world scenarios where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling; 2) Mixup-based training to synthesize out-of-distribution data, eliminating reliance on manual annotations; and 3) a novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O). Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performance (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.
comment: 8 pages, 3 figures, accepted in IROS25, Hangzhou, China
Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation IROS25
Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, we propose a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, we introduce a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
comment: 8 pages, 5 figures, accepted in IROS25, Hangzhou, China
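The idea of mapping 3D skeleton coordinates into a sin-cos space, mentioned in the MMGCN abstract above, can be sketched in a few lines. The abstract does not give the frequencies or the output layout, so the power-of-two frequency schedule and the name `sinusoidal_encode` are assumptions borrowed from common positional-encoding practice.

```python
import math


def sinusoidal_encode(joint, num_frequencies=2):
    """Encode each coordinate of a joint as sin/cos pairs at several frequencies.

    The frequencies 2^k * pi are an assumption, not taken from the paper.
    """
    encoded = []
    for value in joint:
        for k in range(num_frequencies):
            angle = (2 ** k) * math.pi * value
            encoded.append(math.sin(angle))
            encoded.append(math.cos(angle))
    return encoded


# Hypothetical normalized 3D joint position (x, y, z).
features = sinusoidal_encode((0.1, 0.2, 0.3), num_frequencies=2)
print(len(features))
```

Every output value lies in [-1, 1] regardless of the input scale, which is one plausible reason such an encoding is less sensitive to noisy pose estimates than raw coordinates.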
Glance: Accelerating Diffusion Models with 1 Sample
Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. In extensive experiments, our method achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 sample on a single V100 within one hour, yet the resulting models generalize strongly to unseen prompts.
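The two ingredients named in the Glance abstract, a low-rank (LoRA) weight update and a phase-dependent choice between two adapters, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the `boundary` parameter, the placeholder adapter names, and the helper functions are assumptions, and the abstract does not state where the slow phase ends.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(base, down, up, scale=1.0):
    """Return base + scale * (up @ down).

    base is (out x in), up is (out x rank), down is (rank x in).
    """
    delta = matmul(up, down)
    return [
        [w + scale * d for w, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]


def select_adapter(step, total_steps, slow, fast, boundary=0.5):
    """Use the slow-phase adapter for early steps and the fast-phase adapter afterwards.

    The boundary fraction is a hypothetical parameter for illustration.
    """
    return slow if step < boundary * total_steps else fast


# Rank-1 update on a 2x2 identity weight.
adapted = lora_weight([[1, 0], [0, 1]], down=[[3, 4]], up=[[1], [2]])
print(adapted)
print(select_adapter(0, 10, "slow-adapter", "fast-adapter"))
```

Because only the small `down` and `up` matrices are trained, the base weights stay untouched, which is consistent with the abstract's claim that the adapters are cheap to fit.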
Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries
Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates EM systems with conveyor belts and contains six similar-looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: https://github.com/msamdk/Fish_Re_Identification.git
comment: The paper has been accepted for publication at Northern Lights Deep Learning (NLDL) Conference 2026
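Hard triplet mining, which the abstract above credits for much of the Re-ID improvement, can be illustrated with a small batch-hard sketch. The abstract does not say which mining variant is used; batch-hard selection, the Euclidean distance, and the function names below are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplets(embeddings, labels):
    """For each anchor, pick the farthest same-identity and closest other-identity sample."""
    triplets = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [j for j, other in enumerate(labels) if other == label and j != i]
        negatives = [j for j, other in enumerate(labels) if other != label]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        hardest_pos = max(positives, key=lambda j: euclidean(anchor, embeddings[j]))
        hardest_neg = min(negatives, key=lambda j: euclidean(anchor, embeddings[j]))
        triplets.append((i, hardest_pos, hardest_neg))
    return triplets


# Hypothetical 2D embeddings for two fish identities, "a" and "b".
embeddings = [(0, 0), (1, 0), (3, 0), (1.5, 0), (10, 0)]
labels = ["a", "a", "a", "b", "b"]
print(batch_hard_triplets(embeddings, labels))
```

Each tuple holds the indices of the anchor, its hardest positive, and its hardest negative; a triplet loss would then push the anchor-positive distance below the anchor-negative distance by a margin.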
Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data WACV
Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
comment: This article has been accepted for publication in the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning AAAI 2026
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
comment: Accepted at AAAI 2026, the Project website is available at https://qhemu.github.io/CCoL/
l0-Regularized Sparse Coding-based Interpretable Network for Multi-Modal Image Fusion
Multi-modal image fusion (MMIF) enhances the information content of the fused image by combining the unique as well as common features obtained from different modality sensor images, improving visualization, object detection, and many more tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an $\ell_0$-regularized multi-modal convolutional sparse coding (MCSC) model. Specifically, for solving the $\ell_0$-regularized CSC problem, we design a learnable $\ell_0$-regularized sparse coding (LZSC) block in a principled manner through deep unfolding. Given different modality source images, FNet first separates the unique and common features from them using the LZSC block and then these features are combined to generate the final fused image. Additionally, we propose an $\ell_0$-regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is utilized during FNet's training. Extensive experiments show that FNet achieves high-quality fusion results across eight different MMIF datasets. Furthermore, we show that FNet enhances downstream object detection and semantic segmentation in visible-thermal image pairs. We have also visualized the intermediate results of FNet, which demonstrates the good interpretability of our network. Link for code and models: https://github.com/gargi884/FNet-MMIF.
comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models
Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.
comment: Project page: https://lmb-freiburg.github.io/orbis.github.io/
Hierarchical Deep Learning for Diatom Image Classification: A Multi-Level Taxonomic Approach
Accurate taxonomic identification of diatoms is essential for aquatic ecosystem monitoring, yet conventional methods depend heavily on expert taxonomists. Recent deep learning approaches improve automation, but most treat diatom recognition as flat classification, predicting only one taxonomic rank. We investigate whether embedding taxonomic hierarchy into neural network architectures can improve both accuracy and error locality. We introduce DiatomCascadeNet (H-COFGS), a hierarchical convolutional network with five cascaded heads that jointly predict class, order, family, genus, and species. Each head receives shared backbone features and probability distributions from higher levels, with binary masks restricting predictions to valid descendants during training and inference. Using a filtered dataset of 1,456 diatom images covering 82 species, we compare hierarchical and flat models under identical settings. H-COFGS matches flat baselines at the species level (69.4% accuracy) while outperforming at all upper taxonomic levels. When species predictions fail, errors remain taxonomically local: 92.5% of misclassified species are correctly predicted at the genus level, versus 67.2% for flat baselines. H-COFGS reduces mean taxonomic distance by 38.2% (1.209 vs. 1.955). Progressive training reveals bidirectional mechanisms: hierarchical constraint masks operate top-down to constrain prediction space, while gradients from fine-grained levels propagate bottom-up through the shared backbone, refining features. This improves class accuracy from 96.2% to 99.5% and yields 6-8% gains at upper levels, producing more robust, interpretable, and biologically aligned predictions for multi-level taxonomic classification.
comment: Version 2: Corrected reference details, improved architectural diagram, and enhanced writing for clarity and precision. Added a table illustrating the masking mechanism. No changes to experimental results or conclusions. 11 pages, 6 figures, 3 tables
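The descendant-masking idea in the abstract above can be sketched for two taxonomic levels. Everything here is hypothetical: the tiny genus-to-species table, the function names, and the logits are invented for illustration, and the sketch uses a hard argmax at the parent level for brevity, whereas the abstract describes passing probability distributions from higher levels to lower heads.

```python
import math

# Hypothetical two-level taxonomy: genus index -> valid species indices.
CHILDREN = {0: [0, 1], 1: [2, 3, 4]}

def masked_softmax(logits, allowed):
    """Softmax over `logits` restricted to `allowed`; other entries get 0."""
    peak = max(logits[i] for i in allowed)
    exps = [math.exp(l - peak) if i in allowed else 0.0
            for i, l in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]

def predict_species(genus_logits, species_logits):
    """Pick a genus, then restrict species predictions to its descendants."""
    genus = max(range(len(genus_logits)), key=genus_logits.__getitem__)
    probs = masked_softmax(species_logits, set(CHILDREN[genus]))
    species = max(range(len(probs)), key=probs.__getitem__)
    return genus, species, probs

genus, species, probs = predict_species([0.2, 1.5], [3.0, 0.1, 0.5, 1.0, 0.2])
print(genus, species, probs)
```

In this toy input, species 0 has the largest raw logit but belongs to the other genus, so the mask zeroes it out and the prediction stays inside the chosen genus. This is the sense in which masking keeps errors taxonomically local.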
Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints
3D human generation is increasingly significant in various applications. However, the direct use of 2D generative methods in 3D generation often results in losing local details, while methods that reconstruct geometry from generated images struggle with global view consistency. In this work, we introduce Joint2Human, a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly, ensuring both global structure and local details. To achieve this, we employ the Fourier occupancy field (FOF) representation, enabling the direct generation of 3D shapes as preliminary results with 2D generative models. With the proposed high-frequency enhancer and the multi-view recarving strategy, our method can seamlessly integrate the details from different views into a uniform global shape. To better utilize the 3D human prior and enhance control over the generated geometry, we introduce a compact spherical embedding of 3D joints. This allows for an effective guidance of pose during the generation process. Additionally, our method can generate 3D humans guided by textual inputs. Our experimental results demonstrate the capability of our method to ensure global structure, local details, high resolution, and low computational cost simultaneously. More results and the code can be found on our project page at http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human.
Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models
Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions, such as knowledge grounding, multi-physics interactions, and auditable evidence, substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator to hierarchically assess images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.
Test-Time Distillation for Continual Model Adaptation
Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner, yet existing methods, which rely on self-supervision, are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts, and the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls motivate our insight: the key is to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. It then applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% while using only 48% of its time cost on ImageNet-C.
comment: 11 pages, 6 figures
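One way to picture the MSP-weighted blended teacher described in the abstract above is the sketch below. The normalization rule (each model weighted by its share of the two MSP values), the function names, and the example probabilities are assumptions; the abstract only states that MSP is used as the confidence metric for weighting, not the exact formula, and the Optimal Transport rectification step is not covered here.

```python
def msp(probs):
    """Maximum Softmax Probability: the confidence of a prediction."""
    return max(probs)

def blend_teacher(p_vlm, p_target):
    """Fuse two class distributions, weighting each by its MSP share.

    Hypothetical weighting rule, chosen only to illustrate the idea.
    """
    c_vlm, c_tgt = msp(p_vlm), msp(p_target)
    w = c_vlm / (c_vlm + c_tgt)  # weight given to the VLM prediction
    return [w * a + (1.0 - w) * b for a, b in zip(p_vlm, p_target)]

# Made-up predictions over three classes.
p_vlm = [0.90, 0.05, 0.05]     # confident VLM
p_target = [0.40, 0.35, 0.25]  # uncertain target model
print(blend_teacher(p_vlm, p_target))
```

Because the weights sum to one, the blended output is still a valid probability distribution, and the more confident model dominates the fused prediction.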
Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters
Gaussian Mixture Models (GMMs) rank among the most frequently used models in machine learning. However, training large, general GMMs becomes computationally prohibitive for datasets that have many data points $N$ of high dimensionality $D$. For GMMs with arbitrary covariances, we here derive a highly efficient variational approximation, which is then integrated with mixtures of factor analyzers (MFAs). For GMMs with $C$ components, our proposed algorithm substantially reduces runtime complexity from $\mathcal{O}(NCD^2)$ per iteration to a complexity scaling linearly with $D$ and sublinearly with $NC$. In numerical experiments, we first validate that the complexity reduction results in a sublinear scaling for the entire GMM optimization process. Second, we show on large-scale benchmarks that the sublinear algorithm results in speed-ups of an order of magnitude compared to the state of the art. Third, as a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, observing training times of less than nine hours on a single state-of-the-art CPU. Fourth and finally, we demonstrate the effectiveness of large-scale GMMs on the task of zero-shot image denoising, where sublinear training results in state-of-the-art denoising times while competitive denoising performance is maintained.
comment: 31 pages, 8 figures (and 28 pages, 7 figures in Appendix)
Vision-centric Token Compression in Large Language Model NeurIPS 2025
Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by 7.6% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at https://github.com/CSU-JPG/VIST.
comment: NeurIPS 2025 spotlight
UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
The integration of AI-assisted biomedical image analysis into clinical practice demands AI-generated findings that are not only accurate but also interpretable to clinicians. However, existing biomedical AI models generally lack the ability to simultaneously generate diagnostic findings and localize corresponding biomedical objects. This limitation makes it challenging for clinicians to correlate AI-generated findings with visual evidence (e.g., tiny lesions) in images and interpret the results of AI models. To address this challenge, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation, which is capable of generating accurate diagnostic findings and simultaneously segmenting the corresponding biomedical targets. UniBiomed is based on a novel integration of Multi-modal Large Language Model and Segment Anything Model, which can effectively unify diverse biomedical tasks in universal training for advancing grounded interpretation. To develop UniBiomed, we curate a large-scale dataset comprising over 27 million triplets of images, region annotations, and text descriptions across ten biomedical imaging modalities. Extensive validation on 70 internal and 14 external datasets demonstrated the state-of-the-art performance of UniBiomed in diverse biomedical tasks, including image segmentation, disease recognition, region-aware diagnosis, vision question answering, and report generation. In summary, UniBiomed is a powerful and versatile biomedical foundation model, unlocking the untapped grounded interpretation capability for optimizing AI-assisted biomedical image analysis.
comment: A universal foundation model for grounded biomedical image interpretation
Enhancing Hand Palm Motion Gesture Recognition by Eliminating Reference Frame Bias via Frame-Invariant Similarity Measures
The ability of robots to recognize human gestures facilitates a natural and accessible human-robot collaboration. However, most work in gesture recognition remains rooted in reference frame-dependent representations. This poses a challenge when reference frames vary due to different work cell layouts, imprecise frame calibrations, or other environmental changes. This paper investigated the use of invariant trajectory descriptors for robust hand palm motion gesture recognition under reference frame changes. First, a novel dataset of recorded Hand Palm Motion (HPM) gestures is introduced. The motion gestures in this dataset were specifically designed to be distinguishable without dependence on specific reference frames or directional cues. Afterwards, multiple invariant trajectory descriptor approaches were benchmarked to assess how their performances generalize to this novel HPM dataset. After this offline benchmarking, the best scoring approach is validated for online recognition by developing a real-time Proof of Concept (PoC). In this PoC, hand palm motion gestures were used to control the real-time movement of a manipulator arm. The PoC demonstrated a high recognition reliability in real-time operation, achieving an $F_1$-score of 92.3%. This work demonstrates the effectiveness of the invariant descriptor approach as a standalone solution. Moreover, we believe that the invariant descriptor approach can also be utilized within other state-of-the-art pattern recognition and learning systems to improve their robustness against reference frame variations.
comment: This is the preprint version of a paper accepted for publication at the 2025 IEEE International Conference on Automation Science and Engineering (CASE). The final published version is available at DOI: 10.1109/CASE58245.2025.11163910
AttenDence: Maximizing Attention Confidence for Test Time Adaptation
Test-time adaptation (TTA) enables models to adapt to distribution shifts at inference time. While entropy minimization over the output distribution has proven effective for TTA, transformers offer an additional unsupervised learning signal through their attention mechanisms. We propose minimizing the entropy of attention distributions from the CLS token to image patches as a novel TTA objective. This approach encourages the model to attend more confidently to relevant image regions under distribution shift and is effective even when only a single test image is available. We demonstrate that attention entropy minimization improves robustness across diverse corruption types while not hurting performance on clean data on a single sample stream of images at test time.
comment: add appendix
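The quantity minimized in the AttenDence abstract above, the entropy of the CLS-to-patch attention distribution, can be computed as in the following sketch. The function names and the toy scores are hypothetical, and the snippet only evaluates the objective; an actual test-time adaptation loop would differentiate this quantity with respect to model parameters in an autodiff framework, which is not shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(cls_to_patch_scores):
    """Shannon entropy (in nats) of the CLS token's attention over patches."""
    attn = softmax(cls_to_patch_scores)
    return -sum(a * math.log(a) for a in attn if a > 0.0)

# Made-up pre-softmax scores from the CLS token to four image patches.
diffuse = [0.0, 0.0, 0.0, 0.0]  # uniform attention, maximal entropy
focused = [5.0, 0.0, 0.0, 0.0]  # confident attention on one patch
print(attention_entropy(diffuse), attention_entropy(focused))
```

Uniform attention over four patches gives the maximum entropy of log 4, while attention concentrated on one patch gives a much smaller value, so minimizing this term encourages confident attention.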
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
comment: 23 pages, 12 figures
Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel NeRA Adapter for Enhanced Feature Adaptation WACV 2026
Specialized datasets that capture the fashion industry's rich language and styling elements can boost progress in AI-driven fashion design. We present FLORA (Fashion Language Outfit Representation for Apparel Generation), the first comprehensive dataset containing 4,330 curated pairs of fashion outfits and corresponding textual descriptions. Each description utilizes industry-specific terminology and jargon commonly used by professional fashion designers, providing precise and detailed insights into the outfits. Hence, the dataset captures the delicate features and subtle stylistic elements necessary to create high-fidelity fashion designs. We demonstrate that fine-tuning generative models on the FLORA dataset significantly enhances their capability to generate accurate and stylistically rich images from textual descriptions of fashion sketches. FLORA will catalyze the creation of advanced AI models capable of comprehending and producing subtle, stylistically rich fashion designs. It will also help fashion designers and end-users to bring their ideas to life. As a second orthogonal contribution, we introduce NeRA (Nonlinear low-rank Expressive Representation Adapter), a novel adapter architecture based on Kolmogorov-Arnold Networks (KAN). Unlike traditional PEFT techniques such as LoRA, LoKR, DoRA, and LoHA that use MLP adapters, NeRA uses learnable spline-based nonlinear transformations, enabling superior modeling of complex semantic relationships and achieving strong fidelity, faster convergence, and semantic alignment. Extensive experiments on our proposed FLORA and LAION-5B datasets validate the superiority of NeRA over existing adapters. We will open-source both the FLORA dataset and our implementation code.
comment: Accepted as a Conference Paper at WACV 2026 (USA)
Effective Online Exam Proctoring by Combining Lightweight Face Detection and Deep Recognition
Online exams, conducted via video conferencing platforms such as Zoom, have become popular in educational institutions since COVID-19. While convenient, ensuring the integrity and security of online exams remains challenging, as traditional invigilation struggles to effectively monitor multiple student video feeds in real time. In this paper, we present iExam, an effective online exam proctoring and analysis system that combines lightweight face detection and deep recognition. iExam employs real-time face detection to assist invigilators in continuously monitoring student presence, and leverages deep face recognition for post-exam video analysis to identify abnormal behaviors, including face disappearance, face rotation, and identity substitution. To realize this system, we address three core challenges: (i) designing a lightweight approach to efficiently capture and analyze exam video streams in real time; (ii) developing an enhanced OCR method to automatically extract student identities from dynamically positioned Zoom name tags, enabling reliable ground truth labeling without manual intervention; and (iii) optimizing the training and inference pipeline to significantly reduce resource and time requirements on ordinary teacher devices. Extensive experiments demonstrate that iExam achieves 90.4% accuracy for real-time face detection and 98.4% accuracy for post-exam face recognition, while maintaining low overhead. These results show that iExam can substantially enhance the automation and reliability of online exam proctoring in practice.
comment: This is a technical report from Lingnan University and the Chinese University of Hong Kong
Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
comment: Project page at https://xizaoqu.github.io/video4spatial/
DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream
Reconstructing Dynamic 3D Gaussian Splatting (3DGS) from low-framerate RGB videos is challenging. This is because large inter-frame motions will increase the uncertainty of the solution space. For example, one pixel in the first frame might have more choices to reach the corresponding pixel in the second frame. Event cameras can asynchronously capture rapid visual changes and are robust to motion blur, but they do not provide color information. Intuitively, the event stream can provide deterministic constraints for the inter-frame large motion by the event trajectories. Hence, combining low-temporal-resolution images with high-framerate event streams can address this challenge. However, it is challenging to jointly optimize Dynamic 3DGS using both RGB and event modalities due to the significant discrepancy between these two data modalities. This paper introduces a novel framework that jointly optimizes dynamic 3DGS from the two modalities. The key idea is to adopt event motion priors to guide the optimization of the deformation fields. First, we extract the motion priors encoded in event streams by using the proposed LoCM unsupervised fine-tuning framework to adapt an event flow estimator to a certain unseen scene. Then, we present the geometry-aware data association method to build the event-Gaussian motion correspondence, which is the primary foundation of the pipeline, accompanied by two useful strategies, namely motion decomposition and inter-frame pseudo-label. Extensive experiments show that our method outperforms existing image and event-based approaches across synthetic and real scenes and prove that our method can effectively optimize dynamic 3DGS with the help of event data.
comment: Accepted by IEEE TVCG
LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation NeurIPS 2025
We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at: https://github.com/UnicomAI/LeMiCa
comment: NeurIPS 2025 Spotlight
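To make the lexicographic minimax path idea from the LeMiCa abstract above concrete, here is a toy sketch over a directed acyclic graph whose nodes are denoising steps. Among all paths that use a fixed number of edges, it returns the one whose edge errors, sorted from largest to smallest, are lexicographically smallest, so the worst edge is minimized first, then the second worst, and so on. The fixed edge budget, the function name, and the quadratic error function are all assumptions for illustration; the abstract does not say how edge errors are measured or how the search is implemented.

```python
def lexicographic_minimax_path(num_steps, budget, edge_error):
    """Find a path 0 -> num_steps with exactly `budget` edges (hypothetical API).

    Returns (errors sorted descending, path) or None if no such path exists.
    """
    # best[k][v] = (sorted-descending edge errors, path) reaching v with k edges
    best = [dict() for _ in range(budget + 1)]
    best[0][0] = ((), (0,))
    for k in range(budget):
        for u, (errs, path) in best[k].items():
            for v in range(u + 1, num_steps + 1):
                cand = tuple(sorted(errs + (edge_error(u, v),), reverse=True))
                # Tuple comparison is lexicographic, worst error first.
                if v not in best[k + 1] or cand < best[k + 1][v][0]:
                    best[k + 1][v] = (cand, path + (v,))
    return best[budget].get(num_steps)

# Toy error model: skipping more steps costs quadratically more (made up).
result = lexicographic_minimax_path(4, 2, lambda u, v: (v - u) ** 2)
print(result)
```

With four steps and two allowed edges, the balanced schedule 0, 2, 4 wins because its worst edge error (4) is smaller than that of either unbalanced alternative (9).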
Thinking Ahead: Foresight Intelligence in MLLMs and World Models
In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.
comment: 25 pages, 27 figures
Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning
Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, $\textbf{I}$mplicit $\textbf{V}$ideo $\textbf{Q}$uestion $\textbf{A}$nswering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by $0.76\%$, $1.37\%$, and $4.87\%$, respectively. Additionally, IRM performs SOTA on similar implicit advertisement understanding and future prediction in traffic-VQA. Datasets and codes are available for double-blind review in anonymous repo: https://github.com/tychen-SJTU/Implicit-VideoQA.
comment: Preprint
ConsistCompose: Unified Multimodal Layout Control for Image Composition
Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, which limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text input within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.
comment: 22 pages, 17 figures
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
Learning Generalizable Shape Completion with SIM(3) Equivariance NeurIPS 2025
3D shape completion methods typically assume scans are pre-aligned to a canonical frame. This leaks pose and scale cues that networks may exploit to memorize absolute positions rather than inferring intrinsic geometry. When such alignment is absent in real data, performance collapses. We argue that robust generalization demands architectural equivariance to the similarity group, SIM(3), so the model remains agnostic to pose and scale. Following this principle, we introduce the first SIM(3)-equivariant shape completion network, whose modular layers successively canonicalize features, reason over similarity-invariant geometry, and restore the original frame. Under a de-biased evaluation protocol that removes the hidden cues, our model outperforms both equivariant and augmentation baselines on the PCN benchmark. It also sets new cross-domain records on real driving and indoor scans, lowering minimal matching distance on KITTI by 17% and Chamfer distance $\ell_1$ on OmniObject3D by 14%. Perhaps surprisingly, ours under the stricter protocol still outperforms competitors under their biased settings. These results establish full SIM(3) equivariance as an effective route to truly generalizable shape completion. Project page: https://sime-completion.github.io.
comment: NeurIPS 2025
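To make the notion of similarity-invariant geometry concrete, the following minimal sketch applies a SIM(3) transform (scale, rotation, translation) to a point set and checks that normalized pairwise distances do not change. It only illustrates invariance under the group, not the equivariant network layers described in the abstract; the function names are hypothetical, and the rotation is restricted to the z-axis for brevity.

```python
import math


def sim3_transform(points, scale, angle, translation):
    """Apply x' = scale * R_z(angle) * x + translation to each 3D point."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    out = []
    for x, y, z in points:
        rx, ry, rz = c * x - s * y, s * x + c * y, z  # rotation about the z-axis
        out.append((scale * rx + tx, scale * ry + ty, scale * rz + tz))
    return out


def normalized_pairwise_distances(points):
    """Pairwise distances divided by the largest one: unchanged by any SIM(3) transform."""
    dists = [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]
    longest = max(dists)
    return [d / longest for d in dists]


shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
moved = sim3_transform(shape, scale=2.5, angle=0.7, translation=(4.0, -1.0, 3.0))
before = normalized_pairwise_distances(shape)
after = normalized_pairwise_distances(moved)
```

Here `before` and `after` agree up to floating-point error, whereas the raw coordinates in `moved` differ from those in `shape`.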
Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Page: https://wm-research.github.io/Dream4Drive/ GitHub Link: https://github.com/wm-research/Dream4Drive
Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment WACV 2026
Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives (those directly tied to the prompt's alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.
comment: WACV 2026
SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination WACV 2026
Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
comment: WACV 2026
SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes
We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor scenes. SplatCo builds upon three novel components: 1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features representing fine details. This fusion is achieved through a hierarchical compensation mechanism, ensuring both global spatial awareness and local detail preservation; 2) a cross-view pruning mechanism that removes overfitted or inaccurate Gaussians based on structural consistency, thereby improving storage efficiency and preventing rendering artifacts; 3) a structure-view co-learning module that aggregates structural gradients with view gradients, thereby steering the optimization of Gaussian geometric and appearance attributes more robustly. By combining these key components, SplatCo effectively achieves high-fidelity rendering for large-scale scenes. Code and project page are available at https://splatco-tech.github.io.
ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
We introduce ObjectAdd, a training-free diffusion modification method that adds user-desired objects into a user-specified area. ObjectAdd is motivated by two observations: first, describing everything in one prompt can be difficult, and second, users often need to add objects to an already generated image. To suit real-world use, ObjectAdd maintains image consistency after adding objects through three technical innovations: (1) embedding-level concatenation to ensure that text embeddings coalesce correctly; (2) object-driven layout control with latent and attention injection to ensure that objects land in the user-specified area; (3) prompted image inpainting with attention refocusing and object expansion to ensure that the rest of the image stays the same. Given a text-prompted image, ObjectAdd lets users specify a box and an object, and achieves: (1) adding the object inside the box area; (2) preserving the exact content outside the box area; (3) flawless fusion between the two areas.
comment: 12 pages in total
RectifiedHR: High-Resolution Diffusion via Energy Profiling and Adaptive Guidance Scheduling
High-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality. We analyze the latent energy landscape during sampling and propose adaptive classifier-free guidance (CFG) schedules that maintain stable energy trajectories. Our approach introduces energy-aware scheduling strategies that modulate guidance strength over time, achieving superior stability scores (0.9998) and consistency metrics (0.9873) compared to fixed-guidance approaches. We demonstrate that DPM++ 2M with linear-decreasing CFG scheduling yields optimal performance, providing sharper, more faithful images while reducing artifacts. Our energy profiling framework serves as a powerful diagnostic tool for understanding and improving diffusion model behavior.
comment: 8 Pages, 10 Figures, Pre-Print Version, This version is under review for citation accuracy
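As a rough illustration of a linear-decreasing CFG schedule, the sketch below lowers the guidance scale linearly over the sampling steps and applies the standard classifier-free guidance combination to a pair of noise predictions. The endpoint values `w_start=7.5` and `w_end=1.0`, the function names, and the toy vectors are assumptions chosen for illustration; the abstract does not state the schedule's parameters, and the sampler (DPM++ 2M) is not reproduced here.

```python
def linear_decreasing_cfg(step, num_steps, w_start=7.5, w_end=1.0):
    """Guidance scale falling linearly from w_start (first step) to w_end (last step)."""
    if num_steps < 2:
        return w_start
    frac = step / (num_steps - 1)
    return w_start + (w_end - w_start) * frac


def apply_cfg(eps_uncond, eps_cond, w):
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical 5-step schedule and a toy two-element noise prediction.
schedule = [linear_decreasing_cfg(t, 5) for t in range(5)]
guided = apply_cfg([0.0, 1.0], [1.0, 3.0], schedule[0])
```

A fixed-guidance baseline corresponds to `w_start == w_end`, which makes the two strategies easy to compare in the same sampling loop.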
Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement
We study on-device time-series analysis for gait detection in Parkinson's disease (PD) from short windows of triaxial acceleration, targeting resource-constrained wearables and edge nodes. We compare magnitude thresholding to three 1D CNNs for time-series analysis: a literature baseline (separable convolutions) and two ultra-light models - one purely separable and one with residual connections. Using the BioStampRC21 dataset, 2 s windows at 30 Hz, and subject-independent leave-one-subject-out (LOSO) validation on 16 PwPD with chest-worn IMUs, our residual separable model (Model 2, 533 params) attains PR-AUC = 94.5%, F1 = 91.2%, MCC = 89.4%, matching or surpassing the baseline (5,552 params; PR-AUC = 93.7%, F1 = 90.5%, MCC = 88.5%) with approximately 10x fewer parameters. The smallest model (Model 1, 305 params) reaches PR-AUC = 94.0%, F1 = 91.0%, MCC = 89.1%. Thresholding obtains high recall (89.0%) but low precision (76.5%), yielding many false positives and high inter-subject variance. Sensor-position analysis (train-on-all) shows chest and thighs are most reliable; forearms degrade precision/recall due to non-gait arm motion; naive fusion of all sites does not outperform the best single site. Both compact CNNs execute within tight memory/latency budgets on STM32-class MCUs (sub-10 ms on low-power boards), enabling on-sensor gating of transmission/storage. Overall, ultra-light separable CNNs provide a superior accuracy-efficiency-generalization trade-off to fixed thresholds for wearable PD gait detection and underscore the value of tailored time-series models for edge deployment.
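The parameter savings of separable convolutions can be checked with simple arithmetic. The sketch below counts weights for a standard 1D convolution and for a depthwise-separable one, and computes the window length implied by 2 s at 30 Hz. The layer sizes in the example (3 input channels, 8 output channels, kernel size 5) are hypothetical and are not the architectures evaluated in the abstract.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Weights in a standard 1D convolution."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_conv1d_params(c_in, c_out, kernel, bias=True):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Window length for 2 s of triaxial acceleration sampled at 30 Hz.
window_samples = 2 * 30

# Hypothetical layer: 3 input axes, 8 filters, kernel size 5.
standard = conv1d_params(3, 8, 5)
separable = separable_conv1d_params(3, 8, 5)
```

For this hypothetical layer the standard convolution has 128 parameters and the separable one has 50; the gap widens as channel counts and kernel sizes grow.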
Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning
Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.
More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning
Zeroth-order (ZO) optimization has gained attention as a memory-efficient alternative to first-order (FO) methods, particularly in settings where gradient computation is expensive or even impractical. Beyond its memory efficiency, in this work, we investigate ZO optimization for continual learning (CL) as a novel approach to address the plasticity-stability-efficiency trilemma. Through theoretical analysis and empirical evidence, we show that ZO optimization naturally leads to flatter loss landscapes, which in turn reduce forgetting in CL. However, this stability comes at a cost of plasticity: due to its imprecise gradient estimates and slower convergence, ZO optimization tends to be less effective than FO in acquiring new task-specific knowledge, particularly under constrained training budgets. To better understand this trade-off, we conduct a holistic evaluation of ZO optimization applied to various existing CL methods. Our findings reveal that ZO optimization enhances stability but often undermines plasticity, particularly when used with learnable classifiers. Motivated by this insight, we propose ZO-FC, a simple but effective approach that applies ZO optimization to a single adapter-based PEFT module with an FO-optimized classifier. This design leverages the stability benefits of ZO while preserving the adaptability of FO updates with negligible memory overhead. Experiments demonstrate that ZO-FC achieves an effective balance between stability and plasticity, offering a practical and memory-efficient solution for on-device CL.
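For readers unfamiliar with zeroth-order updates, the sketch below shows a common two-point estimator: the loss is evaluated at two perturbed parameter vectors and the difference is projected back onto the random direction, so no backpropagation is needed. This is a generic textbook form, not the ZO-FC method; the step size, the perturbation scale `mu`, and the toy quadratic loss are assumptions for illustration.

```python
import random


def zo_gradient(loss, params, mu=1e-3, rng=None):
    """Two-point zeroth-order estimate: ((L(w + mu*z) - L(w - mu*z)) / (2*mu)) * z."""
    if rng is None:
        rng = random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in params]  # random probe direction
    plus = [p + mu * zi for p, zi in zip(params, z)]
    minus = [p - mu * zi for p, zi in zip(params, z)]
    scale = (loss(plus) - loss(minus)) / (2.0 * mu)
    return [scale * zi for zi in z]


def quadratic(w):
    """Toy loss with its minimum at the origin."""
    return sum(x * x for x in w)


rng = random.Random(0)
w = [1.0, -2.0, 0.5]
initial_loss = quadratic(w)
for _ in range(200):
    g = zo_gradient(quadratic, w, rng=rng)
    w = [x - 0.05 * gi for x, gi in zip(w, g)]  # plain SGD step on the estimate
final_loss = quadratic(w)
```

Each estimate only requires forward evaluations, which is where the memory savings come from; the price is a noisy gradient, consistent with the slower convergence noted in the abstract.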
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net
The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
comment: Submitted to a journal. Re-uploaded after revising the mathematical derivation sections to correct some errors in the steps
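The two distribution measures named in the abstract have simple closed forms for discrete distributions, sketched below. How ATTBHFA-Net weights and combines them is not specified in the abstract, so the sketch covers only the standard definitions; the function names are hypothetical.

```python
import math


def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i); 1 for identical, 0 for disjoint distributions."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def hellinger_distance(p, q):
    """H(p, q) = sqrt(1 - BC(p, q)); bounded in [0, 1]."""
    bc = bhattacharyya_coefficient(p, q)
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding above 1


same = hellinger_distance([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])
```

Because both quantities are computed from the same overlap term, they can be evaluated together at little extra cost when comparing class prototypes as probability distributions.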
UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning WACV 2026
Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
comment: 11 pages, 7 figures. Accepted at WACV 2026
MoCA-Video: Motion-Aware Concept Alignment for Consistent Video Editing
We present MoCA-Video, a training-free framework for semantic mixing in videos. Operating in the latent space of a frozen video diffusion model, MoCA-Video utilizes class-agnostic segmentation with a diagonal denoising scheduler to localize and track the target object across frames. To ensure temporal stability under semantic shifts, we introduce momentum-based correction to approximate novel hybrid distributions beyond the training data distribution, alongside a light gamma residual module that smooths out visual artifacts. We evaluate the model's performance using SSIM, LPIPS, and a proposed metric, \metricnameabbr, which quantifies semantic alignment between reference and output. Extensive evaluation demonstrates that our model consistently outperforms both training-free and trained baselines, achieving superior semantic mixing and temporal coherence without retraining. Results establish that structured manipulation of diffusion noise trajectories enables controllable and high-quality video editing under semantic shifts.
Image and Video Processing
Fast and Robust LRSD-based SAR/ISAR Imaging and Decomposition
The earlier works in the context of low-rank-sparse-decomposition (LRSD)-driven stationary synthetic aperture radar (SAR) imaging have shown significant improvement in the reconstruction-decomposition process. Neither of the proposed frameworks, however, can achieve satisfactory performance when facing a platform residual phase error (PRPE) arising from the instability of airborne platforms. More importantly, in spite of the significance of real-time processing requirements in remote sensing applications, these prior works have only focused on enhancing the quality of the formed image, not reducing the computational burden. To address these two concerns, this article presents a fast and unified joint SAR imaging framework where the dominant sparse objects and low-rank features of the image background are decomposed and enhanced through a robust LRSD. In particular, our unified algorithm circumvents the tedious task of computing the inverse of large matrices for image formation and takes advantage of the recent advances in constrained quadratic programming to handle the unimodular constraint imposed due to the PRPE. Furthermore, we extend our approach to ISAR autofocusing and imaging. Specifically, due to the intrinsic sparsity of ISAR images, the LRSD framework is essentially tasked with the recovery of a sparse image. Several experiments based on synthetic and real data are presented to validate the superiority of the proposed method in terms of imaging quality and computational cost compared to the state-of-the-art methods.
Hyperspectral Image Data Reduction for Endmember Extraction
Endmember extraction from hyperspectral images aims to identify the spectral signatures of materials present in a scene. Recent studies have shown that self-dictionary methods can achieve high extraction accuracy; however, their high computational cost limits their applicability to large-scale hyperspectral images. Although several approaches have been proposed to mitigate this issue, it remains a major challenge. Motivated by this situation, this paper pursues a data reduction approach. Assuming that the hyperspectral image follows the linear mixing model with the pure-pixel assumption, we develop a data reduction technique that removes pixels that do not contain endmembers. We analyze the theoretical properties of this reduction step and show that it preserves pixels that lie close to the endmembers. Building on this result, we propose a data-reduced self-dictionary method that integrates the data reduction with a self-dictionary method based on a linear programming formulation. Numerical experiments demonstrate that the proposed method can substantially reduce the computational time of the original self-dictionary method without sacrificing endmember extraction accuracy.
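For reference, the linear mixing model with the pure-pixel assumption is commonly written as below, in a noise-free form. The notation is assumed here rather than taken from the paper: $x_j$ is the spectrum of pixel $j$, $e_i$ is the signature of endmember $i$, $a_{ij}$ is an abundance, and there are $r$ endmembers and $n$ pixels.

```latex
x_j = \sum_{i=1}^{r} a_{ij}\, e_i, \qquad a_{ij} \ge 0, \qquad \sum_{i=1}^{r} a_{ij} = 1, \qquad j = 1, \dots, n,
\qquad \text{and for every } i \text{ there is a pixel } j_i \text{ with } x_{j_i} = e_i .
```

Under this model every pixel lies in the convex hull of the endmembers, which is why pixels far from the hull's vertices are natural candidates for removal before running a self-dictionary method.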
Active Optics for Hyperspectral Imaging of Reflective Agricultural Leaf Sensors
Monitoring plant health increasingly relies on leaf-mounted sensors that provide real-time physiological data, yet efficiently locating and sampling these sensors in complex agricultural environments remains a major challenge. We present an integrated, adaptive, and scalable system that autonomously detects and interrogates plant sensors using a coordinated suite of low-cost optical components including a LiDAR, liquid lens, monochrome camera, filter wheel, and Fast Steering Mirror (FSM). The system first uses LiDAR to identify the distinct reflective signatures of sensors within the field, then dynamically redirects the camera's field of view via the FSM to target each sensor for hyperspectral imaging. The liquid lens continuously adjusts focus to maintain image sharpness across varying depths, enabling precise spectral measurements. We validated the system in controlled indoor experiments, demonstrating accurate detection and tracking of reflective plant sensors and successful acquisition of their spectral data. To our knowledge, no other system currently integrates these sensing and optical modalities for agricultural monitoring. This work establishes a foundation for adaptive, low-cost, and automated plant sensor interrogation, representing a significant step toward scalable, real-time plant health monitoring in precision agriculture.
A Unified Theory of Dynamic Programming Algorithms in Small Target Detection
Small target detection is inherently challenging due to the minimal size, lack of distinctive features, and the presence of complex backgrounds. Heavy noise further complicates the task by both obscuring and imitating the target appearance. Weak target signals require integrating target trajectories over multiple frames, an approach that can be computationally intensive. Dynamic programming offers an efficient solution by decomposing the problem into iterative maximization. This, however, has limited the analytical tools available for their study. In this paper, we present a robust framework for this class of algorithms and establish rigorous convergence results for error rates under mild assumptions. We depart from standard analysis by modeling error probabilities as a function of distance from the target, allowing us to construct a relationship between uncertainty in location and uncertainty in existence. From this framework, we introduce a novel algorithm, Normalized Path Integration (NPI), that utilizes the similarity between sequential observations, enabling target detection with unknown or time varying features.
comment: 11 pages, 6 figures
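The iterative maximization mentioned in the abstract can be sketched with a generic dynamic-programming recursion of the track-before-detect kind: each cell's score is its current measurement plus the best score among reachable cells in the previous frame. This is a textbook-style illustration on a 1D grid, not the paper's Normalized Path Integration algorithm; the function name, the `max_step` motion limit, and the toy frames are assumptions.

```python
def dp_track_scores(frames, max_step=1):
    """Accumulate scores: V_k(x) = z_k(x) + max over |x' - x| <= max_step of V_{k-1}(x')."""
    scores = list(frames[0])
    for frame in frames[1:]:
        new_scores = []
        for x, value in enumerate(frame):
            lo = max(0, x - max_step)
            hi = min(len(frame), x + max_step + 1)
            new_scores.append(value + max(scores[lo:hi]))
        scores = new_scores
    return scores


# Toy sequence: a target of strength 5 moving one cell to the right per frame.
frames = [
    [0, 5, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 5],
]
final_scores = dp_track_scores(frames)
best_cell = final_scores.index(max(final_scores))
```

For the toy sequence the accumulated score peaks at the target's final position, since only that cell can chain all three target measurements within the motion limit.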
Feature Compression for Machines with Range-Based Channel Truncation and Frame Packing
This paper proposes a method that enhances the compression performance of the current model under development for the upcoming MPEG standard on Feature Coding for Machines (FCM). This standard aims at providing interoperable compressed bitstreams of features in the context of split computing, i.e., when the inference of a large computer vision neural-network (NN)-based model is split between two devices. Intermediate features can consist of multiple 3D tensors that can be reduced and entropy coded to limit the required bandwidth of such transmission. In the envisioned design for the MPEG-FCM standard, intermediate feature tensors may be reduced using neural layers before being converted into 2D video frames that can be coded using existing video compression standards. This paper introduces an additional channel truncation and packing method which enables the system to preserve the relevant channels, depending on the statistics of the features at inference time, while preserving the computer vision task performance at the receiver. Implemented within the MPEG-FCM test model, the proposed method yields an average rate reduction of 10.59% for a given accuracy on multiple computer vision tasks and datasets.
comment: 10 pages, 8 figures. Extended version of the paper with the same title presented at IEEE DCC 2025
Learning from a Generative Oracle: Domain Adaptation for Restoration
Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.
An Open Source Realtime GPU Beamformer for Row-Column and Top Orthogonal to Bottom Electrode (TOBE) Arrays
Research ultrasound platforms have enabled many next-generation imaging sequences but have lacked realtime navigation capabilities for emerging 2D arrays such as row-column arrays (RCAs). We present an open-source, GPU-accelerated reconstruction and rendering software suite integrated with a programmable ultrasound platform and novel electrostrictive Top-Orthogonal-to-Bottom-Electrode (TOBE) arrays. The system supports advanced realtime modes, including cross-plane aperture-encoded synthetic-aperture imaging and aperture-encoded volumetric scanning. TOBE-enabled methods demonstrate improved image quality and expanded field of view compared with conventional RCA techniques. The software implements beamforming and rendering kernels using OpenGL compute shaders and is designed for maximum data throughput, helping to minimize stalls and latency. Accompanying sample datasets and example scripts for offline reconstruction are provided to facilitate external testing.
comment: 17 pages, 11 figures. for mentioned datasets, videos, and files see: https://drive.google.com/drive/folders/1ZqfkJjFfx6JA0gwoHjLdohN1dp-p2Yfu?usp=sharing
E-CHUM: Event-based Cameras for Human Detection and Urban Monitoring
Understanding human movement and city dynamics has always been challenging. From traditional methods of manually observing the city's inhabitants, to using cameras, to now using sensors and more complex technology, the field of urban monitoring has evolved greatly. Still, there is more that can be done to unlock better practices for understanding city dynamics. This paper surveys how the study of urban dynamics has evolved, with a particular focus on event-based cameras. Event-based cameras capture changes in light intensity instead of the RGB values that traditional cameras record. They offer unique abilities, such as working in low light, that can make them advantageous compared to other sensors. Through an analysis of event-based cameras, their applications, their advantages and challenges, and machine learning applications, we propose event-based cameras as a medium for capturing information to study urban dynamics. They offer the ability to capture important information while maintaining privacy. We also suggest multi-sensor fusion of event-based cameras and other sensors in the study of urban dynamics. Combining event-based cameras with infrared, event-LiDAR, or vibration has the potential to enhance the ability of event-based cameras and overcome the challenges that they face.
Equivariant Test-Time Training with Operator Sketching for Imaging Inverse Problems
Equivariant Imaging (EI) regularization has become the de facto technique for unsupervised training of deep imaging networks, without any need for ground-truth data. Observing that the EI-based unsupervised training paradigm currently has significant computational redundancy, leading to inefficiency in high-dimensional applications, we propose a sketched EI regularization which leverages randomized sketching techniques for acceleration. We apply our sketched EI regularization to develop an accelerated deep internal learning framework, which can be efficiently applied for test-time network adaptation. Additionally, for network adaptation tasks, we propose a parameter-efficient approach to accelerate both EI and Sketched-EI by optimizing only the normalization layers. Our numerical study on X-ray CT and multicoil magnetic resonance image reconstruction tasks demonstrates that our approach can achieve significant computational acceleration over the standard EI counterpart, especially in test-time training tasks.
comment: 20 pages
Panoramic Out-of-Distribution Segmentation
Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications such as autonomous driving and augmented reality. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to pixel distortions and background clutter. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. In addition, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
comment: Code and datasets will be available at https://github.com/MengfeiD/PanOoS
Effective Online Exam Proctoring by Combining Lightweight Face Detection and Deep Recognition
Online exams, conducted via video conferencing platforms such as Zoom, have become popular in educational institutions since COVID-19. While convenient, ensuring the integrity and security of online exams remains challenging, as traditional invigilation struggles to effectively monitor multiple student video feeds in real time. In this paper, we present iExam, an effective online exam proctoring and analysis system that combines lightweight face detection and deep recognition. iExam employs real-time face detection to assist invigilators in continuously monitoring student presence, and leverages deep face recognition for post-exam video analysis to identify abnormal behaviors--including face disappearance, face rotation, and identity substitution. To realize this system, we address three core challenges: (i) designing a lightweight approach to efficiently capture and analyze exam video streams in real time; (ii) developing an enhanced OCR method to automatically extract student identities from dynamically positioned Zoom name tags, enabling reliable ground truth labeling without manual intervention; and (iii) optimizing the training and inference pipeline to significantly reduce resource and time requirements on ordinary teacher devices. Extensive experiments demonstrate that iExam achieves 90.4% accuracy for real-time face detection and 98.4% accuracy for post-exam face recognition, while maintaining low overhead. These results show that iExam can substantially enhance the automation and reliability of online exam proctoring in practice.
comment: This is a technical report from Lingnan University and the Chinese University of Hong Kong
Artificial Intelligence in Image-based Cardiovascular Disease Analysis
Recent advancements in Artificial Intelligence (AI) have significantly influenced the field of Cardiovascular Disease (CVD) analysis, particularly in image-based diagnostics. Our paper presents an extensive review of AI applications in image-based CVD analysis, offering insights into its current state and future potential. We systematically categorize the literature based on the primary anatomical structures related to CVD, dividing them into non-vessel structures (such as ventricles and atria) and vessel structures (including the aorta and coronary arteries). This categorization provides a structured approach to exploring various imaging modalities, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), which are commonly used in CVD research. Our review encompasses these modalities, giving a broad perspective on the diverse imaging techniques integrated with AI for CVD analysis. We conclude with an examination of the challenges and limitations inherent in current AI-based CVD analysis methods and suggest directions for future research to overcome these hurdles.
Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement
We study on-device time-series analysis for gait detection in Parkinson's disease (PD) from short windows of triaxial acceleration, targeting resource-constrained wearables and edge nodes. We compare magnitude thresholding to three 1D CNNs for time-series analysis: a literature baseline (separable convolutions) and two ultra-light models - one purely separable and one with residual connections. Using the BioStampRC21 dataset, 2 s windows at 30 Hz, and subject-independent leave-one-subject-out (LOSO) validation on 16 PwPD with chest-worn IMUs, our residual separable model (Model 2, 533 params) attains PR-AUC = 94.5%, F1 = 91.2%, MCC = 89.4%, matching or surpassing the baseline (5,552 params; PR-AUC = 93.7%, F1 = 90.5%, MCC = 88.5%) with approximately 10x fewer parameters. The smallest model (Model 1, 305 params) reaches PR-AUC = 94.0%, F1 = 91.0%, MCC = 89.1%. Thresholding obtains high recall (89.0%) but low precision (76.5%), yielding many false positives and high inter-subject variance. Sensor-position analysis (train-on-all) shows chest and thighs are most reliable; forearms degrade precision/recall due to non-gait arm motion; naive fusion of all sites does not outperform the best single site. Both compact CNNs execute within tight memory/latency budgets on STM32-class MCUs (sub-10 ms on low-power boards), enabling on-sensor gating of transmission/storage. Overall, ultra-light separable CNNs provide a superior accuracy-efficiency-generalization trade-off to fixed thresholds for wearable PD gait detection and underscore the value of tailored time-series models for edge deployment.
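As a rough illustration of the windowing arithmetic and of a magnitude-threshold baseline like the one the abstract compares against, the sketch below flags a 2 s window at 30 Hz (60 samples) as gait when the spread of the acceleration magnitude exceeds a threshold. The decision rule, the `threshold` value, and the function names are assumptions made for illustration; the abstract does not specify how the paper's thresholding is defined.

```python
import math

FS_HZ = 30                      # sampling rate stated in the abstract
WINDOW_S = 2                    # window length stated in the abstract
WINDOW_LEN = FS_HZ * WINDOW_S   # 60 samples per window

def magnitude(sample):
    """Euclidean norm of one triaxial acceleration sample (ax, ay, az)."""
    ax, ay, az = sample
    return math.sqrt(ax * ax + ay * ay + az * az)

def is_gait(window, threshold=0.1):
    """Hypothetical rule: gait if the standard deviation of the
    acceleration magnitude within the window exceeds `threshold`."""
    mags = [magnitude(s) for s in window]
    mean = sum(mags) / len(mags)
    std = math.sqrt(sum((m - mean) ** 2 for m in mags) / len(mags))
    return std > threshold

# Synthetic windows (made-up data, in units of g).
still = [(0.0, 0.0, 1.0)] * WINDOW_LEN
walking = [
    (0.0, 0.0, 1.0 + 0.5 * math.sin(2 * math.pi * 2.0 * n / FS_HZ))
    for n in range(WINDOW_LEN)
]

print(is_gait(still), is_gait(walking))
```

A fixed rule of this kind has no learned parameters, which is why it is cheap, but any non-gait motion with enough variance also crosses the threshold, consistent with the low precision the abstract reports for thresholding.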
CTorch: PyTorch-Compatible GPU-Accelerated Auto-Differentiable Projector Toolbox for Computed Tomography
This work introduces CTorch, a PyTorch-compatible, GPU-accelerated, and auto-differentiable projector toolbox designed to handle various CT geometries with configurable projector algorithms. CTorch provides flexible scanner geometry definition, supporting 2D fan-beam, 3D circular cone-beam, and 3D non-circular cone-beam geometries. Each geometry allows view-specific definitions to accommodate variations during scanning. Both flat- and curved-detector models may be specified to accommodate various clinical devices. CTorch implements four projector algorithms: voxel-driven, ray-driven, distance-driven (DD), and separable footprint (SF), allowing users to balance accuracy and computational efficiency based on their needs. All the projectors are primarily built using CUDA C for GPU acceleration, then compiled as Python-callable functions, and wrapped as PyTorch network modules. This design allows direct use of PyTorch tensors, enabling seamless integration into PyTorch's auto-differentiation framework. These features make CTorch a flexible and efficient tool for CT imaging research, with potential applications in accurate CT simulations, efficient iterative reconstruction, and advanced deep-learning-based CT reconstruction.
Bayesian Multifractal Image Segmentation
Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity based on the multifractal spectrum. Several works have shown the value of using MFA for the description of homogeneous textures in images. Nevertheless, natural images can be composed of several textures and, in turn, of the multifractal properties associated with those textures. This paper introduces an unsupervised Bayesian multifractal segmentation method to model and segment multifractal textures by jointly estimating the multifractal parameters and labels on images at the pixel level. For this, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is first developed, defining different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. A Gibbs sampling methodology is finally used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.
comment: 20 pages, 7 figures, accepted paper
Random-phase Wave Splatting of Translucent Primitives for Computer-generated Holography
Holographic near-eye displays offer ultra-compact form factors for VR/AR systems but rely on advanced computer-generated holography (CGH) algorithms to convert 3D scenes into interference patterns on spatial light modulators (SLMs). Conventional CGH typically generates smooth-phase holograms, limiting view-dependent effects and realistic defocus blur, while severely under-utilizing the SLM space-bandwidth product. We propose Random-phase Wave Splatting (RPWS), a unified wave optics rendering framework that converts arbitrary 3D representations based on 2D translucent primitives into random-phase holograms. RPWS is fully compatible with modern 3D representations such as Gaussians and triangles, improves bandwidth utilization which effectively enlarges eyebox size, reconstructs accurate defocus blur and parallax, and leverages time-multiplexed rendering not as a heuristic for speckle suppression, but as a mathematically exact alpha-blending mechanism derived from first principles in statistics. At the core of RPWS are (1) a new wavefront compositing procedure and (2) an alpha-blending scheme for random-phase geometric primitives, ensuring correct color reconstruction and robust occlusion when compositing millions of primitives. RPWS departs substantially from the recent primitive-based CGH algorithm, Gaussian Wave Splatting (GWS). Because GWS uses smooth-phase primitives, it struggles to capture view-dependent effects and realistic defocus blur and under-utilizes the SLM space-bandwidth product; moreover, naively extending GWS to random-phase primitives fails to reconstruct accurate colors. In contrast, RPWS is designed from the ground up for arbitrary random-phase translucent primitives, and through simulations and experimental validations we demonstrate state-of-the-art image quality and perceptually faithful 3D holograms for next-generation near-eye displays.
Denoising Diffusion Models for Anomaly Localization in Medical Images
This review explores anomaly localization in medical images using denoising diffusion models. After providing a brief methodological background of these models, including their application to image reconstruction and their conditioning using guidance mechanisms, we provide an overview of available datasets and evaluation metrics suitable for their application to anomaly localization in medical images. In this context, we discuss supervision schemes ranging from fully supervised segmentation to semi-supervised, weakly supervised, self-supervised, and unsupervised methods, and provide insights into the effectiveness and limitations of these approaches. Furthermore, we highlight open challenges in anomaly localization, including detection bias, domain shift, computational cost, and model interpretability. Our goal is to provide an overview of the current state of the art in the field, outline research gaps, and highlight the potential of diffusion models for robust anomaly localization in medical images.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:030
Multimedia
Building Audio-Visual Digital Twins with Smartphones
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture and a visual-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.
comment: Under Mobisys 2026 review, single blind
Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering
Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.
MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data
Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30,000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that the Multi30k extension consistently achieves greater than 0.8 cosine similarity and symmetric KL divergence less than 0.000251 for all languages supported except Zh_Hant, which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split.
comment: 7 pages, 2 figures, 5 tables. Not published at any conference at this time
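The two similarity measures named in the MultiScript30k abstract can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, the inputs are toy vectors, and the symmetric KL divergence is taken here as the sum of both directions, which is one common definition and may differ from the one the authors use (some works average the two terms instead).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Assumed definition: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Toy inputs (made up): two embedding-like vectors and two distributions.
u, v = [1.0, 2.0, 3.0], [1.0, 2.0, 2.5]
p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]

print(cosine_similarity(u, v), symmetric_kl(p, q))
```

Under this reading, a cosine similarity near 1 together with a symmetric KL divergence near 0 indicates that the source and translated sentences have closely matching representations.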
Q-BAR: Blogger Anomaly Recognition via Quantum-enhanced Manifold Learning
In recommendation-driven online media, creators increasingly suffer from semantic mutation, where malicious secondary edits preserve visual fidelity while altering the intended meaning. Detecting these mutations requires modeling a creator's unique semantic manifold. However, training robust detector models for individual creators is challenged by data scarcity, as a distinct blogger may typically have fewer than 50 representative samples available for training. We propose quantum-enhanced blogger anomaly recognition (Q-BAR), a hybrid quantum-classical framework that leverages the high expressivity and parameter efficiency of variational quantum circuits to detect semantic anomalies in low-data regimes. Unlike classical deep anomaly detectors that often struggle to generalize from sparse data, our method employs a parameter-efficient quantum anomaly detection strategy to map multimodal features into a Hilbert space hypersphere. On a curated dataset of 100 creators, our quantum-enhanced approach achieves robust detection performance with significantly fewer trainable parameters compared to classical baselines. By utilizing only hundreds of quantum parameters, the model effectively mitigates overfitting, demonstrating the potential of quantum machine learning for personalized media forensics.
Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video ACM MM 2025
Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
comment: 8 pages, 9 figures, published in ACM MM 2025
It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models
Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.
Computation and Language
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has recently been successfully extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
comment: Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1
Stronger Normalization-Free Transformers
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
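The point-wise function in the abstract above is simple enough to evaluate directly. The sketch below computes Derf(x) = erf(αx + s) with Python's standard `math.erf`; the default values of `alpha` and `s` are placeholders chosen for illustration and are not taken from the paper.

```python
import math

def derf(x: float, alpha: float = 1.0, s: float = 0.0) -> float:
    """Derf(x) = erf(alpha * x + s).

    `alpha` scales the input and `s` shifts it; both defaults are
    illustrative placeholders, not values reported by the authors.
    """
    return math.erf(alpha * x + s)

# erf is bounded in [-1, 1], so extreme inputs saturate instead of growing.
for x in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(x, derf(x))
```

Because the output is bounded, large activations are squashed toward -1 or 1, which matches the abstract's description of point-wise functions that constrain extreme values.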
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing time to first non-thinking token from minutes to <= 5s and the overall real-time delays by 6-11x.
comment: Preprint, work in progress
CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences
Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.
comment: 11 pages
Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity
Emotions are central to politics, and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise, with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.
Quantifying Emotional Tone in Tolkien's The Hobbit: Dialogue Sentiment Analysis with RegEx, NRC-VAD, and Python
This study analyzes the emotional tone of dialogue in J. R. R. Tolkien's The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel's emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations -- including emotional trajectory graphs and word clouds -- highlight how Tolkien's language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
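The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the study's code: the `TOY_VAD` lexicon and its values are invented placeholders (not entries from NRC-VAD), the function names are hypothetical, and the pattern assumes dialogue is delimited by straight double quotes.

```python
import re

# Hypothetical stand-in for the NRC-VAD lexicon: word -> (valence, arousal, dominance).
# The numbers are invented for illustration only.
TOY_VAD = {
    "good": (0.9, 0.4, 0.6),
    "morning": (0.7, 0.3, 0.5),
    "danger": (0.1, 0.9, 0.3),
}

DIALOGUE_RE = re.compile(r'"([^"]+)"')  # text between straight double quotes
WORD_RE = re.compile(r"[a-z']+")        # lowercase word tokens


def extract_dialogue(text):
    """Return every quoted span found in the text, in order."""
    return DIALOGUE_RE.findall(text)


def score_utterance(utterance, lexicon=TOY_VAD):
    """Average the (valence, arousal, dominance) of the words found in the lexicon.

    Returns None when no word of the utterance is covered by the lexicon.
    """
    words = WORD_RE.findall(utterance.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return None
    n = len(hits)
    return tuple(sum(h[i] for h in hits) / n for i in range(3))
```

Scoring each extracted utterance in reading order and smoothing the resulting sequence would give an emotional trajectory of the kind the abstract mentions.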
LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification
LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
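To make the fusion step concrete, here is a minimal sketch of concatenating a classifier embedding with LLM per-class scores and passing the joint vector through a small MLP. It is not the package's API: `fuse_features`, `init_mlp`, and `mlp_forward` are hypothetical names, the weights are random rather than trained, and the dimensions are arbitrary.

```python
import math
import random


def fuse_features(embedding, llm_scores):
    """Concatenate the classifier embedding with the LLM per-class scores."""
    return list(embedding) + list(llm_scores)


def init_mlp(in_dim, hidden_dim, n_classes, seed=0):
    """Create a one-hidden-layer MLP with small random weights (untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(n_classes)]
    return {"w1": w1, "b1": [0.0] * hidden_dim, "w2": w2, "b2": [0.0] * n_classes}


def mlp_forward(params, x):
    """ReLU hidden layer followed by a softmax over classes."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(params["w1"], params["b1"])
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(params["w2"], params["b2"])
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For a three-class task, `mlp_forward(init_mlp(6, 4, 3), fuse_features([0.2, -0.1, 0.5], [0.7, 0.2, 0.1]))` returns a probability vector of length three; in a real system the MLP weights would be learned on labeled data.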
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
Replace, Don't Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly
Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-$k$, typically address this by adding more context or pruning existing lists. However, simply expanding the context window often leads to context dilution, where distractors crowd out relevant information. We propose SEAL-RAG, a training-free controller that adopts a "replace, don't expand" strategy to fight context dilution under a fixed retrieval depth $k$. SEAL executes a (Search $\rightarrow$ Extract $\rightarrow$ Assess $\rightarrow$ Loop) cycle: it performs on-the-fly, entity-anchored extraction to build a live gap specification (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-$k$ in a shared environment on HotpotQA and 2WikiMultiHopQA. On HotpotQA ($k=3$), SEAL improves answer correctness by +3-13 pp and evidence precision by +12-18 pp over Self-RAG. On 2WikiMultiHopQA ($k=5$), it outperforms Adaptive-$k$ by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. These gains are statistically significant ($p<0.001$). By enforcing fixed-$k$ replacement, SEAL yields a predictable cost profile while ensuring the top-$k$ slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
comment: 24 pages, 2 figures
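The fixed-budget idea in the SEAL-RAG abstract can be illustrated with a small sketch. This is a simplification under stated assumptions, not the released implementation: the scoring function simply counts how many missing ("gap") entities a passage mentions, and the names `gap_entity_score` and `replace_dont_expand` are hypothetical.

```python
def gap_entity_score(gap_entities):
    """Build a scorer that counts how many gap entities a passage mentions."""
    def score(passage):
        text = passage.lower()
        return sum(1 for entity in gap_entities if entity.lower() in text)
    return score


def replace_dont_expand(current, candidates, score, k):
    """Keep at most k passages: a new candidate may only displace the weakest one.

    The evidence set never grows beyond k, so the context size stays fixed.
    """
    pool = list(current)
    for cand in candidates:
        if cand in pool:
            continue
        if len(pool) < k:
            pool.append(cand)
            continue
        weakest = min(pool, key=score)
        if score(cand) > score(weakest):
            pool[pool.index(weakest)] = cand  # swap in place, size unchanged
    return pool
```

With `k=3` and a gap specification of `["Marie Curie", "Warsaw"]`, a newly retrieved passage mentioning both entities displaces one zero-scoring distractor while the evidence set stays at three passages.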
Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting
Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation
Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage
Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal -- particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss. We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios. This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
Textual Data Bias Detection and Mitigation - An Extensible Pipeline with Experimental Evaluation
Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution
Procedural memory enables large language model (LLM) agents to internalize "how-to" knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose ReMe (Remember Me, Refine Me), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) multi-faceted distillation, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers, and generating comparative insights; 2) context-adaptive reuse, which tailors historical insights to new contexts via scenario-aware indexing; and 3) utility-based refinement, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state of the art in agent memory systems. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms the larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the reme.library dataset to facilitate further research.
comment: 16 pages, 9 figures, 9 tables
From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages
Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source-language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in the historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics) that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance. The study positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.
AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence
Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.
RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems
Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.
Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models
In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multihop composition and strict conjunctive control, and reliance on spurious lexical relations in the input could produce misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited for such multihop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural-language and non-natural-language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models generalize more robustly across our tests, including the non-natural-language split. Both architectures are only matched or surpassed by decoder-only architectures at large scales. We conclude by noting that for cost-effective, short-horizon, robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.
XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs
Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighting DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training (CPT) phase. We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), and Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.
comment: Accepted and presented at the LLMs4All workshop at the IEEE BigData 2025 Conference, Macau - December 8-11, 2025
Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs
What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality such as subject-auxiliary inversion and the licensing of parasitic gaps are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation. We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, which supports the view that they are sensitive to structure and not just linear order. Structural generalizations, distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.
comment: 2 figures
When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric that we term WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 language models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.
Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis
This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data -- including textual semantics, prosodic speech features, and temporal behavioral trends -- to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.
comment: This manuscript is currently under peer review in Expert Systems with Applications
Enhancing Next-Generation Language Models with Knowledge Graphs: Extending Claude, Mistral IA, and GPT-4 via KG-BERT ALT 2025
Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.
comment: This paper was accepted and scheduled for inclusion in the ICALT 2025 proceedings but was ultimately not published due to absence from the conference presentation. It appears in the official program booklet. Conference: 2025 IEEE International Conference on Advanced Learning Technologies (ICALT)
Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring "Tortured Phrases" in Scientific Literature
The integrity and reliability of scientific literature face a serious threat from adversarial text generation techniques, specifically the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g. "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
comment: 10 pages, 5 figures; unpublished manuscript; submitted to arXiv for dissemination
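The first stage of SRAP relies on pseudo-perplexity, which can be sketched as follows. This is a minimal illustration under assumptions: the per-token log-probabilities are taken as given (in practice they would come from masking each token in turn and querying a masked language model), the function names are hypothetical, and the fixed threshold is an arbitrary example value rather than one reported by the authors.

```python
import math


def pseudo_perplexity(token_log_probs):
    """exp of the mean negative log-probability of each token given its context."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def flag_tokens(tokens, token_log_probs, threshold):
    """Flag tokens whose surprisal (-log p) exceeds a static threshold."""
    return [t for t, lp in zip(tokens, token_log_probs) if -lp > threshold]
```

For example, with masked-LM probabilities of 0.001, 0.002, and 0.5 for the tokens "counterfeit", "consciousness", and "models", a static surprisal threshold of 3.0 flags the first two tokens and leaves the third alone.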
T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.
Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers NeurIPS 2025
Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at https://github.com/meaningful96/CoopRAG
comment: Accepted to NeurIPS 2025
Sliding Window Attention Adaptation
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This raises the question: can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
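To make the sliding-window and "sink" token ideas in the abstract above concrete, here is a minimal sketch of a causal attention mask that combines both. It is an illustration under stated assumptions, not the SWAA code: the function names are hypothetical, and the window convention (a window of size `window` includes the current token) is an assumption that may differ from the paper's definition.

```python
from typing import List


def swa_mask(seq_len: int, window: int, num_sink: int = 0) -> List[List[bool]]:
    """Build a causal sliding-window mask with optional sink tokens.

    mask[q][k] is True when query position q may attend to key position k.
    A key is visible if it is not in the future and is either within the
    last `window` positions (current token included) or among the first
    `num_sink` positions.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = q - k < window
            is_sink = k < num_sink
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count (query, key) pairs that are attended, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


full = swa_mask(6, window=6)                 # window covers everything: full causal attention
local = swa_mask(6, window=2)                # pure sliding window
local_sink = swa_mask(6, window=2, num_sink=1)  # sliding window plus one sink token
```

For a fixed window, the number of attended pairs grows linearly with sequence length rather than quadratically, which is the cost reduction the abstract refers to.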
BRACE: A Benchmark for Robust Audio Caption Quality Evaluation
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
GPG: Generalized Policy Gradient Theorem for Transformer-based Policies
We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
Multilingual VLM Training: Adapting an English-Trained VLM to French
Artificial intelligence has made great progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non-English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we explore and compare different methods in terms of performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding
Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.
Watermarks for Language Models via Probabilistic Automata
A recent watermarking scheme for language models achieves distortion-free embedding and robustness to edit-distance attacks. However, it suffers from limited generation diversity and high detection overhead. In parallel, recent research has focused on undetectability, a property ensuring that watermarks remain difficult for adversaries to detect and spoof. In this work, we introduce a new class of watermarking schemes constructed through probabilistic automata. We present two instantiations: (i) a practical scheme with exponential generation diversity and computational efficiency, and (ii) a theoretical construction with formal undetectability guarantees under cryptographic assumptions. Extensive experiments on LLaMA-3B and Mistral-7B validate the superior performance of our scheme in terms of robustness and efficiency.
CIEGAD: Cluster-Conditioned Interpolative and Extrapolative Framework for Geometry-Aware and Domain-Aligned Data Augmentation
In practical deep learning deployment, the scarcity of data and the imbalance of label distributions often lead to semantically uncovered regions within the real-world data distribution, hindering model training and causing misclassification near class boundaries as well as unstable behaviors in peripheral areas. Although recent large language models (LLMs) show promise for data augmentation, an integrated framework that simultaneously achieves directional control of generation, domain alignment, and quality control has not yet been fully established. To address these challenges, we propose a Cluster-conditioned Interpolative and Extrapolative framework for Geometry-Aware and Domain-aligned data augmentation (CIEGAD), which systematically complements both in-distribution and out-of-distribution semantically uncovered regions. CIEGAD constructs domain profiles through cluster conditioning, allocates generation with a hierarchical frequency-geometric allocation integrating class frequency and geometric indicators, and finely controls generation directions via the coexistence of interpolative and extrapolative synthesis. It further performs quality control through geometry-constrained filtering combined with an LLM-as-a-Judge mechanism. Experiments on multiple classification tasks demonstrate that CIEGAD effectively extends the periphery of real-world data distributions while maintaining high alignment between generated and real-world data as well as semantic diversity. In particular, for long-tailed and multi-class classification tasks, CIEGAD consistently improves F1 and recall, validating the triple harmony of distributional consistency, diversity, and quality. These results indicate that CIEGAD serves as a practically oriented data augmentation framework that complements underrepresented regions while preserving alignment with real-world data.
Offscript: Automated Auditing of Instruction Adherence in LLMs
Large Language Models (LLMs) and generative search systems are increasingly used for information seeking by diverse populations with varying preferences for knowledge sourcing and presentation. While users can customize LLM behavior through custom instructions and behavioral prompts, no mechanism exists to evaluate whether these instructions are being followed effectively. We present Offscript, an automated auditing tool that efficiently identifies potential instruction following failures in LLMs. In a pilot study analyzing custom instructions sourced from Reddit, Offscript detected potential deviations from instructed behavior in 86.4% of conversations, 22.2% of which were confirmed as material violations through human review. Our findings suggest that automated auditing serves as a viable approach for evaluating compliance to behavioral instructions related to information seeking.
FIBER: A Multilingual Evaluation Resource for Factual Inference Bias
Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model's generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across topics: 31% of the topics exhibit a factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages: Turkish prompts show higher bias than Italian prompts in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.
Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Good-quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers: first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type scoring low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves.
Applying NLP to iMessages: Understanding Topic Avoidance, Responsiveness, and Sentiment
What is your messaging data used for? While many users do not often think about the information companies can gather based on their messaging platform of choice, it is nonetheless important to consider as society increasingly relies on short-form electronic communication. While most companies keep their data closely guarded, inaccessible to users or potential hackers, Apple has opened a door to its walled-garden ecosystem, providing iMessage users on Mac with one file storing all their messages and attached metadata. With knowledge of this locally stored file, the question now becomes: What can our data do for us? In the creation of our iMessage text message analyzer, we set out to answer five main research questions focusing on topic modeling, response times, reluctance scoring, and sentiment analysis. This paper uses our exploratory data to show how these questions can be answered using our analyzer and its potential in future studies on iMessage data.
comment: 11 pages, 18 figures, https://github.com/Alanshnir/imessage-analyzer/blob/main/Research/NLP-iMessage-Analyzer%20Findings.pdf
MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data
Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that the Multi30k extension consistently achieves greater than 0.8 cosine similarity and symmetric KL divergence less than 0.000251 for all languages supported except Zh_Hant, which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split.
comment: 7 pages, 2 figures, 5 tables. Not published at any conference at this time
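The two similarity measures named in the MultiScript30k abstract above, cosine similarity and symmetric KL divergence, can be sketched in a few lines. This is a generic illustration, not the authors' evaluation code: the abstract does not say how embeddings are turned into probability distributions, and the symmetric KL here is taken as the sum KL(p||q) + KL(q||p), which is one common convention (some definitions use half of this sum).

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetric KL as the sum of both directions (an assumed convention)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy distributions for illustration only.
p = [0.5, 0.5]
q = [0.25, 0.75]
divergence = symmetric_kl(p, q)
```

Identical distributions give a symmetric KL of zero, so values close to zero indicate that the compared distributions are nearly the same.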
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and often requires intricate crafting of few-shot examples. We propose a fast automatic prompt construction algorithm that augments human instructions by generating a small set of few-shot examples. Our method iteratively replaces, drops, or keeps few-shot examples using Monte Carlo Shapley estimation of example utility. To speed up evaluation, we use aggressive subsampling and a replay buffer. Our method can be run under different compute time budgets. On a limited budget, we outperform existing automatic prompting methods on text simplification and GSM8K and obtain second-best results on classification and summarization. With an extended but still modest compute budget, we set a new state of the art among automatic prompting methods on classification, simplification and GSM8K. Our results show that carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data-efficient prompt engineering. Our code is available at https://github.com/Batorskq/PIAST.
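The Monte Carlo Shapley estimation mentioned in the abstract above can be illustrated with the standard permutation-sampling estimator. This is a generic sketch, not the PIAST algorithm: the replace/drop/keep loop, subsampling, and replay buffer are not reproduced, and the `toy_utility` function with its `WEIGHTS` table is a hypothetical stand-in for evaluating a prompt built from a subset of few-shot examples.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def mc_shapley(
    items: Sequence[str],
    utility: Callable[[FrozenSet[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {item: 0.0 for item in items}
    for _ in range(num_permutations):
        order = list(items)
        rng.shuffle(order)
        chosen = set()
        prev = utility(frozenset(chosen))
        for item in order:
            chosen.add(item)
            cur = utility(frozenset(chosen))
            totals[item] += cur - prev  # marginal contribution of this item
            prev = cur
    return {item: total / num_permutations for item, total in totals.items()}


# Hypothetical additive utility: each example adds a fixed amount to the score.
WEIGHTS = {"ex_a": 0.3, "ex_b": 0.1, "ex_c": -0.05}


def toy_utility(subset: FrozenSet[str]) -> float:
    return sum(WEIGHTS[name] for name in subset)


values = mc_shapley(list(WEIGHTS), toy_utility)
```

In this toy setting the utility is additive, so each estimated value matches the example's weight; an example with a negative estimate (here `ex_c`) would be a natural candidate to drop or replace.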
Better Language Model Inversion by Compactly Representing Next-Token Distributions
Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method, prompt inversion from logprob sequences (PILS), that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2 to 3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5 to 27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
Luxical: High-Speed Lexical-Dense Text Embeddings
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF-IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
comment: 9 pages, 6 figures (v2 fixes typos only)
SyGra: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data
The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT) and for alignment methods such as Direct Preference Optimization (DPO). In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g., geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find that video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable rewards in video games, we propose Game-RL, constructing diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, thus obtaining the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, RL training solely on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse vision-language benchmarks, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, this suggests that video games may serve as valuable scenarios and resources to boost general reasoning abilities. Our code, dataset and models are available at the GitHub repository.
comment: 69 pages, 24 figures
The LLM Wears Prada: Analysing Gender Bias and Stereotypes through Online Shopping Data
With the wide and cross-domain adoption of Large Language Models, it becomes crucial to assess to what extent the statistical correlations in training data, which underlie their impressive performance, hide subtle and potentially troubling biases. Gender bias in LLMs has been widely investigated from the perspectives of occupations, hobbies, and emotions typically associated with a specific gender. In this study, we introduce a novel perspective. We investigate whether LLMs can predict an individual's gender based solely on online shopping histories and whether these predictions are influenced by gender biases and stereotypes. Using a dataset of historical online purchases from users in the United States, we evaluate the ability of six LLMs to classify gender, and we then analyze their reasoning and product-gender co-occurrences. Results indicate that while models can infer gender with moderate accuracy, their decisions are often rooted in stereotypical associations between product categories and gender. Furthermore, explicit instructions to avoid bias reduce the certainty of model predictions, but do not eliminate stereotypical patterns. Our findings highlight the persistent nature of gender biases in LLMs and emphasize the need for robust bias-mitigation strategies.
The Spatial Semantics of Iconic Gesture
The current multimodal turn in linguistic theory leaves a crucial question unanswered: what is the meaning of iconic gestures, and how does it compose with speech meaning? We argue for a separation of linguistic and visual levels of meaning and introduce a spatial gesture semantics that closes this gap. Iconicity is differentiated into three aspects: Firstly, an interpretation of the form of a gesture in terms of a translation from kinematic gesture annotations into vector sequences (iconic model). Secondly, a truth-functional evaluation of the iconic model within spatially extended domains (embedding). Since a simple embedding is too strong, we identify a number of transformations that can be applied to iconic models, namely rotation, scaling, perspective fixation, and quotation of handshape. Thirdly, the linguistic description or classification of an iconic model (informational evaluation). Since the informational evaluation of an iconic gesture is a heuristic act, it needs a place in a semantic theory of visual communication. Informational evaluation lifts a gesture to a quasi-linguistic level that can interact with verbal content. This interaction is either vacuous, or regimented by usual lexicon-driven inferences discussed in dynamic semantic frameworks.
comment: 52 pages, 38 figures, in review
LMSpell: Neural Spell Checking for Low-Resource Languages
Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.
Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models
Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
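The uncertainty-guided aggregation described above can be pictured with a small sketch. The abstract does not give formulas for its aleatoric and epistemic scores, so this example substitutes plain predictive entropy of the softmax distribution as a single stand-in uncertainty score; the function names and the `top_k` parameter are hypothetical, and mean pooling is an assumed aggregation rule rather than the paper's.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's output logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Predictive entropy (in nats) of one token's output distribution.

    Assumption: used here as a stand-in for the paper's uncertainty scores.
    """
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def aggregate_salient(logits_per_token, hidden_per_token, top_k=2):
    """Pick the top_k most uncertain tokens and mean-pool their hidden states."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    entropies = [token_entropy(l) for l in logits_per_token]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = sorted(ranked[:top_k])  # restore original token order
    dim = len(hidden_per_token[0])
    pooled = [
        sum(hidden_per_token[i][d] for i in chosen) / len(chosen)
        for d in range(dim)
    ]
    return chosen, pooled
```

In a probing setup, the pooled vector would then be passed to a small classifier that predicts whether the response is reliable; that classifier is outside the scope of this sketch.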
Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches at inference time (ignore-word instructions, prompt rephrasing, and attention steering), all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
comment: Errors in the paper
SCALE: Upscaled Continual Learning of Large Language Models
We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model's original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model's behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants achieve less forgetting on English evaluations and competitive gains on Korean benchmarks, with these variants offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
A Greek Government Decisions Dataset for Public-Sector Analysis and Insight
We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, featuring high-quality raw text extracted from PDFs. It is released with raw extracted text in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs) respectively, including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.
Anthropocentric bias in language model evaluation
Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence ("auxiliary oversight"), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent ("mechanistic chauvinism"). Mitigating these biases necessitates an empirically-driven, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, which can be done by supplementing carefully designed behavioral experiments with mechanistic studies.
comment: Published in Computational Linguistics
LLMs in Interpreting Legal Documents
This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.
V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat
With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and fine-grained control space which dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and benchmark HumanChatBench to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.
Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning NeurIPS 2025
The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.
comment: NeurIPS 2025
Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak
Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on a newly collected dataset of claims extracted from comments under Czech and Slovak news articles, a domain that poses additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods that can correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
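To make the idea of an alignment score between two claim sets concrete, here is a minimal sketch. The abstract does not specify the similarity function or the alignment algorithm, so both are assumptions here: token-level Jaccard overlap stands in for a semantic similarity model, and an exhaustive search over one-to-one matchings stands in for whatever alignment procedure the paper evaluates. All function names are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_alignment(predicted, reference, sim=jaccard):
    """Exhaustively search one-to-one alignments between two small claim sets.

    Returns (score, pairs): pairs are (predicted_index, reference_index), and
    score is the summed similarity divided by the size of the larger set, so
    unmatched claims on either side lower the score.
    """
    swapped = len(predicted) > len(reference)
    small, large = (reference, predicted) if swapped else (predicted, reference)
    best_total, best_pairs = -1.0, []
    # Brute force is only feasible for small sets; it keeps the sketch dependency-free.
    for perm in permutations(range(len(large)), len(small)):
        total = sum(sim(small[i], large[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total = total
            best_pairs = [(j, i) if swapped else (i, j) for i, j in enumerate(perm)]
    denom = max(len(predicted), len(reference))
    return (best_total / denom if denom else 0.0), best_pairs
```

For realistic set sizes, an assignment solver would replace the brute-force loop, and an embedding-based similarity would replace the Jaccard stand-in.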
Vision-centric Token Compression in Large Language Model NeurIPS 2025
Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by 7.6% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at https://github.com/CSU-JPG/VIST.
comment: NeurIPS 2025 spotlight
GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control
Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main GRPO issues: (i) token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing the training process. To address these issues, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones, and filters out completions whose entropy exceeds a provable threshold to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025 and AMC 2023.
Towards Personalized Deep Research: Benchmarks and Evaluations
Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing benchmarks primarily evaluate DRAs on generic quality metrics and overlook personalization, a critical dimension for individual users. Moreover, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench (PDR-Bench), the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures Personalization Alignment, Content Quality, and Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe content, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first and most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners NeurIPS 2025
Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-weight LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively disentangled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training methods such as supervised fine-tuning or reinforcement learning, our training-free language-reasoning disentanglement achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.
comment: NeurIPS 2025 Camera-ready
Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment NeurIPS 2025
Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
comment: NeurIPS 2025 Camera-ready
Heard or Halted? Gender, Interruptions, and Emotional Tone in U.S. Supreme Court Oral Arguments
This study examines how interruptions during U.S. Supreme Court oral arguments shape both the semantic content and emotional tone of advocates' speech, with a focus on gendered dynamics in judicial discourse. Using the ConvoKit Supreme Court Corpus (2010-2019), we analyze 12,663 speech chunks from advocate-justice interactions to assess whether interruptions alter the meaning of an advocate's argument and whether interruptions toward female advocates exhibit more negative emotional valence. Semantic shifts are quantified using GloVe-based sentence embeddings, while sentiment is measured through lexicon-based analysis. We find that semantic similarity between pre- and post-interruption speech remains consistently high, suggesting that interruptions do not substantially alter argumentative content. However, interruptions directed at female advocates contain significantly higher levels of negative sentiment. These results deepen empirical understanding of gendered communication in elite institutional settings and demonstrate the value of computational linguistic methods for studying power, discourse, and equity in judicial proceedings.
comment: 12 pages, 5 figures, 1 table. Includes appendix. Code available at: https://github.com/1TSHARUKA/Emotional_Interruption_Analysis
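The semantic-shift measurement in this entry can be illustrated with a short sketch. It assumes that a sentence embedding is the average of its word vectors and that similarity is cosine similarity; the abstract says only "GloVe-based sentence embeddings", so both choices are assumptions. The three two-dimensional vectors are invented toy values, not real GloVe entries.

```python
import math

# Hypothetical toy vectors; real GloVe vectors have 50 to 300 dimensions.
TOY_VECTORS = {
    "court": [1.0, 0.0],
    "ruling": [0.0, 1.0],
    "statute": [1.0, 1.0],
}


def sentence_embedding(text, vectors):
    """Average the vectors of known words; returns None if no word is known."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return None
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Compare an advocate's speech before and after an interruption.
pre = sentence_embedding("court ruling", TOY_VECTORS)
post = sentence_embedding("statute", TOY_VECTORS)
similarity = cosine(pre, post)
```

A similarity near 1 between pre- and post-interruption chunks would correspond to the "consistently high" semantic similarity the abstract reports.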
A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents
LLM-based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine-grained but fragmented views of the dialogue. Motivated by neo-Davidsonian event semantics, we propose an event-centric alternative that represents conversational history as short, event-like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non-compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) -- self-contained statements with normalized entities and source turn attributions -- and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval-based variants that use dense similarity search and LLM filtering, with an optional graph-based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval$_S$ benchmarks show that these event-centric memories match or surpass strong baselines, while operating with much shorter QA contexts. Our results suggest that structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents. Our code and data will be released at https://github.com/KevinSRR/EMem.
comment: Work in progress
Leveraging language models for summarizing mental state examinations: A comprehensive evaluation and dataset release COLING 2025
Mental health disorders affect a significant portion of the global population, with diagnoses primarily conducted through Mental State Examinations (MSEs). MSEs serve as structured assessments to evaluate behavioral and cognitive functioning across various domains, aiding mental health professionals in diagnosis and treatment monitoring. However, in developing countries, access to mental health support is limited, leading to an overwhelming demand for mental health professionals. Resident doctors often conduct initial patient assessments and create summaries for senior doctors, but their availability is constrained, resulting in extended patient wait times. This study addresses the challenge of generating concise summaries from MSEs through the evaluation of various language models. Given the scarcity of relevant mental health conversation datasets, we developed a 12-item descriptive MSE questionnaire and collected responses from 405 participants, resulting in 9720 utterances covering diverse mental health aspects. Subsequently, we assessed the performance of five well-known pre-trained summarization models, both with and without fine-tuning, for summarizing MSEs. Our comprehensive evaluation, leveraging metrics such as ROUGE, SummaC, and human evaluation, demonstrates that language models can generate automated coherent MSE summaries for doctors. With this paper, we release our collected conversational dataset and trained models publicly for the mental health research community.
comment: Appeared in: Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi, UAE. ACL Anthology ID: 2025.coling-main.182. (https://aclanthology.org/2025.coling-main.182/)
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. We extend our experiments to 8K context length, and RPG-REINFORCE with RPG-Style Clip achieves 52% accuracy on AIME25, surpassing the official Qwen3-4B-Instruct model (47%). Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme.
comment: Project Page: https://github.com/complex-reasoning/RPG
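The $k_1/k_2/k_3$ estimators named in this entry are not defined in the abstract. The sketch below uses the definitions commonly found in the RL literature, which is an assumption about what the paper intends: for a sample $x \sim q$ and ratio $r = p(x)/q(x)$, $k_1 = -\log r$, $k_2 = \tfrac{1}{2}(\log r)^2$, and $k_3 = (r - 1) - \log r$. Exact expectations over a small discrete distribution are used instead of sampling so the identities can be checked directly, including the abstract's statement that $k_3$ corresponds to the unnormalized KL.

```python
import math


def k1(r):
    """Unbiased but high-variance estimator of KL(q || p); can be negative."""
    return -math.log(r)


def k2(r):
    """Biased, low-variance estimator; always non-negative."""
    return 0.5 * math.log(r) ** 2


def k3(r):
    """Non-negative estimator; unbiased for KL(q || p) when p and q are normalized."""
    return (r - 1.0) - math.log(r)


def expect_under_q(q, p, estimator):
    """Exact expectation of estimator(p/q) under q (assumes q has full support)."""
    return sum(qi * estimator(pi / qi) for qi, pi in zip(q, p))


def kl(q, p):
    """Standard KL term: sum over x of q(x) * log(q(x) / p(x))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def unnormalized_kl(q, p):
    """Generalized KL for possibly unnormalized measures: KL term - sum(q) + sum(p)."""
    return kl(q, p) - sum(q) + sum(p)


q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
p_unnormalized = [0.8, 0.8, 0.4]  # same shape as p but total mass 2

kl_via_k1 = expect_under_q(q, p, k1)
kl_via_k3 = expect_under_q(q, p, k3)
ukl_via_k3 = expect_under_q(q, p_unnormalized, k3)
```

With normalized `p`, the expectations of `k1` and `k3` both equal `kl(q, p)`; with `p_unnormalized`, the expectation of `k3` equals `unnormalized_kl(q, p_unnormalized)` because the extra `r - 1` term contributes exactly `sum(p) - sum(q)`.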
GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation NeurIPS 2025
Retrieval-augmented generation (RAG) has proven effective in integrating knowledge into large language models (LLMs). However, conventional RAGs struggle to capture complex relationships between pieces of knowledge, limiting their performance in intricate reasoning that requires integrating knowledge from multiple sources. Recently, graph-enhanced retrieval augmented generation (GraphRAG) builds graph structure to explicitly model these relationships, enabling more effective and efficient retrievers. Nevertheless, its performance is still hindered by the noise and incompleteness within the graph structure. To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage training process on large-scale datasets, comprising 60 knowledge graphs with over 14M triples and 700k documents. This results in impressive performance and generalizability for GFM-RAG, making it the first graph foundation model applicable to unseen datasets for retrieval without any fine-tuning required. Extensive experiments on three multi-hop QA datasets and seven domain-specific RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws, highlighting its potential for further improvement.
comment: Accepted by NeurIPS 2025
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs, assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
comment: Code: https://github.com/Reza-esfandiarpoor/the-mcp-company
Emotional Support with LLM-based Empathetic Dialogue Generation
Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and finetuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model's ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables
With the widespread application of multimodal large language models in scientific intelligence, there is an urgent need for more challenging evaluation benchmarks to assess their ability to understand complex scientific data. Scientific tables, as core carriers of knowledge representation, combine text, symbols, and graphics, forming a typical multimodal reasoning scenario. However, existing benchmarks are mostly focused on general domains, failing to reflect the unique structural complexity and domain-specific semantics inherent in scientific research. Chemical tables are particularly representative: they intertwine structured variables such as reagents, conditions, and yields with visual symbols like molecular structures and chemical formulas, posing significant challenges to models in cross-modal alignment and semantic parsing. To address this, we propose ChemTable, a large-scale benchmark of chemical tables constructed from real-world literature, containing expert-annotated cell layouts, logical structures, and domain-specific labels. It supports two core tasks: (1) table recognition (structure and content extraction); and (2) table understanding (descriptive and reasoning-based question answering). Evaluation on ChemTable shows that while mainstream multimodal models perform reasonably well in layout parsing, they still face significant limitations when handling critical elements such as molecular structures and symbolic conventions. Closed-source models lead overall but still fall short of human-level performance. This work provides a realistic testing platform for evaluating scientific multimodal understanding, revealing the current bottlenecks in domain-specific reasoning and advancing the development of intelligent systems for scientific research.
The Expressive Capacity of State Space Models: A Formal Language Perspective NeurIPS 2024
Recently, recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. However, there is little understanding of the in-principle abilities of such models, which could provide useful guidance to the search for better LM architectures. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement straightforward and exact solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba.
comment: Published in NeurIPS 2024
Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference
Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead, exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.
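For orientation, the baseline draft-then-verify loop that this abstract builds on can be sketched as follows. This is a minimal illustration of plain greedy speculative decoding, not of Mirror-SD itself: the toy models, the function names, and the greedy acceptance rule are all assumptions made for the example.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """One draft-then-verify step with greedy acceptance (illustrative only)."""
    # The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # The target checks each proposal; a real system would do this in one batched pass.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every proposal matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, k: int, n_tokens: int) -> List[int]:
    """Repeat speculative steps until n_tokens new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]


def toy_target(prefix: List[int]) -> int:
    # Hypothetical deterministic "target model".
    return (sum(prefix) + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical draft that disagrees with the target on every fourth position.
    t = toy_target(prefix)
    return t if len(prefix) % 4 else (t + 1) % 7
```

With greedy acceptance, the output is identical to decoding with the target alone; the draft only changes how many target calls can be batched per step, which is the cost that the abstract's latency-acceptance tradeoff refers to.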
Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA
While utilizing syntactic tools such as parts-of-speech (POS) tagging has helped us understand sentence structures and their distribution across diverse corpora, it is quite complex and poses a challenge in natural language processing (NLP). This study focuses on understanding sentence structure balance (the harmonious use of nouns, verbs, determiners, etc.) without relying on such tools. It proposes a novel statistical method that uses American Standard Code for Information Interchange (ASCII) codes to represent the text of 11 corpora from various sources and their lexical category alignment, compresses these representations with PCA, and analyzes the results through histograms and normality tests such as the Shapiro-Wilk and Anderson-Darling tests. By focusing on ASCII codes, this approach simplifies text processing; it does not replace syntactic tools but complements them as a resource-efficient tool for assessing text balance. The story generated by Grok shows near normality, indicating balanced sentence structures in LLM outputs, whereas 4 out of the remaining 10 pass the normality tests. Further research could explore potential applications in text quality evaluation and style analysis with syntactic integration for broader tasks.
LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLMs' autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.
PUMA: Discovery of Protein Units via Mutation-Aware Merging
Proteins are the essential drivers of biological processes. At the molecular level, they are chains of amino acids that can be viewed through a linguistic lens where the twenty standard residues serve as an alphabet combining to form a complex language, referred to as the language of life. To understand this language, we must first identify its fundamental units. Analogous to words, these units are hypothesized to represent an intermediate layer between single residues and larger domains. Crucially, just as protein diversity arises from evolution, these units should inherently reflect evolutionary relationships. We introduce PUMA (Protein Units via Mutation-Aware Merging) to discover these evolutionarily meaningful units. PUMA employs an iterative merging algorithm guided by substitution matrices to identify protein units and organize them into families linked by plausible mutations. This process creates a hierarchical genealogy where parent units and their mutational variants coexist, simultaneously producing a unit vocabulary and the genealogical structure connecting them. We validate that PUMA families are biologically meaningful; mutations within a PUMA family correlate with clinically benign variants and with high-scoring mutations in high-throughput assays. Furthermore, these units align with the contextual preferences of protein language models and map to known functional annotations. PUMA's genealogical framework provides evolutionarily grounded units, offering a structured approach for understanding the language of life.
comment: 18 pages, 12 figures, 1 table, 1 algorithm
The Illusion of Readiness in Health AI
Large language models have demonstrated remarkable performance in a wide range of medical benchmarks. Yet underneath the seemingly promising results lie salient growth areas, especially in cutting-edge frontiers such as multimodal reasoning. In this paper, we introduce a series of adversarial stress tests to systematically assess the robustness of flagship models and medical benchmarks. Our study reveals prevalent brittleness in the presence of simple adversarial transformations: leading systems can guess the right answer even with key inputs removed, yet may get confused by the slightest prompt alterations, while fabricating convincing yet flawed reasoning traces. Using clinician-guided rubrics, we demonstrate that popular medical benchmarks vary widely in what they truly measure. Our study reveals significant competency gaps of frontier AI in attaining real-world readiness for health applications. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold AI systems accountable to ensure robustness, sound reasoning, and alignment with real medical demands.
Dynamics of Spontaneous Topic Changes in Next Token Prediction with Self-Attention NeurIPS 2025
Human cognition is punctuated by abrupt, spontaneous shifts between topics (driven by emotional, contextual, or associative cues), a phenomenon known as spontaneous thought in neuroscience. In contrast, self-attention based models depend on structured patterns over their inputs to predict each next token, lacking spontaneity. Motivated by this distinction, we characterize spontaneous topic changes in self-attention architectures, revealing both their similarities and their divergences from spontaneous human thought. First, we establish theoretical results under a simplified, single-layer self-attention model with suitable conditions by defining the topic as a set of Token Priority Graphs (TPGs). Specifically, we demonstrate that (1) the model maintains the priority order of tokens related to the input topic, (2) a spontaneous topic change can occur only if lower-priority tokens outnumber all higher-priority tokens of the input topic, and (3) unlike in human cognition, a longer context length or a more ambiguous input topic reduces the likelihood of spontaneous change. Second, we empirically validate that these dynamics persist in modern, state-of-the-art LLMs, underscoring a fundamental disparity between human cognition and AI behaviour in the context of spontaneous topic changes. To the best of our knowledge, no prior work has explored these questions with a focus as closely aligned to human thought.
comment: Accepted to NeurIPS 2025
Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs
Large Language Models (LLMs) exhibit position bias, systematically underweighting information based on its location in the context, but how this bias varies across languages and models remains unclear. We conduct a multilingual study across five typologically diverse languages (English, Russian, German, Hindi, Vietnamese) and five model architectures, analyzing how position bias interacts with prompting strategies and affects output entropy. Our key findings are: (1) Position bias is primarily model-driven but shows language-specific nuances. Notably, Qwen2.5-7B-Instruct, DeepSeek 7B Chat and Mistral 7B consistently favor late positions, challenging the common assumption of universal early-token preference. (2) Explicitly instructing the model, in the presence of irrelevant distractors, that "the most relevant context to the query is marked as 1" unexpectedly reduces accuracy across all languages, questioning standard prompt-engineering practices. (3) Accuracy consistently drops most when relevant information appears in the middle of the context, yet this is not reflected in a corresponding increase in output entropy, suggesting the model remains confident even when it fails to use mid-context cues.
Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning
Autoregressive Language Models (LLMs) trained on Next-Token Prediction (NTP) often suffer from Topic Drift, where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning. While scaling model size mitigates this, the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary Idea Head trained to predict the bag-of-words distribution for a future context window, creating a latent "Concept Vector" that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
comment: Code available at https://github.com/DarshanFofadiya/idea-gated-transformers/tree/main
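One way to picture a differentiable vocabulary gate of the kind the abstract describes is to add the log of a per-token gate value to the next-token logits before the softmax. The sketch below is an assumption-laden illustration, not the authors' implementation: the sigmoid gate, the `eps` smoothing term, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_distribution(token_logits: List[float],
                       idea_logits: List[float],
                       eps: float = 1e-6) -> List[float]:
    """Soft-prune the vocabulary with a gate in (0, 1) per token.

    Assumption: the gate is sigmoid(idea_logit), and it is applied as
    logit + log(gate + eps), which is smooth in both inputs.
    """
    gates = [sigmoid(a) for a in idea_logits]
    gated = [l + math.log(g + eps) for l, g in zip(token_logits, gates)]
    return softmax(gated)
```

Tokens whose gate is close to 1 keep essentially their original logit, while tokens with a gate near 0 receive a large negative offset, so they are suppressed without a hard, non-differentiable mask.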
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
comment: Project webpage: https://plan-lab.github.io/hallusegbench/
Human-Computer Interaction
CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences
Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.
comment: 11 pages
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
Developing and Evaluating a Large Language Model-Based Automated Feedback System Grounded in Evidence-Centered Design for Supporting Physics Problem Solving
Generative AI offers new opportunities for individualized and adaptive learning, particularly through large language model (LLM)-based feedback systems. While LLMs can produce effective feedback for relatively straightforward conceptual tasks, delivering high-quality feedback for tasks that require advanced domain expertise, such as physics problem solving, remains a substantial challenge. This study presents the design of an LLM-based feedback system for physics problem solving grounded in evidence-centered design (ECD) and evaluates its performance within the German Physics Olympiad. Participants assessed the usefulness and accuracy of the generated feedback, which was generally perceived as useful and highly accurate. However, an in-depth analysis revealed that the feedback contained factual errors in 20% of cases, errors that often went unnoticed by the students. We discuss the risks associated with uncritical reliance on LLM-based feedback systems and outline potential directions for generating more adaptive and reliable LLM-based feedback in the future.
Opportunities and Challenges in Harnessing Digital Technology for Effective Teaching and Learning
Most of today's educators have no shortage of digital and online learning technologies at their fingertips, ranging from Learning Management Systems such as Canvas, Blackboard, or Moodle to online meeting tools, online homework and tutoring systems, exam proctoring platforms, computer simulations, and even virtual reality/augmented reality technologies. Furthermore, with the rapid development and wide availability of generative artificial intelligence (GenAI) services such as ChatGPT, we are just at the beginning of harnessing their potential to transform higher education. Yet, facing the large number of available options provided by cutting-edge technology, an imminent question on the mind of most educators is the following: how should I choose the technologies and integrate them into my teaching process so that they would best support student learning? We contemplate these important and timely questions and share our reflections on evidence-based approaches to harnessing digital learning tools using a Self-regulated Engaged Learning Framework that we have employed in our research in physics education and that can be valuable for educators in other disciplines.
Reject or Not?: A Benchmark for Voice Assistant Query Rejection in Smart Home Scenario and an Improved Method Based on LLMs
In smart-home voice assistant scenarios, deciding whether to accept or reject a user query is the first step before any downstream processing. To address the limited query-rejection capability of current voice assistants, this paper presents the first Chinese-oriented open-source benchmark and evaluation suite for smart homes, together with a personalized query-rejection method based on large language models. On the data side, we construct the first multimodal query-rejection dataset tailored for domestic scenarios, containing 11,913 manually labeled text-speech pairs that systematically cover twelve typical dialogue types (e.g., chit-chat, non-human sounds, valid commands, ambiguous references, device-irrelevant requests). Fine-grained labels, conversational context and multi-turn information are provided to support both zero-shot and fine-tuning evaluations across language and multimodal large models. On the method side, we propose a three-tier collaborative architecture: first, a Qwen-2.5-3B adapter fine-tuned to model family-agnostic semantic boundaries; second, a dynamic household-level historical dialogue module to capture personalized habits; third, a household-specific RAG knowledge base that explicitly memorizes and revises past false-rejection cases. Experiments show that the proposed approach significantly outperforms zero-shot and fine-tuned general LLMs on the constructed dataset, with pronounced gains in rejection accuracy for family-specific expressions and complex multi-turn scenarios. This work provides a reproducible data foundation, evaluation standard and extensible technical framework for reliability research in smart-home voice interaction.
InFerActive: Towards Scalable Human Evaluation of Large Language Models through Interactive Inference
Human evaluation remains the gold standard for evaluating outputs of Large Language Models (LLMs). The current evaluation paradigm reviews numerous individual responses, leading to significant scalability challenges. LLM outputs can be more efficiently represented as a tree structure, reflecting their autoregressive generation process and stochastic token selection. However, conventional tree visualization cannot scale to the exponentially large trees generated by modern sampling methods of LLMs. To address this problem, we present InFerActive, an interactive inference system for scalable human evaluation. InFerActive enables on-demand exploration through probability-based filtering and evaluation features, while bridging the semantic gap between computational tokens and human-readable text through adaptive visualization techniques. Through a technical evaluation and user study (N=12), we demonstrate that InFerActive significantly improves evaluation efficiency and enables more comprehensive assessment of model behavior. We further conduct expert case studies that demonstrate InFerActive's practical applicability and potential for transforming LLM evaluation workflows.
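The tree view of sampled outputs and the probability-based filtering mentioned above can be sketched roughly as follows. This is a hedged illustration only: the `Node` structure, the path-probability threshold, and the sample format are assumptions for the example and are not taken from InFerActive.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    token: str
    prob: float  # conditional probability of this token given its parent path
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(samples: List[List[Tuple[str, float]]]) -> Node:
    """Merge sampled (token, probability) sequences into a shared prefix tree."""
    root = Node(token="<root>", prob=1.0)
    for sample in samples:
        node = root
        for token, prob in sample:
            node = node.children.setdefault(token, Node(token, prob))
    return root


def visible_paths(node: Node,
                  min_path_prob: float,
                  path_prob: float = 1.0,
                  prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[List[str], float]]:
    """Yield each maximal path whose cumulative probability stays above the threshold."""
    kept = [c for c in node.children.values() if path_prob * c.prob >= min_path_prob]
    if not kept and prefix:
        # Nothing below this node survives the filter, so the path ends here.
        yield list(prefix), path_prob
    for child in kept:
        yield from visible_paths(child, min_path_prob,
                                 path_prob * child.prob, prefix + (child.token,))
```

Raising `min_path_prob` collapses low-probability branches, so a reviewer inspects a handful of shared prefixes instead of every sampled response individually.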
HyFinBall: a Hybrid User Interface for Coordinated 2D+3D Visualization in Semi-Immersive VR
Sophisticated 3D visualization applications usually provide coordinated 2D and 3D views. Normally, 3D input devices are used for 3D tasks since they perform better than traditional 2D input devices. However, they do not perform better for 2D tasks. This paper presents a bimanual hybrid user interface that supports four interaction modes: a dual 6-degree-of-freedom (DOF) input device mode, a dual planar constrained 3DOF input device mode, a dual 2-finger multi-touch mode, and 3D hand and finger gestures. The application is a multi-dimensional visualization with coordinated 3D and 2D views on a desktop VR system. The input devices are buttonballs with seamless switching between 3D and 2D device modes, as well as between free-hand finger input and device usage. The 3D and 2D device mode switch automatically switches a buttonball's visual representation between a 3D cursor and a 2D cursor while changing the available user interaction techniques between 3D and 2D interaction techniques to interact with the coordinated views. The paper also provides two formal user studies to evaluate HyFinBall for various dimensional tasks, including 3D, 2D, and cross-dimensional tasks. Our experimental results show the benefits of the HyFinBall interface for cross-dimensional tasks that require 3D and 2D interactions.
Offscript: Automated Auditing of Instruction Adherence in LLMs
Large Language Models (LLMs) and generative search systems are increasingly used for information seeking by diverse populations with varying preferences for knowledge sourcing and presentation. While users can customize LLM behavior through custom instructions and behavioral prompts, no mechanism exists to evaluate whether these instructions are being followed effectively. We present Offscript, an automated auditing tool that efficiently identifies potential instruction-following failures in LLMs. In a pilot study analyzing custom instructions sourced from Reddit, Offscript detected potential deviations from instructed behavior in 86.4% of conversations, 22.2% of which were confirmed as material violations through human review. Our findings suggest that automated auditing serves as a viable approach for evaluating compliance with behavioral instructions related to information seeking.
Supporting Medicinal Chemists in Iterative Hypothesis Generation for Drug Target Identification
While drug discovery is vital for human health, the process remains inefficient. Medicinal chemists must navigate a vast protein space to identify target proteins that meet three criteria: physical and functional interactions, therapeutic impact, and docking potential. Prior approaches have provided fragmented support for each criterion, limiting the generation of promising hypotheses for wet-lab experiments. We present HAPPIER, an AI-powered tool that supports hypothesis generation with integrated multi-criteria support for target identification. HAPPIER enables medicinal chemists to 1) efficiently explore and verify proteins in a single integrated graph component showing multi-criteria satisfaction and 2) validate AI suggestions with domain knowledge. These capabilities facilitate iterative cycles of divergent and convergent thinking, essential for hypothesis generation. We evaluated HAPPIER with ten medicinal chemists, finding that it increased the number of high-confidence hypotheses and support for the iterative cycle, and further demonstrated the relationship between engaging in such cycles and confidence in outputs.
Your plan may succeed, but what about failure? Investigating how people use ChatGPT for long-term life task planning
Long-term life task planning is inherently complex and uncertain, yet little is known about how emerging AI systems support this process. This study investigates how people use ChatGPT for such planning tasks, focusing on user practices, uncertainties, and perceptions of AI assistance. We conducted an interview study with 14 participants who engaged in long-term planning activities using ChatGPT, combining analysis of their prompts and interview responses. The task topics spanned diverse domains, including personal well-being, event planning, and professional learning, and participants used prompts to initiate, refine, and contextualize plans. ChatGPT helped structure complex goals into manageable steps, generate ideas, and sustain motivation, serving as a reflective partner. Yet its outputs were often generic or idealized, lacking personalization, contextual realism, and adaptability, requiring users to actively adapt and verify results. Participants expressed a need for AI systems that provide adaptive and trustworthy guidance while acknowledging uncertainty and potential failure in long-term planning. Our findings show how AI supports long-term life task planning under evolving uncertainty and highlight design implications for systems that are adaptive, uncertainty-aware, and capable of supporting long-term planning as an evolving human-AI collaboration.
Immutable Explainability: Towards Verifiable and Auditable Affective AI
Affective artificial intelligence has made substantial advances in recent years; yet two critical issues persist, particularly in sensitive applications. First, these systems frequently operate as 'black boxes', leaving their decision-making processes opaque. Second, audit logs often lack reliability, as the entity operating the system may alter them. In this work, we introduce the concept of Immutable Explainability, an architecture designed to address both challenges simultaneously. Our approach combines an interpretable inference engine - implemented through fuzzy logic to produce a transparent trace of each decision - with a cryptographic anchoring mechanism that records this trace on a blockchain, ensuring that it is tamper-evident and independently verifiable. To validate the approach, we implemented a heuristic pipeline integrating lexical and prosodic analysis within an explicit Mamdani-type multimodal fusion engine. Each inference generates an auditable record that is subsequently anchored on a public blockchain (Sepolia Testnet). We evaluated the system using the Spanish MEACorpus 2023, employing both the original corpus transcriptions and those generated by Whisper. The results show that our fuzzy-fusion approach outperforms baseline methods (linear and unimodal fusion). Beyond these quantitative outcomes, our primary objective is to establish a foundation for affective AI systems that offer transparent explanations, trustworthy audit trails, and greater user control over personal data.
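The tamper-evidence idea can be illustrated with a small sketch: serialize each decision trace canonically, hash it, and later recompute the hash to check the record against the anchored digest. The trace fields, the function names, and the choice of SHA-256 over canonical JSON are assumptions for this example; the step that writes the digest to a blockchain is deliberately not shown.

```python
import hashlib
import json


def trace_digest(trace: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the trace."""
    # Sorted keys and fixed separators make the encoding independent of key order.
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(trace: dict, anchored_digest: str) -> bool:
    """Check a stored trace against the digest that was anchored earlier."""
    return trace_digest(trace) == anchored_digest


# Hypothetical inference record; the field names are illustrative only.
example_trace = {
    "label": "joy",
    "rules_fired": ["R1", "R3"],
    "score": 0.72,
}
example_digest = trace_digest(example_trace)
```

Because only the digest needs to be published, the trace itself can stay off-chain; any later edit to the stored trace changes the recomputed digest and is therefore detectable by anyone holding the anchored value.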
Visualization Generation with Large Language Models: An Evaluation
The frequent need for analysts to create visualizations to derive insights from data has driven extensive research into Natural Language to Visualization (NL2VIS) generation. While recent progress in large language models (LLMs) suggests their potential to effectively support NL2VIS tasks, existing studies lack a systematic investigation into the performance of different LLMs under various prompt strategies. This paper addresses this gap and contributes a crucial baseline evaluation of LLMs' capabilities in generating visualization specifications for NL2VIS tasks. Our evaluation utilizes the nvBench dataset, employing six representative LLMs and eight distinct prompt strategies to evaluate their performance in generating six target chart types using the Vega-Lite visualization specification. We assess model performance with multiple metrics, including vis accuracy, validity and legality. Our results reveal substantial performance disparities across prompt strategies, chart types, and LLMs. Furthermore, based on the evaluation results, we uncover several counterintuitive behaviors across these dimensions, and propose directions for enhancing the NL2VIS benchmark to better support future NL2VIS research.
Reframing Human-Robot Interaction Through Extended Reality: Unlocking Safer, Smarter, and More Empathic Interactions with Virtual Robots and Foundation Models
This perspective reframes human-robot interaction (HRI) through extended reality (XR), arguing that virtual robots powered by large foundation models (FMs) can serve as cognitively grounded, empathic agents. Unlike physical robots, XR-native agents are unbound by hardware constraints and can be instantiated, adapted, and scaled on demand, while still affording embodiment and co-presence. We synthesize work across XR, HRI, and cognitive AI to show how such agents can support safety-critical scenarios, socially and cognitively empathic interaction across domains, and outreaching physical capabilities with XR and AI integration. We then discuss how multimodal large FMs (e.g., large language model, large vision model, and vision-language model) enable context-aware reasoning, affect-sensitive situations, and long-term adaptation, positioning virtual robots as cognitive and empathic mediators rather than mere simulation assets. At the same time, we highlight challenges and potential risks, including overtrust, cultural and representational bias, privacy concerns around biometric sensing, and data governance and transparency. The paper concludes by outlining a research agenda for human-centered, ethically grounded XR agents - emphasizing multi-layered evaluation frameworks, multi-user ecosystems, mixed virtual-physical embodiment, and societal and ethical design practices to envision XR-based virtual agents powered by FMs as reshaping future HRI into a more efficient and adaptive paradigm.
comment: This paper is under review
Advancing Mathematical Research via Human-AI Interactive Theorem Proving
We investigate how large language models can be used as research tools in scientific computing while preserving mathematical rigor. We propose a human-in-the-loop workflow for interactive theorem proving and discovery with LLMs. Human experts retain control over problem formulation and admissible assumptions, while the model searches for proofs or contradictions, proposes candidate properties and theorems, and helps construct structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts treat these outputs as raw material, further refine them, and organize the results into precise statements and rigorous proofs. We instantiate this workflow in a case study on the connection between manifold optimization and Grover's quantum search algorithm, where the pipeline helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method. The framework provides a practical template for integrating large language models into frontier mathematical research, enabling faster exploration of proof space and algorithm design while maintaining transparent reasoning responsibilities. Although illustrated on manifold optimization problems in quantum computing, the principles extend to other core areas of scientific computing.
Enhancing Hand Palm Motion Gesture Recognition by Eliminating Reference Frame Bias via Frame-Invariant Similarity Measures
The ability of robots to recognize human gestures facilitates a natural and accessible human-robot collaboration. However, most work in gesture recognition remains rooted in reference frame-dependent representations. This poses a challenge when reference frames vary due to different work cell layouts, imprecise frame calibrations, or other environmental changes. This paper investigated the use of invariant trajectory descriptors for robust hand palm motion gesture recognition under reference frame changes. First, a novel dataset of recorded Hand Palm Motion (HPM) gestures is introduced. The motion gestures in this dataset were specifically designed to be distinguishable without dependence on specific reference frames or directional cues. Afterwards, multiple invariant trajectory descriptor approaches were benchmarked to assess how their performances generalize to this novel HPM dataset. After this offline benchmarking, the best scoring approach is validated for online recognition by developing a real-time Proof of Concept (PoC). In this PoC, hand palm motion gestures were used to control the real-time movement of a manipulator arm. The PoC demonstrated a high recognition reliability in real-time operation, achieving an $F_1$-score of 92.3%. This work demonstrates the effectiveness of the invariant descriptor approach as a standalone solution. Moreover, we believe that the invariant descriptor approach can also be utilized within other state-of-the-art pattern recognition and learning systems to improve their robustness against reference frame variations.
comment: This is the preprint version of a paper accepted for publication at the 2025 IEEE International Conference on Automation Science and Engineering (CASE). The final published version is available at DOI: 10.1109/CASE58245.2025.11163910
Screen Reader Programmers in the Vibe Coding Era: Adaptation, Empowerment, and New Accessibility Landscape
Generative AI agents are reshaping human-computer interaction, shifting users from direct task execution to supervising machine-driven actions, especially with the rise of "vibe coding" in programming. Yet little is known about how screen reader programmers interact with AI code assistants in practice. We conducted a longitudinal study with 16 blind and low-vision programmers. Participants completed a GitHub Copilot tutorial, engaged with a programming task, and provided initial feedback. After two weeks of AI-assisted programming, follow-ups examined how their practices and perceptions evolved. Our findings show that code assistants enhanced programming efficiency and bridged accessibility gaps. However, participants struggled to convey intent, interpret AI outputs, and manage multiple views while maintaining situational awareness. They showed diverse preferences for accessibility features, expressed a need to balance automation with control, and encountered barriers when learning to use these tools. Furthermore, we propose design principles and recommendations for more accessible and inclusive human-AI collaborations.
See What I Mean? Expressiveness and Clarity in Robot Display Design
Nonverbal visual symbols and displays play an important role in communication when humans and robots work collaboratively. However, few studies have investigated how different types of nonverbal cues affect objective task performance, especially in a dynamic environment that requires real-time decision-making. In this work, we designed a collaborative navigation task in which the user and the robot each had only partial information about the map, so users were forced to communicate with the robot to complete the task. We conducted our study in a public space and recruited 37 participants who randomly passed by our setup. Each participant collaborated with a robot utilizing either animated anthropomorphic eyes and animated icons, or static anthropomorphic eyes and static icons. We found that participants who interacted with a robot with animated displays reported the greatest level of trust and satisfaction; that participants interpreted static icons the best; and that participants with a robot with static eyes had the highest completion success. These results suggest that while animation can foster trust with robots, human-robot communication can be optimized by the addition of familiar static icons that may be easier for users to interpret. We published our code, designed symbols, and collected results online at: https://github.com/mattufts/huamn_Cozmo_interaction.
Computer Vision and Pattern Recognition
GAINS: Gaussian-based Inverse Rendering from Sparse Multi-View Captures
Recent advances in Gaussian Splatting-based inverse rendering extend Gaussian primitives with shading parameters and physically grounded light transport, enabling high-quality material recovery from dense multi-view captures. However, these methods degrade sharply under sparse-view settings, where limited observations lead to severe ambiguity between geometry, reflectance, and lighting. We introduce GAINS (Gaussian-based Inverse rendering from Sparse multi-view captures), a two-stage inverse rendering framework that leverages learning-based priors to stabilize geometry and material estimation. GAINS first refines geometry using monocular depth/normal and diffusion priors, then employs segmentation, intrinsic image decomposition (IID), and diffusion priors to regularize material recovery. Extensive experiments on synthetic and real-world datasets show that GAINS significantly improves material parameter accuracy, relighting quality, and novel-view synthesis compared to state-of-the-art Gaussian-based inverse rendering methods, especially under sparse-view settings. Project page: https://patrickbail.github.io/gains/
comment: 23 pages, 18 figures
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.
Splatent: Splatting Diffusion Latents for Novel View Synthesis
Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: the VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.
LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating
Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which is more than 63% higher than that of the most competitive baseline, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
comment: 8 pages
NordFKB: a fine-grained benchmark dataset for geospatial AI in Norway
We present NordFKB, a fine-grained benchmark dataset for geospatial AI in Norway, derived from the authoritative, highly accurate, national Felles KartdataBase (FKB). The dataset contains high-resolution orthophotos paired with detailed annotations for 36 semantic classes, including both per-class binary segmentation masks in GeoTIFF format and COCO-style bounding box annotations. Data is collected from seven geographically diverse areas, ensuring variation in climate, topography, and urbanization. Only tiles containing at least one annotated object are included, and training/validation splits are created through random sampling across areas to ensure representative class and context distributions. Human expert review and quality control ensure high annotation accuracy. Alongside the dataset, we release a benchmarking repository with standardized evaluation protocols and tools for semantic segmentation and object detection, enabling reproducible and comparable research. NordFKB provides a robust foundation for advancing AI methods in mapping, land administration, and spatial planning, and paves the way for future expansions in coverage, temporal scope, and data modalities.
comment: 8 pages, 2 figures, 2 tables
VisualActBench: Can VLMs See and Act like a Human?
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos
Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.
Visual Heading Prediction for Autonomous Aerial Vehicles
The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real-time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision-based, data-driven framework for real-time UAV-UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine-tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV's required heading angle. A VICON motion capture system was used to generate ground-truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506° and a root mean squared error of 0.1957°, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision-based, infrastructure-independent solution that demonstrates strong potential for deployment in GPS/GNSS-denied environments, supporting reliable multi-agent coordination under realistic dynamic conditions. A demonstration video showcasing the system's real-time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: https://github.com/Kooroshraf/UAV-UGV-Integration
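For readers unfamiliar with the two error metrics quoted above, the following sketch shows how a mean absolute error and a root mean squared error over heading angles are commonly computed. It assumes angles in degrees and wraps each difference into [-180, 180) before aggregating; whether the paper applies such wrapping is not stated, and the function names are hypothetical.

```python
import math


def wrap_deg(angle: float) -> float:
    """Wrap an angle difference in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def heading_errors(predicted, reference):
    """Return (MAE, RMSE) in degrees between predicted and reference headings."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumption: differences are wrapped so that 359 deg vs 1 deg counts as 2 deg.
    diffs = [wrap_deg(p - r) for p, r in zip(predicted, reference)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Illustrative values only, not data from the paper.
mae, rmse = heading_errors([1.0, 359.0], [0.0, 1.0])
```

RMSE weights large deviations more heavily than MAE, so RMSE is never smaller than MAE on the same set of errors.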
Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
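The agreement figures above are Pearson correlation coefficients between an automatic score and human ratings. A minimal sketch of that computation follows; the score lists are made-up placeholders, not data from the benchmark, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder ratings: one automatic score and one human score per formula pair.
metric_scores = [1.0, 2.0, 3.0, 4.0]
human_scores = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(metric_scores, human_scores)
```

A value near 1 means the metric ranks and spaces formula pairs much as the human raters do, while a value near 0 means it carries almost no linear information about their judgment.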
Diffusion Posterior Sampler for Hyperspectral Unmixing with Spectral Variability Modeling
Linear spectral mixture models (LMM) provide a concise form to disentangle the constituent materials (endmembers) and their corresponding proportions (abundance) in a single pixel. The critical challenges are how to model the spectral prior distribution and spectral variability. Prior knowledge and spectral variability can be rigorously modeled under the Bayesian framework, where posterior estimation of abundance is derived by combining observed data with endmember prior distribution. Considering the key challenges and the advantages of the Bayesian framework, a novel method using a diffusion posterior sampler for semiblind unmixing, denoted as DPS4Un, is proposed to deal with these challenges with the following features: (1) We view the pretrained conditional spectrum diffusion model as a posterior sampler, which can combine the learned endmember prior with observation to get the refined abundance distribution. (2) Instead of using the existing spectral library as prior, which may introduce bias, we establish the image-based endmember bundles within superpixels, which are used to train the endmember prior learner with diffusion model. Superpixels ensure that each sub-scene is more homogeneous. (3) Instead of using the image-level data consistency constraint, the superpixel-based data fidelity term is proposed. (4) The endmember is initialized as Gaussian noise for each superpixel region, and DPS4Un then iteratively updates the abundance and endmember, which contributes to spectral variability modeling. The experimental results on three real-world benchmark datasets demonstrate that DPS4Un outperforms the state-of-the-art hyperspectral unmixing methods.
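As background for the abstract above, the standard linear mixing model and a generic Bayesian posterior can be written as follows. This is the textbook formulation, not necessarily the exact one used by DPS4Un; the symbols ($\mathbf{y}$ for the observed pixel spectrum, $\mathbf{E}$ for the endmember matrix with $K$ columns, $\mathbf{a}$ for the abundance vector, $\mathbf{n}$ for noise) are assumptions introduced here.

```latex
% Linear mixing model for one pixel with K endmembers
\mathbf{y} = \mathbf{E}\,\mathbf{a} + \mathbf{n},
\qquad a_k \ge 0, \qquad \sum_{k=1}^{K} a_k = 1

% Generic posterior over abundances and endmembers given the observation,
% assuming independent priors on a and E
p(\mathbf{a}, \mathbf{E} \mid \mathbf{y}) \;\propto\;
p(\mathbf{y} \mid \mathbf{a}, \mathbf{E})\, p(\mathbf{a})\, p(\mathbf{E})
```

The nonnegativity and sum-to-one constraints express that abundances are proportions; spectral variability corresponds to $\mathbf{E}$ being uncertain rather than fixed.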
MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI
Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
comment: Dataset and Code: https://github.com/fengli-wu/MedForget
UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving
Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
comment: Project Page: https://seed-uniugp.github.io/
Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation
Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities
Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70-80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4-20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 20 FPS, sufficient for real-time assistive applications.
comment: Accepted by WACV 2026
ChronusOmni: Improving Time Awareness of Omni Large Language Models
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
comment: Code available at https://github.com/YJCX330/Chronus/
Composing Concepts from Images and Videos via Concept-prompt Binding
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
comment: Project page: https://refkxh.github.io/BiCo_Webpage/
DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation
Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: 1. the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), 2. the difficulty in retaining fine-grained concept details in reference images, and 3. the restricted scalability to extend to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when injecting reference image features into its dual branches via cross attentions. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.
CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing
U-Net and other U-shaped architectures have achieved significant success in image deconvolution tasks. However, challenges have emerged, as these methods might generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a novel approach for quantifying and comprehending hallucination artifacts to ensure trustworthy computer vision models. Our method, termed the Conformal Hallucination Estimation Metric (CHEM), is applicable to any image reconstruction model, enabling efficient identification and quantification of hallucination artifacts. It offers two key advantages: it leverages wavelet and shearlet representations to efficiently extract hallucinations of image features and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. Furthermore, from an approximation theoretical perspective, we explore the reasons why U-shaped networks are prone to hallucinations. We test the proposed approach on the CANDELS astronomical image dataset with models such as U-Net, SwinUNet, and Learnlets, and provide new perspectives on hallucination from different aspects in deep learning-based image processing.
Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor Segmentation
Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1%, 5%, and 10% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
comment: 9 pages, 3 figures
FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation WACV 2026
Estimating the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image is critical for autonomous operations like in-orbit servicing and space debris removal. Existing state-of-the-art methods often rely on iterative Perspective-n-Point (PnP)-based algorithms, which are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. To overcome these limitations, we propose FastPose-ViT, a Vision Transformer (ViT)-based architecture that directly regresses the 6DoF pose. Our approach processes cropped images from object bounding boxes and introduces a novel mathematical formalism to map these localized predictions back to the full-image scale. This formalism is derived from the principles of projective geometry and the concept of "apparent rotation", where the model predicts an apparent rotation matrix that is then corrected to find the true orientation. We demonstrate that our method outperforms other non-PnP strategies and achieves performance competitive with state-of-the-art PnP-based techniques on the SPEED dataset. Furthermore, we validate our model's suitability for real-world space missions by quantizing it and deploying it on power-constrained edge hardware. On the NVIDIA Jetson Orin Nano, our end-to-end pipeline achieves a latency of ~75 ms per frame under sequential execution, and a non-blocking throughput of up to 33 FPS when stages are scheduled concurrently.
comment: Accepted to WACV 2026. Preprint version
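Illustrative note (not from the paper): the FastPose-ViT abstract reports two different figures, a per-frame latency of about 75 ms under sequential execution and a throughput of up to 33 FPS when stages run concurrently. The sketch below shows why both can hold at once. The three-stage split and its timings are hypothetical values chosen only to be consistent with those two reported numbers; the abstract does not give per-stage timings.

```python
from typing import Sequence


def sequential_fps(stage_ms: Sequence[float]) -> float:
    """Frames per second when each frame passes through all stages before the next starts."""
    return 1000.0 / sum(stage_ms)


def pipelined_fps(stage_ms: Sequence[float]) -> float:
    """Frames per second when stages overlap; the slowest stage sets the rate."""
    return 1000.0 / max(stage_ms)


# Hypothetical stage timings in milliseconds (assumption, not from the paper).
# They sum to 75 ms, and the slowest stage takes 30 ms.
stages_ms = [20.0, 30.0, 25.0]

print(f"sequential: {sequential_fps(stages_ms):.1f} FPS")
print(f"pipelined:  {pipelined_fps(stages_ms):.1f} FPS")
```

Under this idealized model, a 75 ms sequential latency corresponds to roughly 13 FPS, while a 33 FPS pipelined throughput implies a bottleneck stage of roughly 30 ms. Concurrent scheduling raises throughput but does not shorten the time a single frame spends in the pipeline.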
PathCo-LatticE: Pathology-Constrained Lattice-Of Experts Framework for Fully-supervised Few-Shot Cardiac MRI Segmentation
Few-shot learning (FSL) mitigates data scarcity in cardiac MRI segmentation but typically relies on semi-supervised techniques sensitive to domain shifts and validation bias, restricting zero-shot generalizability. We propose PathCo-LatticE, a fully supervised FSL framework that replaces unlabeled data with pathology-guided synthetic supervision. First, our Virtual Patient Engine models continuous latent disease trajectories from sparse clinical anchors, using generative modeling to synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) provides a leakage-free protocol that evaluates models online with progressively challenging synthetic samples, eliminating the need for real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks within a pathology-aware topology and activates the most relevant experts per input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning. We evaluated PathCo-LatticE in a strict out-of-distribution (OOD) setting, deriving all anchors and severity statistics from a single-source domain (ACDC) and performing zero-shot testing on the multi-center, multi-vendor M&Ms dataset. PathCo-LatticE outperforms four state-of-the-art FSL methods by 4.2-11% Dice starting from only 7 labeled anchors, and approaches fully supervised performance (within 1% Dice) with only 19 labeled anchors. The method shows superior harmonization across four vendors and generalization to unseen pathologies. [Code will be made publicly available].
Stylized Meta-Album: Group-bias injection with style transfer to study robustness against distribution shifts
We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets, and 12 stylized datasets), designed to advance studies on out-of-distribution (OOD) generalization and related topics. Created using style transfer techniques from 12 subject classification datasets, SMA provides a diverse and extensive set of 4800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. SMA enables flexible control over groups and classes, allowing us to configure datasets to reflect diverse benchmark scenarios. While ideally, data collection would capture extensive group diversity, practical constraints often make this infeasible. SMA addresses this by enabling large and configurable group structures through flexible control over styles, subject classes, and domains, allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity, but also opens new methodological directions for evaluating model performance across diverse group and domain configurations (including scenarios with many minority groups, varying group imbalance, and complex domain shifts) and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark leveraging SMA's domain, class, and group diversity to evaluate existing benchmarks. Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms. We also propose to use Top-M worst group accuracy as a new hyperparameter tuning metric, demonstrating broader fairness during optimization and delivering better final worst-group accuracy for larger group diversity. (2) An unsupervised domain adaptation (UDA) benchmark utilizing SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with lower error bars (reduced by 73% and 28% in closed-set setting and UniDA setting, respectively) compared to existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
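Illustrative note (not from the paper): the Stylized Meta-Album abstract names Top-M worst group accuracy as a tuning metric but does not define it. One plausible reading, assumed here, is the mean accuracy over the M groups with the lowest accuracy, so that M = 1 recovers the usual worst-group accuracy. The function name and the dictionary input are hypothetical.

```python
from typing import Mapping


def top_m_worst_group_accuracy(group_accuracy: Mapping[str, float], m: int) -> float:
    """Mean accuracy over the m lowest-accuracy groups (assumed definition).

    With m == 1 this equals the standard worst-group accuracy.
    """
    if not 1 <= m <= len(group_accuracy):
        raise ValueError("m must be between 1 and the number of groups")
    worst = sorted(group_accuracy.values())[:m]
    return sum(worst) / m


# Hypothetical per-group validation accuracies.
accs = {"sketch/cat": 0.9, "oil/cat": 0.5, "oil/dog": 0.7}
worst_1 = top_m_worst_group_accuracy(accs, 1)
worst_2 = top_m_worst_group_accuracy(accs, 2)
```

Averaging over several of the weakest groups makes the selection criterion less sensitive to a single noisy group, which may matter when the number of groups is large.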
LiM-YOLO: Less is More with Pyramid Level Shift and Normalized Auxiliary Branch for Ship Detection in Optical Remote Sensing Imagery
Applying general-purpose object detectors to ship detection in satellite imagery presents significant challenges due to the extreme scale disparity and morphological anisotropy of maritime targets. Standard architectures utilizing stride-32 (P5) layers often fail to resolve narrow vessels, resulting in spatial feature dilution. In this work, we propose LiM-YOLO, a specialized detector designed to resolve these domain-specific conflicts. Based on a statistical analysis of ship scales, we introduce a Pyramid Level Shift Strategy that reconfigures the detection head to P2-P4. This shift ensures compliance with Nyquist sampling criteria for small objects while eliminating the computational redundancy of deep layers. To further enhance training stability on high-resolution inputs, we incorporate a Group Normalized Convolutional Block for Linear Projection (GN-CBLinear), which mitigates gradient volatility in micro-batch settings. Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO demonstrates superior detection accuracy and efficiency compared to state-of-the-art models. The code is available at https://github.com/egshkim/LiM-YOLO.
comment: 16 pages, 8 figures, 9 tables
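Illustrative note (not from the paper): the LiM-YOLO abstract argues that stride-32 (P5) features cannot resolve narrow vessels and moves the detection head to P2-P4. The sketch below works through the underlying arithmetic, using the common convention that pyramid level Pk has stride 2^k. The two-cell threshold is an assumed stand-in for a sampling criterion, and the function names are hypothetical; the paper's exact criterion may differ.

```python
from typing import List, Sequence


def stride(level: int) -> int:
    """Downsampling stride of pyramid level P<level> (P2 -> 4, ..., P5 -> 32)."""
    return 2 ** level


def cells_spanned(width_px: float, level: int) -> float:
    """Number of feature-map cells covered by an object side of width_px pixels."""
    return width_px / stride(level)


def resolvable_levels(
    width_px: float, levels: Sequence[int], min_cells: float = 2.0
) -> List[int]:
    """Levels at which the object side spans at least min_cells cells (assumed criterion)."""
    return [lv for lv in levels if cells_spanned(width_px, lv) >= min_cells]


# A hypothetical ship whose narrow side is 10 pixels wide.
print(resolvable_levels(10, [2, 3, 4, 5]))
```

For a 10-pixel-wide target, only P2 keeps the narrow side above two cells, while at P5 the whole width falls inside a fraction of one cell, which is one way to read the "spatial feature dilution" the abstract mentions.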
Unconsciously Forget: Mitigating Memorization; Without Knowing What is being Memorized
Recent advances in generative models have demonstrated an exceptional ability to produce highly realistic images. However, previous studies show that generated images often resemble the training data, and this problem becomes more severe as the model size increases. Memorizing training data can lead to legal challenges, including copyright infringement, violations of portrait rights, and trademark violations. Existing approaches to mitigating memorization mainly focus on manipulating the denoising sampling process to steer image embeddings away from the memorized embedding space or employ unlearning methods that require training on datasets containing specific sets of memorized concepts. However, existing methods often incur substantial computational overhead during sampling, or focus narrowly on removing one or more groups of target concepts, imposing a significant limitation on their scalability. To understand and mitigate these problems, our work, UniForget, offers a new perspective on understanding the root cause of memorization. Our work demonstrates that specific parts of the model are responsible for copyrighted content generation. By applying model pruning, we can effectively suppress the probability of generating copyrighted content without targeting specific concepts while preserving the general generative capabilities of the model. Additionally, we show that our approach is both orthogonal and complementary to existing unlearning methods, thereby highlighting its potential to improve current unlearning and de-memorization techniques.
An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence
The proliferation of satellite constellations, coupled with reduced tasking latency and diverse sensor capabilities, has expanded the opportunities for automated Earth observation. This paper introduces a fully automated Tip-and-Cue framework designed for satellite imaging tasking and scheduling. In this context, tips are generated from external data sources or analyses of prior satellite imagery, identifying spatiotemporal targets and prioritizing them for downstream planning. Corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system autonomously generates candidate tasks, optimizes their scheduling across multiple satellites using continuous utility functions that reflect the expected value of each observation, and processes the resulting imagery using artificial-intelligence-based models, including object detectors and vision-language models. Structured visual reports are generated to support both interpretability and the identification of new insights for downstream tasking. The efficacy of the framework is demonstrated through a maritime vessel tracking scenario, utilizing Automatic Identification System (AIS) data for trajectory prediction, targeted observations, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application, often used to benchmark novel approaches to satellite tasking, forecasting, and analysis. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical.
comment: Under review at IEEE Transactions on Geoscience and Remote Sensing (TGRS). 13 pages, 8 figures
OxEnsemble: Fair Ensembles for Low-Data Classification
We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences. We propose a novel approach, OxEnsemble, for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, OxEnsemble is both data-efficient, carefully reusing held-out data to enforce fairness reliably, and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.
SynthPix: A lightspeed PIV images generator
We describe SynthPix, a synthetic image generator for Particle Image Velocimetry (PIV) with a focus on performance and parallelism on accelerators, implemented in JAX. SynthPix supports the same configuration parameters as existing tools but achieves a throughput several orders of magnitude higher in image-pair generation per second. SynthPix was developed to enable the training of data-hungry reinforcement learning methods for flow estimation and for reducing the iteration times during the development of fast flow estimation methods used in recent active fluids control studies with real-time PIV feedback. We believe SynthPix to be useful for the fluid dynamics community, and in this paper we describe the main ideas behind this software package.
comment: Code: https://github.com/antonioterpin/synthpix
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.
VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification
Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
Kaapana: A Comprehensive Open-Source Platform for Integrating AI in Medical Imaging Research Environments
Developing generalizable AI for medical imaging requires both access to large, multi-center datasets and standardized, reproducible tooling within research environments. However, leveraging real-world imaging data in clinical research environments is still hampered by strict regulatory constraints, fragmented software infrastructure, and the challenges inherent in conducting large-cohort multicentre studies. This leads to projects that rely on ad-hoc toolchains that are hard to reproduce, difficult to scale beyond single institutions and poorly suited for collaboration between clinicians and data scientists. We present Kaapana, a comprehensive open-source platform for medical imaging research that is designed to bridge this gap. Rather than building single-use, site-specific tooling, Kaapana provides a modular, extensible framework that unifies data ingestion, cohort curation, processing workflows and result inspection under a common user interface. By bringing the algorithm to the data, it enables institutions to keep control over their sensitive data while still participating in distributed experimentation and model development. By integrating flexible workflow orchestration with user-facing applications for researchers, Kaapana reduces technical overhead, improves reproducibility and enables conducting large-scale, collaborative, multi-centre imaging studies. We describe the core concepts of the platform and illustrate how they can support diverse use cases, from local prototyping to nation-wide research networks. The open-source codebase is available at https://github.com/kaapana/kaapana
Benchmarking SAM2-based Trackers on FMOX
Several object tracking pipelines extending Segment Anything Model 2 (SAM2) have been proposed in the past year, where the approach is to follow and segment the object from a single exemplar template provided by the user on an initialization frame. We propose to benchmark these high-performing trackers (SAM2, EfficientTAM, DAM4SAM and SAMURAI) on datasets containing fast moving objects (FMO) specifically designed to be challenging for tracking approaches. The goal is to better understand current limitations in state-of-the-art trackers by providing more detailed insights on the behavior of these trackers. We show that overall the trackers DAM4SAM and SAMURAI perform well on more challenging sequences.
Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder
Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely 'approaching', 'grabbing', and 'holding'. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis posited that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counter-intuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification converted the network's function, compelling it to act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final score of 97.60%. Of particular note, our optimized model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
comment: Code available at: https://github.com/YousefAMovahed/beyond-sequences-hoi-benchmark
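Illustrative note (not from the paper): the "Beyond Sequences" abstract reports that setting the sequence length of a bidirectional RNN to one makes it act as a static feature encoder. The sketch below shows why, using a plain tanh (Elman) cell with a zero initial hidden state. Both the cell type and the zero initial state are assumptions, and the authors' model may use a different cell. With a single time step, the recurrent weights multiply a zero vector and drop out, so the output depends only on the current input.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Sequence[float]) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def run_rnn(seq: Sequence[Sequence[float]], W_x: Matrix, W_h: Matrix, b: Vector) -> Vector:
    """Final hidden state of a tanh RNN started from a zero hidden state."""
    h = [0.0] * len(b)
    for x in seq:
        a, r = matvec(W_x, x), matvec(W_h, h)
        h = [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]
    return h


def static_encoder(x: Sequence[float], W_x: Matrix, b: Vector) -> Vector:
    """Feed-forward layer tanh(W_x x + b) with no recurrence."""
    return [math.tanh(ai + bi) for ai, bi in zip(matvec(W_x, x), b)]


# Hypothetical weights: two different recurrent matrices, same input weights.
W_x = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
W_h_a = [[0.9, 0.0], [0.0, 0.9]]
W_h_b = [[-1.0, 2.0], [3.0, -4.0]]
x = [1.0, 2.0]

h_a = run_rnn([x], W_x, W_h_a, b)  # sequence of length one
h_b = run_rnn([x], W_x, W_h_b, b)
h_static = static_encoder(x, W_x, b)
```

Here `h_a`, `h_b`, and `h_static` coincide, so the recurrent weights have no effect on the output. In a bidirectional layer the forward and backward directions each take this single step, so the concatenated output is two independent nonlinear projections of the same feature vector.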
FROMAT: Multiview Material Appearance Transfer via Few-Shot Self-Attention Adaptation
Multiview diffusion models have rapidly emerged as a powerful tool for content creation with spatial consistency across viewpoints, offering rich visual realism without requiring explicit geometry and appearance representation. However, compared to meshes or radiance fields, existing multiview diffusion models offer limited appearance manipulation, particularly in terms of material, texture, or style. In this paper, we present a lightweight adaptation technique for appearance transfer in multiview diffusion models. Our method learns to combine object identity from an input image with appearance cues rendered in a separate reference image, producing multi-view-consistent output that reflects the desired materials, textures, or styles. This allows explicit specification of appearance parameters at generation time while preserving the underlying object geometry and view coherence. We leverage three diffusion denoising processes responsible for generating the original object, the reference, and the target images, and perform reverse sampling to aggregate a small subset of layer-wise self-attention features from the object and the reference to influence the target generation. Our method requires only a few training examples to introduce appearance awareness to pretrained multiview models. The experiments show that our method provides a simple yet effective way toward multiview generation with diverse appearance, advocating the adoption of implicit generative 3D representations in practice.
Rethinking Chain-of-Thought Reasoning for Videos
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
comment: Technical report
ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation
People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.
comment: 24 pages, 10 figures
UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories AAAI 2026
Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.
comment: 9 pages, 5 figures, accepted to AAAI 2026
CS3D: An Efficient Facial Expression Recognition via Event Vision
Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Nowadays, event cameras are becoming more widely adopted as they surpass RGB cameras in capturing facial expression changes due to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, making them difficult to deploy on edge computing devices and thereby increasing costs, especially for high-frequency, dynamic, event vision-based approaches. To address this challenging issue, we propose the CS3D framework, which decomposes the Convolutional 3D method to reduce computational complexity and energy consumption. Additionally, by utilizing soft spiking neurons and a spatial-temporal attention mechanism, the ability to retain information is enhanced, thus improving the accuracy of facial expression detection. Experimental results indicate that our proposed CS3D method attains higher accuracy on multiple datasets compared to architectures such as the RNN, Transformer, and C3D, while the energy consumption of the CS3D method is just 21.97% of that required by the original C3D on the same device.
UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art results on several benchmarks. Project Page: https://alberto-rota.github.io/UnReflectAnything/
Content-Adaptive Image Retouching Guided by Attribute-Based Text Representation
Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
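The content-adaptive curve mapping described in the abstract above can be illustrated with a minimal sketch. It assumes two hypothetical basis curves (an identity curve and a squaring curve) and hand-written per-pixel weights; in the paper the weight maps are learned, and the abstract does not specify the actual basis curves.

```python
from typing import Callable, List, Sequence

Curve = Callable[[float], float]

# Hypothetical basis curves on [0, 1]; the paper's curves are not given in the abstract.
BASIS_CURVES: List[Curve] = [
    lambda v: v,      # identity curve
    lambda v: v * v,  # darkening curve
]


def apply_curve_mapping(
    pixels: Sequence[float],
    weights: Sequence[Sequence[float]],
    curves: Sequence[Curve] = BASIS_CURVES,
) -> List[float]:
    """Map each pixel through a per-pixel weighted sum of the basis curves."""
    out = []
    for value, pixel_weights in zip(pixels, weights):
        out.append(sum(w * curve(value) for w, curve in zip(pixel_weights, curves)))
    return out


# Three pixels with the same input value but different (hand-written) weights.
pixels = [0.5, 0.5, 0.5]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(apply_curve_mapping(pixels, weights))
```

Because the weights vary per pixel, identical input values can map to different outputs, which is the property the abstract attributes to spatial context.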
Hands-on Evaluation of Visual Transformers for Object Recognition and Detection
Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
Seeing Soil from Space: Towards Robust and Scalable Remote Soil Nutrient Analysis
Environmental variables are increasingly affecting agricultural decision-making, yet accessible and scalable tools for soil assessment remain limited. This study presents a robust and scalable modeling system for estimating soil properties in croplands, including soil organic carbon (SOC), total nitrogen (N), available phosphorus (P), exchangeable potassium (K), and pH, using remote sensing data and environmental covariates. The system employs a hybrid modeling approach, combining the indirect methods of modeling soil through proxies and drivers with direct spectral modeling. We extend current approaches by using interpretable physics-informed covariates derived from radiative transfer models (RTMs) and complex, nonlinear embeddings from a foundation model. We validate the system on a harmonized dataset that covers Europe's cropland soils across diverse pedoclimatic zones. Evaluation is conducted under a robust validation framework that enforces strict spatial blocking, stratified splits, and statistically distinct train-test sets, which deliberately make the evaluation harder and produce more realistic error estimates for unseen regions. The models achieved their highest accuracy for SOC and N. This performance held across unseen locations, under both spatial cross-validation and an independent test set. SOC obtained a MAE of 5.12 g/kg and a CCC of 0.77, and N obtained a MAE of 0.44 g/kg and a CCC of 0.77. We also assess uncertainty through conformal calibration, achieving 90 percent coverage at the target confidence level. This study contributes to the digital advancement of agriculture through the application of scalable, data-driven soil analysis frameworks that can be extended to related domains requiring quantitative soil evaluation, such as carbon markets.
comment: 23 pages, 13 figures, 13 tables
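The two metrics reported in the abstract above, mean absolute error (MAE) and Lin's concordance correlation coefficient (CCC), can be computed as in the following sketch. The numbers in the usage lines are toy values for illustration, not data from the paper, and population (1/n) variances are assumed.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def ccc(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    # Unlike Pearson correlation, the squared mean offset penalizes bias.
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Toy example: predictions perfectly correlated but offset by +1.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
print(mae(observed, predicted), ccc(observed, predicted))
```

In the toy example the Pearson correlation is 1, yet the CCC is below 1 because of the constant offset, which is why CCC is a stricter agreement measure.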
Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment
Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.
From Graphs to Gates: DNS-HyXNet, A Lightweight and Deployable Sequential Model for Real-Time DNS Tunnel Detection
Domain Name System (DNS) tunneling remains a covert channel for data exfiltration and command-and-control communication. Although graph-based methods such as GraphTunnel achieve strong accuracy, they introduce significant latency and computational overhead due to recursive parsing and graph construction, limiting their suitability for real-time deployment. This work presents DNS-HyXNet, a lightweight extended Long Short-Term Memory (xLSTM) hybrid framework designed for efficient sequence-based DNS tunnel detection. DNS-HyXNet integrates tokenized domain embeddings with normalized numerical DNS features and processes them through a two-layer xLSTM network that directly learns temporal dependencies from packet sequences, eliminating the need for graph reconstruction and enabling single-stage multi-class classification. The model was trained and evaluated on two public benchmark datasets with carefully tuned hyperparameters to ensure low memory consumption and fast inference. Across all experimental splits of the DNS-Tunnel-Datasets, DNS-HyXNet achieved up to 99.99% accuracy, with macro-averaged precision, recall, and F1-scores exceeding 99.96%, and demonstrated a per-sample detection latency of just 0.041 ms, confirming its scalability and real-time readiness. These results show that sequential modeling with xLSTM can effectively replace computationally expensive recursive graph generation, offering a deployable and energy-efficient alternative for real-time DNS tunnel detection on commodity hardware.
Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment ICONIP
Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
comment: Accepted to the ICONIP (International Conference on Neural Information Processing), 2025
A Dual-Domain Convolutional Network for Hyperspectral Single-Image Super-Resolution
This study presents a lightweight dual-domain super-resolution network (DDSRNet) that combines Spatial-Net with the discrete wavelet transform (DWT). Specifically, our proposed model comprises three main components: (1) a shallow feature extraction module, termed Spatial-Net, which performs residual learning and bilinear interpolation; (2) a low-frequency enhancement branch based on the DWT that refines coarse image structures; and (3) a shared high-frequency refinement branch that simultaneously enhances the LH (horizontal), HL (vertical), and HH (diagonal) wavelet subbands using a single CNN with shared weights. As a result, the DWT enables subband decomposition, while the inverse DWT reconstructs the final high-resolution output. By doing so, the integration of spatial- and frequency-domain learning enables DDSRNet to achieve highly competitive performance with low computational cost on three hyperspectral image datasets, demonstrating its effectiveness for hyperspectral image super-resolution.
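The subband decomposition and reconstruction mentioned in the abstract above can be sketched with a one-level 2D Haar transform. The abstract does not state which wavelet DDSRNet uses, so Haar is an assumption here, and the subband naming (LH for row differences, HL for column differences) follows one common convention that may differ from the paper's.

```python
from typing import List, Tuple

Image = List[List[float]]


def haar_dwt2(img: Image) -> Tuple[Image, Image, Image, Image]:
    """One-level orthonormal 2D Haar DWT of an even-sized image: (LL, LH, HL, HH)."""
    h, w = len(img), len(img[0])
    assert h % 2 == 0 and w % 2 == 0, "image sides must be even"
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for x in range(0, w, 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            row_ll.append((a + b + c + d) / 2)  # coarse structure
            row_lh.append((a + b - c - d) / 2)  # difference between rows
            row_hl.append((a - b + c - d) / 2)  # difference between columns
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll: Image, lh: Image, hl: Image, hh: Image) -> Image:
    """Inverse of haar_dwt2: rebuild the full-resolution image from four subbands."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            s, r, c, g = ll[y][x], lh[y][x], hl[y][x], hh[y][x]
            img[2 * y][2 * x] = (s + r + c + g) / 2
            img[2 * y][2 * x + 1] = (s + r - c - g) / 2
            img[2 * y + 1][2 * x] = (s - r + c - g) / 2
            img[2 * y + 1][2 * x + 1] = (s - r - c + g) / 2
    return img
```

In a network of this kind, the enhancement branches would operate on the subbands between the two calls; here the round trip simply reproduces the input.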
Masked Registration and Autoencoding of CT Images for Predictive Tibia Reconstruction MICCAI
Surgical planning for complex tibial fractures can be challenging for surgeons, as the 3D structure of the later desirable bone alignment may be difficult to imagine. To assist in such planning, we address the challenge of predicting a patient-specific reconstruction target from a CT of the fractured tibia. Our approach combines neural registration and autoencoder models. Specifically, we first train a modified spatial transformer network (STN) to register a raw CT to a standardized coordinate system of a jointly trained tibia prototype. Subsequently, various autoencoder (AE) architectures are trained to model healthy tibial variations. Both the STN and AE models are further designed to be robust to masked input, allowing us to apply them to fractured CTs and decode to a prediction of the patient-specific healthy bone in standard coordinates. Our contributions include: i) a 3D-adapted STN for global spatial registration, ii) a comparative analysis of AEs for bone CT modeling, and iii) the extension of both to handle masked inputs for predictive generation of healthy bone structures. Project page: https://github.com/HongyouZhou/repair
comment: DGM4MICCAI
ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics
Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenarios. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual-Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
Gradient-Guided Learning Network for Infrared Small Target Detection
Recently, infrared small target detection has attracted extensive attention. However, due to the small size and the lack of intrinsic features of infrared small targets, existing methods generally suffer from inaccurate edge positioning, and targets are easily submerged by the background. Therefore, we propose an innovative gradient-guided learning network (GGL-Net). Specifically, we are the first to explore the introduction of gradient magnitude images into the deep learning-based infrared small target detection method, which is conducive to emphasizing the edge details and alleviating the problem of inaccurate edge positioning of small targets. On this basis, we propose a novel dual-branch feature extraction network that utilizes the proposed gradient supplementary module (GSM) to encode raw gradient information into deeper network layers and embeds attention mechanisms reasonably to enhance feature extraction ability. In addition, we construct a two-way guidance fusion module (TGFM), which fully considers the characteristics of feature maps at different levels. It can facilitate the effective fusion of multi-scale feature maps and extract richer semantic information and detailed information through reasonable two-way guidance. Extensive experiments show that GGL-Net achieves state-of-the-art results on the public real NUAA-SIRST dataset and the public synthetic NUDT-SIRST dataset. Our code has been integrated into https://github.com/YuChuang1205/MSDA-Net
comment: Accepted by GRSL 2023
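A gradient magnitude image of the kind the GGL-Net abstract above refers to can be sketched as follows. The abstract does not say which gradient operator is used, so the 3x3 Sobel kernels here are an assumption, and borders are simply dropped rather than padded.

```python
import math
from typing import List, Sequence

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # responds to horizontal intensity change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to vertical intensity change


def gradient_magnitude(img: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the Sobel gradient magnitude over the valid interior of a 2D image."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * v
                    gy += SOBEL_Y[dy][dx] * v
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


# A vertical step edge gives a strong response; a flat image gives none.
step = [[0, 0, 1, 1] for _ in range(4)]
print(gradient_magnitude(step))
```

The output has shape (H-2) x (W-2); a real pipeline would typically pad the input so the gradient image matches the infrared frame size.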
StateSpace-SSL: Linear-Time Self-supervised Learning for Plant Disease Detection AAAI
Self-supervised learning (SSL) is attractive for plant disease detection as it can exploit large collections of unlabeled leaf images, yet most existing SSL methods are built on CNNs or vision transformers that are poorly matched to agricultural imagery. CNN-based SSL struggles to capture disease patterns that evolve continuously along leaf structures, while transformer-based SSL introduces quadratic attention cost from high-resolution patches. To address these limitations, we propose StateSpace-SSL, a linear-time SSL framework that employs a Vision Mamba state-space encoder to model long-range lesion continuity through directional scanning across the leaf surface. A prototype-driven teacher-student objective aligns representations across multiple views, encouraging stable and lesion-aware features from labelled data. Experiments on three publicly available plant disease datasets show that StateSpace-SSL consistently outperforms the CNN- and transformer-based SSL baselines in various evaluation metrics. Qualitative analyses further confirm that it learns compact, lesion-focused feature maps, highlighting the advantage of linear state-space modelling for self-supervised plant disease representation learning.
comment: Accepted to AAAI workshop (AgriAI 2026)
MODA: The First Challenging Benchmark for Multispectral Object Detection in Aerial Images
Aerial object detection faces significant challenges in real-world scenarios, such as small objects and extensive background interference, which limit the performance of RGB-based detectors with insufficient discriminative information. Multispectral images (MSIs) capture additional spectral cues across multiple bands, offering a promising alternative. However, the lack of training data has been the primary bottleneck to exploiting the potential of MSIs. To address this gap, we introduce the first large-scale dataset for Multispectral Object Detection in Aerial images (MODA), which comprises 14,041 MSIs and 330,191 annotations across diverse, challenging scenarios, providing a comprehensive data foundation for this field. Furthermore, to overcome challenges inherent to aerial object detection using MSIs, we propose OSSDet, a framework that integrates spectral and spatial information with object-aware cues. OSSDet employs a cascaded spectral-spatial modulation structure to optimize target perception, aggregates spectrally related features by exploiting spectral similarities to reinforce intra-object correlations, and suppresses irrelevant background via object-aware masking. Moreover, cross-spectral attention further refines object-related representations under explicit object-aware guidance. Extensive experiments demonstrate that OSSDet outperforms existing methods with comparable parameters and efficiency.
comment: 8 pages, 9 figures
Color encoding in Latent Space of Stable Diffusion Models
Recent advances in diffusion-based generative models have achieved remarkable visual fidelity, yet a detailed understanding of how specific perceptual attributes - such as color and shape - are internally represented remains limited. This work explores how color is encoded in a generative model through a systematic analysis of the latent representations in Stable Diffusion. Through controlled synthetic datasets, principal component analysis (PCA) and similarity metrics, we reveal that color information is encoded along circular, opponent axes predominantly captured in latent channels c_3 and c_4, whereas intensity and shape are primarily represented in channels c_1 and c_2. Our findings indicate that the latent space of Stable Diffusion exhibits an interpretable structure aligned with an efficient coding representation. These insights provide a foundation for future work in model understanding, editing applications, and the design of more disentangled generative frameworks.
comment: 6 pages, 8 figures, Color Imaging Conference 33
Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach
Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span $(t=2)$, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
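The non-overlapping tubelet partition with temporal span t=2 described in the abstract above can be sketched as follows. This shows only the partitioning step for a single-channel sequence; the spatial patch size `p` is a hypothetical value, and the learned projection that a 3D convolution applies to each tubelet is omitted.

```python
from typing import List, Sequence

Video = Sequence[Sequence[Sequence[float]]]  # indexed as [time][row][column]


def extract_tubelets(video: Video, t: int = 2, p: int = 2) -> List[List[float]]:
    """Split a T x H x W sequence into flattened, non-overlapping t x p x p tubelets."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    assert T % t == 0 and H % p == 0 and W % p == 0, "dimensions must divide evenly"
    tubelets = []
    for t0 in range(0, T, t):
        for y0 in range(0, H, p):
            for x0 in range(0, W, p):
                tubelets.append([
                    video[t0 + dt][y0 + dy][x0 + dx]
                    for dt in range(t)
                    for dy in range(p)
                    for dx in range(p)
                ])
    return tubelets


# Toy 4 x 4 x 4 sequence whose values encode their own position.
video = [[[ti * 16 + y * 4 + x for x in range(4)] for y in range(4)] for ti in range(4)]
tubelets = extract_tubelets(video)
print(len(tubelets), tubelets[0])
```

A 3D convolution whose stride equals its kernel size is equivalent to applying one shared linear projection to each of these flattened tubelets, so each token mixes information from only two consecutive acquisitions.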
LiePrune: Lie Group and Quantum Geometric Dual Representation for One-Shot Structured Pruning of Quantum Neural Networks
Quantum neural networks (QNNs) and parameterized quantum circuits (PQCs) are key building blocks for near-term quantum machine learning. However, their scalability is constrained by excessive parameters, barren plateaus, and hardware limitations. We propose LiePrune, the first mathematically grounded one-shot structured pruning framework for QNNs that leverages Lie group structure and quantum geometric information. Each gate is jointly represented in a Lie group--Lie algebra dual space and a quantum geometric feature space, enabling principled redundancy detection and aggressive compression. Experiments on quantum classification (MNIST, FashionMNIST), quantum generative modeling (Bars-and-Stripes), and quantum chemistry (LiH VQE) show that LiePrune achieves over $10\times$ compression with negligible or even improved task performance, while providing provable guarantees on redundancy detection, functional approximation, and computational complexity.
comment: 7 pages, 2 figures
Privacy-Preserving Computer Vision for Industry: Three Case Studies in Human-Centric Manufacturing AAAI26
The adoption of AI-powered computer vision in industry is often constrained by the need to balance operational utility with worker privacy. Building on our previously proposed privacy-preserving framework, this paper presents its first comprehensive validation on real-world data collected directly by industrial partners in active production environments. We evaluate the framework across three representative use cases: woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment. The approach employs learned visual transformations that obscure sensitive or task-irrelevant information while retaining features essential for task performance. Through both quantitative evaluation of the privacy-utility trade-off and qualitative feedback from industrial partners, we assess the framework's effectiveness, deployment feasibility, and trust implications. Results demonstrate that task-specific obfuscation enables effective monitoring with reduced privacy risks, establishing the framework's readiness for real-world adoption and providing cross-domain recommendations for responsible, human-centric AI deployment in industry.
comment: Accepted to the AAAI26 HCM workshop
Cytoplasmic Strings Analysis in Human Embryo Time-Lapse Videos using Deep Learning Framework
Infertility is a major global health issue, and while in-vitro fertilization (IVF) has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging (TLI) enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic strings (CS), thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and severely affected by detection and subtle visual appearance. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters. NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: https://github.com/HamadYA/CS_Detection.
Sequential Testing for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments
We propose a descriptor-agnostic, multi-frame loop closure verification method that formulates LiDAR loop closure as a truncated Sequential Probability Ratio Test (SPRT). Instead of deciding from a single descriptor comparison or using fixed thresholds with late-stage Iterative Closest Point (ICP) vetting, the verifier accumulates a short temporal stream of descriptor similarities between a query and each candidate. It then issues an accept/reject decision adaptively once sufficient multi-frame evidence has been observed, according to user-specified Type-I/II error design targets. This precision-first policy is designed to suppress false positives in structurally repetitive indoor environments. We evaluate the verifier on a five-sequence library dataset, using a fixed retrieval front-end with several representative LiDAR global descriptors. Performance is assessed via segment-level K-hit precision-recall and absolute trajectory error (ATE) and relative pose error (RPE) after pose graph optimization. Across descriptors, the sequential verifier consistently improves precision and reduces the impact of aliased loops compared with single-frame and heuristic multi-frame baselines. Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.
comment: 8 pages, 4 figures
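The truncated Sequential Probability Ratio Test in the loop closure abstract above can be sketched as follows. The Gaussian similarity models (`mu0`, `mu1`, `sigma`), the error targets, and the frame budget are all hypothetical values chosen for illustration; the abstract does not specify the likelihood model, and rejecting at truncation is one way to realize a precision-first policy.

```python
import math
from typing import Sequence, Tuple


def truncated_sprt(
    similarities: Sequence[float],
    mu0: float = 0.3,    # assumed mean similarity for a false loop (H0)
    mu1: float = 0.7,    # assumed mean similarity for a true loop (H1)
    sigma: float = 0.15, # assumed shared standard deviation
    alpha: float = 0.01, # target Type-I error (false accept)
    beta: float = 0.1,   # target Type-II error (false reject)
    max_frames: int = 10,
) -> Tuple[str, int]:
    """Accumulate log-likelihood ratios; return (decision, frames used)."""
    upper = math.log((1 - beta) / alpha)  # Wald threshold for accepting H1
    lower = math.log(beta / (1 - alpha))  # Wald threshold for rejecting H1
    llr = 0.0
    used = 0
    for s in similarities[:max_frames]:
        used += 1
        # Log-likelihood ratio of one Gaussian observation under H1 versus H0.
        llr += ((s - mu0) ** 2 - (s - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "accept", used
        if llr <= lower:
            return "reject", used
    # Truncation: with no decision inside the budget, reject (precision first).
    return "reject", used


print(truncated_sprt([0.7, 0.7, 0.7]))  # consistent high similarity
print(truncated_sprt([0.5] * 20))       # ambiguous evidence hits the frame budget
```

The number of frames consumed adapts to the evidence: clear cases terminate early, while ambiguous candidates in repetitive scenes run to the budget and are discarded.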
Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation
Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as the specific anomaly type (e.g., "hole", "cut", or "scratch"), which could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of "abnormal" with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
Representation Calibration and Uncertainty Guidance for Class-Incremental Learning based on Vision Language Model
Class-incremental learning requires a learning system to continually learn knowledge of new classes while preserving previously learned knowledge of old classes. However, current state-of-the-art methods based on Vision-Language Models (VLMs) still struggle to differentiate classes across learning tasks. Here, a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
UniPart: Part-Level 3D Generation with Unified 3D Geom-Seg Latents
Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.
FunPhase: A Periodic Functional Autoencoder for Motion Generation via Phase Manifolds
Learning natural body motion remains challenging due to the strong coupling between spatial geometry and temporal dynamics. Embedding motion in phase manifolds, latent spaces that capture local periodicity, has proven effective for motion prediction; however, existing approaches lack scalability and remain confined to specific settings. We introduce FunPhase, a functional periodic autoencoder that learns a phase manifold for motion and replaces discrete temporal decoding with a function-space formulation, enabling smooth trajectories that can be sampled at arbitrary temporal resolutions. FunPhase supports downstream tasks such as super-resolution and partial-body motion completion, generalizes across skeletons and datasets, and unifies motion prediction and generation within a single interpretable manifold. Our model achieves substantially lower reconstruction error than prior periodic autoencoder baselines while enabling a broader range of applications and performing on par with state-of-the-art motion generation methods.
InfoMotion: A Graph-Based Approach to Video Dataset Distillation for Echocardiography
Echocardiography plays a critical role in the diagnosis and monitoring of cardiovascular diseases as a non-invasive real-time assessment of cardiac structure and function. However, the growing scale of echocardiographic video data presents significant challenges in terms of storage, computation, and model training efficiency. Dataset distillation offers a promising solution by synthesizing a compact, informative subset of data that retains the key clinical features of the original dataset. In this work, we propose a novel approach for distilling a compact synthetic echocardiographic video dataset. Our method leverages motion feature extraction to capture temporal dynamics, followed by class-wise graph construction and representative sample selection using the Infomap algorithm. This enables us to select a diverse and informative subset of synthetic videos that preserves the essential characteristics of the original dataset. We evaluate our approach on the EchoNet-Dynamic dataset and achieve a test accuracy of 69.38% using only 25 synthetic videos. These results demonstrate the effectiveness and scalability of our method for medical video dataset distillation.
comment: Accepted at MICAD 2025
Label-free Motion-Conditioned Diffusion Model for Cardiac Ultrasound Synthesis
Ultrasound echocardiography is essential for the non-invasive, real-time assessment of cardiac function, but the scarcity of labelled data, driven by privacy restrictions and the complexity of expert annotation, remains a major obstacle for deep learning methods. We propose the Motion Conditioned Diffusion Model (MCDM), a label-free latent diffusion framework that synthesises realistic echocardiography videos conditioned on self-supervised motion features. To extract these features, we design the Motion and Appearance Feature Extractor (MAFE), which disentangles motion and appearance representations from videos. Feature learning is further enhanced by two auxiliary objectives: a re-identification loss guided by pseudo appearance features and an optical flow loss guided by pseudo flow fields. Evaluated on the EchoNet-Dynamic dataset, MCDM achieves competitive video generation performance, producing temporally coherent and clinically realistic sequences without reliance on manual labels. These results demonstrate the potential of self-supervised conditioning for scalable echocardiography synthesis. Our code is available at https://github.com/ZheLi2020/LabelfreeMCDM.
comment: Accepted at MICAD 2025
DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping
Video head swapping aims to replace the entire head of a video subject, including facial identity, head shape, and hairstyle, with that of a reference image, while preserving the target body, background, and motion dynamics. Due to the lack of ground-truth paired swapping data, prior methods typically train on cross-frame pairs of the same person within a video and rely on mask-based inpainting to mitigate identity leakage. Beyond potential boundary artifacts, this paradigm struggles to recover essential cues occluded by the mask, such as facial pose, expressions, and motion dynamics. To address these issues, we prompt a video editing model to synthesize new heads for existing videos as fake swapping inputs, while maintaining frame-synchronized facial poses and expressions. This yields HeadSwapBench, the first cross-identity paired dataset for video head swapping, which supports both training (\TrainNum{} videos) and benchmarking (\TestNum{} videos) with genuine outputs. Leveraging this paired supervision, we propose DirectSwap, a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model with a motion module and conditioning inputs. Furthermore, we introduce the Motion- and Expression-Aware Reconstruction (MEAR) loss, which reweights the diffusion loss per pixel using frame-difference magnitudes and facial-landmark proximity, thereby enhancing cross-frame coherence in motion and expressions. Extensive experiments demonstrate that DirectSwap achieves state-of-the-art visual quality, identity fidelity, and motion and expression consistency across diverse in-the-wild video scenes. We will release the source code and the HeadSwapBench dataset to facilitate future research.
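The per-pixel reweighting that the DirectSwap abstract attributes to the MEAR loss can be illustrated with a small sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: the additive form `1 + alpha * motion + beta * proximity`, the Gaussian falloff around landmarks, the parameter names `alpha`, `beta`, and `sigma`, and the use of plain squared error in place of the diffusion loss.

```python
import math
from typing import List, Sequence, Tuple

Frame = List[List[float]]  # grayscale frame stored as rows of pixel values


def mear_weights(prev: Frame, curr: Frame,
                 landmarks: Sequence[Tuple[int, int]],
                 alpha: float = 1.0, beta: float = 1.0,
                 sigma: float = 2.0) -> Frame:
    """Hypothetical per-pixel weights: 1 + motion term + landmark term."""
    height, width = len(curr), len(curr[0])
    # Frame-difference magnitude, normalised by its maximum over the frame.
    diff = [[abs(curr[y][x] - prev[y][x]) for x in range(width)]
            for y in range(height)]
    max_diff = max(max(row) for row in diff) or 1.0
    weights = []
    for y in range(height):
        row = []
        for x in range(width):
            motion = diff[y][x] / max_diff
            # Gaussian falloff with squared distance to the nearest landmark,
            # given as (row, column) pairs.
            if landmarks:
                d2 = min((y - ly) ** 2 + (x - lx) ** 2 for ly, lx in landmarks)
                proximity = math.exp(-d2 / (2.0 * sigma ** 2))
            else:
                proximity = 0.0
            row.append(1.0 + alpha * motion + beta * proximity)
        weights.append(row)
    return weights


def weighted_mse(pred: Frame, target: Frame, weights: Frame) -> float:
    """Weighted mean of squared errors, normalised by the total weight."""
    num = 0.0
    den = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            num += w * (p - t) ** 2
            den += w
    return num / den
```

In this sketch, pixels that change between frames or lie near a facial landmark contribute more to the loss, which is the qualitative behaviour the abstract describes.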
Generative Point Cloud Registration
In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometry-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.
comment: 14 pages, 9 figures
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and to leverage its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
comment: 13 pages, 6 figures
Wasserstein-Aligned Hyperbolic Multi-View Clustering
Multi-view clustering (MVC) aims to uncover the latent structure of multi-view data by learning view-common and view-specific information. Although recent studies have explored hyperbolic representations for better tackling the representation gap between different views, they focus primarily on instance-level alignment and neglect global semantic consistency, rendering them vulnerable to view-specific information (e.g., noise and cross-view discrepancies). To this end, this paper proposes a novel Wasserstein-Aligned Hyperbolic (WAH) framework for multi-view clustering. Specifically, our method exploits a view-specific hyperbolic encoder for each view to embed features into the Lorentz manifold for hierarchical semantic modeling. Subsequently, a global semantic loss based on the hyperbolic sliced-Wasserstein distance is introduced to align manifold distributions across views. This is followed by soft cluster assignments to encourage cross-view semantic consistency. Extensive experiments on multiple benchmark datasets show that our method can achieve SOTA clustering performance.
comment: 14 pages
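As background for the Lorentz-manifold embedding mentioned in the WAH abstract, the following minimal sketch shows the standard hyperboloid model with curvature -1: the Lorentzian inner product, the exponential map at the origin, and the geodesic distance. It is not the paper's encoder or its sliced-Wasserstein loss; the function names are chosen here for illustration.

```python
import math
from typing import List, Sequence


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(v: Sequence[float]) -> List[float]:
    """Map a Euclidean tangent vector at the origin onto the hyperboloid,
    i.e. the set of points x with <x, x>_L = -1 and x0 > 0."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * c for c in v]


def lorentz_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance arccosh(-<x, y>_L) on the hyperboloid."""
    # The clamp guards against arguments slightly below 1 due to rounding.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

A point produced by `exp_map_origin(v)` lies at hyperbolic distance equal to the Euclidean norm of `v` from the origin `[1, 0, ..., 0]`, which is one reason this map is a common way to lift Euclidean features onto the manifold.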
Detection and Localization of Subdural Hematoma Using Deep Learning on Computed Tomography
Background. Subdural hematoma (SDH) is a common neurosurgical emergency, with increasing incidence in aging populations. Rapid and accurate identification is essential to guide timely intervention, yet existing automated tools focus primarily on detection and provide limited interpretability or spatial localization. There remains a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making. Methods. We developed a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. Using 25,315 head CT studies from Hartford HealthCare (2015-2024), of which 3,774 (14.9%) contained clinician-confirmed SDH, tabular models were trained on demographics, comorbidities, medications, and laboratory results. Imaging models were trained to detect SDH and generate voxel-level probability maps. A greedy ensemble strategy combined complementary predictors. Findings. Clinical variables alone provided modest discriminatory power (AUC 0.75). Convolutional models trained on CT volumes and segmentation-derived maps achieved substantially higher accuracy (AUCs 0.922 and 0.926). The multimodal ensemble integrating all components achieved the best overall performance (AUC 0.9407; 95% CI, 0.930-0.951) and produced anatomically meaningful localization maps consistent with known SDH patterns. Interpretation. This multimodal, interpretable framework provides rapid and accurate SDH detection and localization, achieving high detection performance and offering transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management.
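The abstract above mentions a greedy ensemble strategy without specifying it. One common reading is forward selection with replacement over model scores, sketched below under that assumption. The function names, the toy scores, and the stopping rule are hypothetical, and in practice the selection would be run on a held-out validation set rather than the test data.

```python
from typing import Dict, List, Optional, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half. Quadratic in sample count; fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def greedy_ensemble(labels: Sequence[int],
                    preds: Dict[str, Sequence[float]],
                    max_rounds: int = 10) -> List[str]:
    """At each round, add the model whose inclusion gives the highest AUC of
    the averaged scores; stop when no candidate improves on the current AUC."""
    chosen: List[str] = []
    total = [0.0] * len(labels)
    best = 0.0
    for _ in range(max_rounds):
        round_best = best
        round_name: Optional[str] = None
        for name, p in preds.items():
            avg = [(t + v) / (len(chosen) + 1) for t, v in zip(total, p)]
            score = auc(labels, avg)
            if score > round_best:
                round_best, round_name = score, name
        if round_name is None:
            break
        chosen.append(round_name)
        total = [t + v for t, v in zip(total, preds[round_name])]
        best = round_best
    return chosen


# Toy example: two predictors that each misrank one pair but complement each other.
labels = [1, 1, 0, 0]
preds = {"a": [0.9, 0.4, 0.5, 0.1], "b": [0.4, 0.9, 0.1, 0.5]}
selected = greedy_ensemble(labels, preds)
```

The toy data is constructed so that each predictor alone misranks one positive-negative pair while their average ranks all pairs correctly, which is the sense in which complementary predictors help.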
Perception-Inspired Color Space Design for Photo White Balance Editing WACV 2026
White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene's true colors. Currently, sRGB-based WB editing for post-ISP WB correction is widely used to address color constancy failures in the ISP pipeline when the original camera RAW is unavailable. However, additive color models (e.g., sRGB) are inherently limited by fixed nonlinear transformations and entangled color channels, which often impede their generalization to complex lighting conditions. To address these challenges, we propose a novel framework for WB correction that leverages a perception-inspired Learnable HSI (LHSI) color space. Built upon a cylindrical color model that naturally separates luminance from chromatic components, our framework further introduces dedicated parameters to enhance this disentanglement and learnable mapping to adaptively refine the flexibility. Moreover, a new Mamba-based network is introduced, which is tailored to the characteristics of the proposed LHSI color space. Experimental results on benchmark datasets demonstrate the superiority of our method, highlighting the potential of perception-inspired color space design in computational photography. The source code is available at https://github.com/YangCheng58/WB_Color_Space.
comment: Accepted to WACV 2026
Rates and architectures for learning geometrically non-trivial operators
Deep learning methods have proven capable of recovering operators between high-dimensional spaces, such as solution maps of PDEs and similar objects in mathematical physics, from very few training samples. This phenomenon of data-efficiency has been proven for certain classes of elliptic operators with simple geometry, i.e., operators that do not change the domain of the function or propagate singularities. However, scientific machine learning is commonly used for problems that do involve the propagation of singularities in a priori unknown ways, such as waves, advection, and fluid dynamics. In light of this, we expand the learning theory to include double fibration transforms--geometric integral operators that include generalized Radon and geodesic ray transforms. We prove that this class of operators does not suffer from the curse of dimensionality: the error decays superalgebraically, that is, faster than any fixed power of the reciprocal of the number of training samples. Furthermore, we investigate architectures that explicitly encode the geometry of these transforms, demonstrating that an architecture reminiscent of cross-attention based on levelset methods yields a parameterization that is universal, stable, and learns double fibration transforms from very few training examples. Our results contribute to a rapidly-growing line of theoretical work on learning operators for scientific machine learning.
comment: 26 pages, 5 figures
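For reference, the superalgebraic decay described in the abstract above can be written as follows. The symbols are notation chosen here for illustration, not taken from the paper: \(\varepsilon(N)\) denotes the error after \(N\) training samples and \(C_k\) is a constant depending only on \(k\).

```latex
\[
  \forall k > 0 \;\; \exists\, C_k > 0 : \qquad
  \varepsilon(N) \;\le\; C_k \, N^{-k}
  \quad \text{for all } N \ge 1 .
\]
```

By contrast, an algebraic rate fixes a single exponent, \(\varepsilon(N) \le C\,N^{-s}\) for some \(s > 0\), so the bound above is stronger than any such rate.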
Log NeRF: Comparing Spaces for Learning Radiance Fields
Neural Radiance Fields (NeRF) have achieved remarkable results in novel view synthesis, typically using sRGB images for supervision. However, little attention has been paid to the color space in which the network is learning the radiance field representation. Inspired by the BiIlluminant Dichromatic Reflection (BIDR) model, which suggests that a logarithmic transformation simplifies the separation of illumination and reflectance, we hypothesize that log RGB space enables NeRF to learn a more compact and effective representation of scene appearance. To test this, we captured approximately 30 videos using a GoPro camera, ensuring linear data recovery through inverse encoding. We trained NeRF models under various color space interpretations (linear, sRGB, GPLog, and log RGB) by converting each network output to a common color space before rendering and loss computation, enforcing representation learning in different color spaces. Quantitative and qualitative evaluations demonstrate that using a log RGB color space consistently improves rendering quality, exhibits greater robustness across scenes, and performs particularly well in low-light conditions while using the same bit-depth input images. Further analysis across different network sizes and NeRF variants confirms the generalization and stability of the log space advantage.
comment: The 36th British Machine Vision Conference
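The color-space conversions that the Log NeRF abstract relies on can be sketched per channel as below. Two assumptions are made for illustration: the standard sRGB transfer function stands in for the camera's actual encoding (the abstract does not state which curve the GoPro footage uses), and the offset `eps` in the log mapping is a hypothetical choice to keep the logarithm finite at zero.

```python
import math


def srgb_to_linear(c: float) -> float:
    """Invert the standard sRGB transfer function for one channel in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_log(c: float, eps: float = 1e-4) -> float:
    """Map a linear intensity to log space; eps is an assumed offset."""
    return math.log(c + eps)


def log_to_linear(v: float, eps: float = 1e-4) -> float:
    """Inverse of linear_to_log, e.g. to bring a network output predicted in
    log space back to a common linear space before computing a loss."""
    return math.exp(v) - eps
```

Under this reading, a network that "learns in log RGB" would emit values like `linear_to_log(c)`, and `log_to_linear` would convert them to the shared space used for rendering and loss computation.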
FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.
comment: 13 pages, 6 figures
ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation AAAI 2026
Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to the scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed as ASSIST-3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST-3D features three key innovations, including 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
comment: Accepted by AAAI 2026
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
The rapid development of multimodal large-language models (MLLMs) has significantly expanded the scope of visual language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks: MSVD-QA, Activity Net-QA, Movie Chat, and Video MME demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.
TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment
Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
Development and Testing for Perception Based Autonomous Landing of a Long-Range QuadPlane
QuadPlanes combine the range efficiency of fixed-wing aircraft with the maneuverability of multi-rotor platforms for long-range autonomous missions. In GPS-denied or cluttered urban environments, perception-based landing is vital for reliable operation. Unlike structured landing zones, real-world sites are unstructured and highly variable, requiring strong generalization capabilities from the perception system. Deep neural networks (DNNs) provide a scalable solution for learning landing site features across diverse visual and environmental conditions. While perception-driven landing has been shown in simulation, real-world deployment introduces significant challenges. Payload and volume constraints limit high-performance edge AI devices like the NVIDIA Jetson Orin Nano, which are crucial for real-time detection and control. Accurate pose estimation during descent is necessary, especially in the absence of GPS, and relies on dependable visual-inertial odometry. Achieving this with limited edge AI resources requires careful optimization of the entire deployment framework. The flight characteristics of large QuadPlanes further complicate the problem: these aircraft exhibit high inertia, reduced thrust vectoring, and slow response times, all of which make stable landing maneuvers harder. This work presents a lightweight QuadPlane system for efficient vision-based autonomous landing and visual-inertial odometry, specifically developed for long-range QuadPlane operations such as aerial monitoring. It describes the hardware platform, sensor configuration, and embedded computing architecture designed to meet demanding real-time and physical constraints. This establishes a foundation for deploying autonomous landing in dynamic, unstructured, GPS-denied environments.
Visual Categorization Across Minds and Models: Cognitive Analysis of Human Labeling and Neuro-Symbolic Integration
Understanding how humans and AI systems interpret ambiguous visual stimuli offers critical insight into the nature of perception, reasoning, and decision-making. This paper examines image labeling performance across human participants and deep neural networks, focusing on low-resolution, perceptually degraded stimuli. Drawing from computational cognitive science, cognitive architectures, and connectionist-symbolic hybrid models, we contrast human strategies such as analogical reasoning, shape-based recognition, and confidence modulation with AI's feature-based processing. Grounded in Marr's tri-level hypothesis, Simon's bounded rationality, and Thagard's frameworks of representation and emotion, we analyze participant responses in relation to Grad-CAM visualizations of model attention. Human behavior is further interpreted through cognitive principles modeled in ACT-R and Soar, revealing layered and heuristic decision strategies under uncertainty. Our findings highlight key parallels and divergences between biological and artificial systems in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures that unify structured symbolic reasoning with connectionist representations. Such architectures, informed by principles of embodiment, explainability, and cognitive alignment, offer a path toward AI systems that are not only performant but also interpretable and cognitively grounded.
comment: 12 pages, 3 figures. Research manuscript based on the final project for CS6795 (Introduction to Cognitive Science), Georgia Tech
Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video ACM MM 2025
Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
comment: 8 pages, 9 figures, published in ACM MM 2025
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on the speaker's motion as an additional input to produce the listener's motion. This design is not end-to-end, which hinders real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook
Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses 10 representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released at https://github.com/myyy777/LNMBench to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications.
Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance
Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.
comment: 12 pages, 10 figures, IEEE Transaction on Image Processing
A Distributed Framework for Privacy-Enhanced Vision Transformers on the Edge
Nowadays, visual intelligence tools have become ubiquitous, offering all kinds of convenience and possibilities. However, these tools have high computational requirements that exceed the capabilities of resource-constrained mobile and wearable devices. While offloading visual data to the cloud is a common solution, it introduces significant privacy vulnerabilities during transmission and server-side computation. To address this, we propose a novel distributed, hierarchical offloading framework for Vision Transformers (ViTs) that addresses these privacy challenges by design. Our approach uses a local trusted edge device, such as a mobile phone or an Nvidia Jetson, as the edge orchestrator. This orchestrator partitions the user's visual data into smaller portions and distributes them across multiple independent cloud servers. By design, no single external server possesses the complete image, preventing comprehensive data reconstruction. The final data merging and aggregation computation occurs exclusively on the user's trusted edge device. We apply our framework to the Segment Anything Model (SAM) as a practical case study, which demonstrates that our method substantially enhances content privacy over traditional cloud-based approaches. Evaluations show our framework maintains near-baseline segmentation performance while substantially reducing the risk of content reconstruction and user data exposure. Our framework provides a scalable, privacy-preserving solution for vision tasks in the edge-cloud continuum.
comment: 16 pages, 7 figures. Published in the Proceedings of the Tenth ACM/IEEE Symposium on Edge Computing (SEC '25), Dec 3-6, 2025, Washington, D.C., USA
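The partition-distribute-merge idea in the abstract above can be illustrated with a toy sketch. The abstract does not describe the actual partitioning scheme, so the fixed-size tiling, the round-robin assignment, and all function names below are assumptions; the sketch also merges raw pixels, whereas the described framework merges the results that the servers compute.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]
Tile = Tuple[int, int, Image]  # (top row, left column, pixel block)


def split_into_tiles(image: Image, tile: int) -> List[Tile]:
    """Cut the image into tile x tile blocks, remembering each block's origin."""
    tiles: List[Tile] = []
    for top in range(0, len(image), tile):
        for left in range(0, len(image[0]), tile):
            block = [row[left:left + tile] for row in image[top:top + tile]]
            tiles.append((top, left, block))
    return tiles


def assign_round_robin(tiles: List[Tile], servers: int) -> Dict[int, List[Tile]]:
    """Give each server a disjoint share of tiles so none holds the full image."""
    shards: Dict[int, List[Tile]] = {s: [] for s in range(servers)}
    for i, t in enumerate(tiles):
        shards[i % servers].append(t)
    return shards


def merge_tiles(shards: Dict[int, List[Tile]], height: int, width: int) -> Image:
    """Reassemble on the trusted edge device using the stored tile origins."""
    out = [[0] * width for _ in range(height)]
    for tiles in shards.values():
        for top, left, block in tiles:
            for dy, row in enumerate(block):
                out[top + dy][left:left + len(row)] = row
    return out


# Toy 4x4 image split into 2x2 tiles across two hypothetical servers.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shards = assign_round_robin(split_into_tiles(image, 2), servers=2)
restored = merge_tiles(shards, 4, 4)
```

Here each of the two shards holds half of the pixels, and only the merge step, which stays on the edge device, sees the whole image.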
From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation
Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer, yet it remains challenging due to significant size, shape, and color variations, and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle to deal with the above issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets, namely Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms the respective baseline models, as well as the state-of-the-art model, with nearly 9 times lower computational overhead. The code is available at https://github.com/lostinrepo/PolypDiFoM.
VABench: A Comprehensive Benchmark for Audio-Video Generation
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
comment: 24 pages, 25 figures
Traffic Scene Small Target Detection Method Based on YOLOv8n-SPTS Model for Autonomous Driving SP
This paper focuses on a key issue in autonomous driving: small target recognition in dynamic perception. Existing algorithms suffer from poor detection performance due to missing small target information, scale imbalance, and occlusion. We propose an improved YOLOv8n-SPTS model, which enhances the detection accuracy of small traffic targets through three key innovations. First, we optimize the feature extraction module. In the Backbone Bottleneck structure of YOLOv8n, 4 traditional convolution modules are replaced with Space-to-Depth Convolution (SPD-Conv) modules. This module retains fine-grained information through space-to-depth conversion, reduces information loss, and enhances the ability to capture features of low-resolution small targets. Second, we enhance the feature fusion capability. The Spatial Pyramid Pooling - Fast Cross Stage Partial Connection (SPPFCSPC) module is introduced to replace the original SPPF module, integrating the multi-scale feature extraction from Spatial Pyramid Pooling (SPP) and the feature fusion mechanism of Cross Stage Partial Connection (CSP), thereby improving the model's contextual understanding of complex scenes and multi-scale feature expression ability. Third, we design a dedicated detection structure for small targets. A Triple-Stage Feature Pyramid (TSFP) structure is proposed, which adds a 160×160 small target detection head to the original detection heads to fully utilize high-resolution features in shallow layers; meanwhile, redundant large target detection heads are removed to balance computational efficiency. Comparative experiments on the VisDrone2019-DET dataset show that the YOLOv8n-SPTS model ranks first in precision (61.9%), recall (48.3%), mAP@0.5 (52.6%), and mAP@0.5:0.95 (32.6%). Visualization results verify that the miss rate of small targets such as pedestrians and bicycles in occluded and dense scenes is significantly reduced.
comment: 6 pages, 7 figures, 1 table. Accepted to The 2025 IEEE 3rd International Conference on Electrical, Automation and Computer Engineering (ICEACE), 2025. Code available at https://github.com/SonghanWu/yolov8n-SPTS
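The space-to-depth conversion mentioned in the YOLOv8n-SPTS abstract can be illustrated with a small pure-Python sketch. It shows only the rearrangement step (an H x W x C map becomes H/2 x W/2 x 4C, so no pixel is discarded); the channel ordering used here is an assumption, and the non-strided convolution that would follow in an SPD-Conv module is omitted.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [row][col][channel]


def space_to_depth(x: Image, scale: int = 2) -> Image:
    """Move each scale x scale spatial block into the channel axis.

    Downsamples spatially without dropping values, unlike strided
    convolution or pooling.
    """
    h, w = len(x), len(x[0])
    if h % scale or w % scale:
        raise ValueError("height and width must be divisible by scale")
    out = []
    for r in range(0, h, scale):
        out_row = []
        for c in range(0, w, scale):
            stacked: List[float] = []
            # Assumed ordering: row-major within the block.
            for dr in range(scale):
                for dc in range(scale):
                    stacked.extend(x[r + dr][c + dc])
            out_row.append(stacked)
        out.append(out_row)
    return out


# A 4 x 4 single-channel map becomes a 2 x 2 map with 4 channels.
x = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
y = space_to_depth(x, scale=2)
```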
MelanomaNet: Explainable Deep Learning for Skin Lesion Classification
Automated skin lesion classification using deep learning has shown remarkable accuracy, yet clinical adoption remains limited due to the "black box" nature of these models. We present MelanomaNet, an explainable deep learning system for multi-class skin lesion classification that addresses this gap through four complementary interpretability mechanisms. Our approach combines an EfficientNet V2 backbone with GradCAM++ attention visualization, automated ABCDE clinical criterion extraction, Fast Concept Activation Vectors (FastCAV) for concept-based explanations, and Monte Carlo Dropout uncertainty quantification. We evaluate our system on the ISIC 2019 dataset containing 25,331 dermoscopic images across 9 diagnostic categories. Our model achieves 85.61% accuracy with a weighted F1 score of 0.8564, while providing clinically meaningful explanations that align model attention with established dermatological assessment criteria. The uncertainty quantification module decomposes prediction confidence into epistemic and aleatoric components, enabling automatic flagging of unreliable predictions for clinical review. Our results demonstrate that high classification performance can be achieved alongside comprehensive interpretability, potentially facilitating greater trust and adoption in clinical dermatology workflows. The source code is available at https://github.com/suxrobgm/explainable-melanoma
comment: 7 pages, 3 figures
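The MelanomaNet abstract does not say how the Monte Carlo Dropout output is split into epistemic and aleatoric parts. One widely used entropy-based decomposition, shown here purely as an assumed illustration, treats the entropy of the averaged prediction as total uncertainty, the average per-pass entropy as the aleatoric part, and their difference (the mutual information) as the epistemic part. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose_uncertainty(samples: List[List[float]]) -> Tuple[float, float, float]:
    """samples[t] is the softmax output of the t-th stochastic forward pass.

    Returns (total, aleatoric, epistemic) under the assumed entropy-based split.
    """
    n = len(samples)
    n_classes = len(samples[0])
    mean_p = [sum(s[k] for s in samples) / n for k in range(n_classes)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(s) for s in samples) / n
    epistemic = total - aleatoric
    return total, aleatoric, epistemic


# Passes agree on an ambiguous prediction: uncertainty is all aleatoric.
t1, a1, e1 = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Passes are individually confident but contradict each other: all epistemic.
t2, a2, e2 = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

A flagging rule for clinical review could then threshold the epistemic term, though the threshold itself would have to be tuned on validation data.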
FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model
Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.
comment: Project page: https://lowlevelcv.com/
LoGoColor: Local-Global 3D Colorization for 360° Scenes
Single-channel 3D reconstruction is widely used in fields such as robotics and medical imaging. While this line of work excels at reconstructing 3D geometry, the outputs are not colored 3D models, so 3D colorization is required for visualization. Recent 3D colorization studies address this problem by distilling 2D image colorization models. However, these approaches suffer from an inherent inconsistency of 2D image models. This results in colors being averaged during training, leading to monotonous and oversimplified results, particularly in complex 360° scenes. In contrast, we aim to preserve color diversity by generating a new set of consistently colorized training views, thereby bypassing the averaging process. Nevertheless, eliminating the averaging process introduces a new challenge: ensuring strict multi-view consistency across these colorized views. To achieve this, we propose LoGoColor, a pipeline designed to preserve color diversity by eliminating this guidance-averaging process with a 'Local-Global' approach: we partition the scene into subscenes and explicitly tackle both inter-subscene and intra-subscene consistency using a fine-tuned multi-view diffusion model. We demonstrate that our method achieves quantitatively and qualitatively more consistent and plausible 3D colorization on complex 360° scenes than existing methods, and validate its superior color diversity using a novel Color Diversity Index.
Dynamic Facial Expressions Analysis Based Parkinson's Disease Auxiliary Diagnosis
Parkinson's disease (PD), a prevalent neurodegenerative disorder, significantly affects patients' daily functioning and social interactions. To facilitate a more efficient and accessible diagnostic approach for PD, we propose a dynamic facial expression analysis-based PD auxiliary diagnosis method. This method targets hypomimia, a characteristic clinical symptom of PD, by analyzing two manifestations: reduced facial expressivity and facial rigidity, thereby facilitating the diagnosis process. We develop a multimodal facial expression analysis network to extract expression intensity features during patients' performance of various facial expressions. This network leverages the CLIP architecture to integrate visual and textual features while preserving the temporal dynamics of facial expressions. Subsequently, the expression intensity features are processed and input into an LSTM-based classification network for PD diagnosis. Our method achieves an accuracy of 93.1%, outperforming other in-vitro PD diagnostic approaches. This technique offers a more convenient detection method for potential PD patients, improving their diagnostic experience.
LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations AAAI 2026
The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate the image-text alignment in long prompt scenarios. However, the existing T2I alignment benchmarks predominantly focus on short prompt scenarios and only provide MOS or Likert scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released at https://welldky.github.io/LongT2IBench-Homepage/.
comment: The paper has been accepted by AAAI 2026
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling dynamic videos that contain long-range motion, where a naive extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time indices (key-frame anchors, KfA) and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between KfAs and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts. We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfAs while maintaining rendering quality, based on an assigned level of feature variance. To effectively evaluate our model's capability to handle real-world long-range 4D motion, we compose a new long-range 4D motion dataset, called SelfCap$_{\text{LR}}$. Compared to previous dynamic video datasets, it has a larger average dynamic motion magnitude and is captured in spatially wider spaces. Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations.
comment: Please visit our project page at https://cmlab-korea.github.io/MoRel/
ROI-Packing: Efficient Region-Based Compression for Machine Vision
This paper introduces ROI-Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end-task accuracy and packing them efficiently while discarding less relevant data, ROI-Packing achieves significant compression efficiency without requiring retraining or fine-tuning of end-task models. Comprehensive evaluations across five datasets and two popular tasks (object detection and instance segmentation) demonstrate up to a 44.10% reduction in bitrate without compromising end-task accuracy, along with an 8.88% improvement in accuracy at the same bitrate compared to the state-of-the-art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).
GLACIA: Instance-Aware Positional Reasoning for Glacial Lake Segmentation via Multimodal Large Language Model
Glacial lake monitoring bears great significance in mitigating the anticipated risk of Glacial Lake Outburst Floods. However, existing segmentation methods based on convolutional neural networks (CNNs) and Vision Transformers (ViTs) remain constrained to pixel-level predictions, lacking high-level global scene semantics and human-interpretable reasoning. To address this, we introduce GLACIA (Glacial Lake segmentation with Contextual Instance Awareness), the first framework that integrates large language models with segmentation capabilities to produce both accurate segmentation masks and corresponding spatial reasoning outputs. We construct the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which provides diverse, spatially grounded question-answer pairs designed to overcome the lack of instance-aware positional reasoning data in remote sensing. Comparative evaluations demonstrate that GLACIA (mIoU: 87.30) surpasses state-of-the-art methods based on CNNs (mIoU: 78.55 - 79.01), ViTs (mIoU: 69.27 - 81.75), Geo-foundation models (mIoU: 76.37 - 87.10), and reasoning-based segmentation methods (mIoU: 60.12 - 75.66). Our approach enables intuitive disaster preparedness and informed policy-making in the context of rapidly changing glacial environments by facilitating natural language interaction, thereby supporting more efficient and interpretable decision-making. The code is released at https://github.com/lalitmaurya47/GLACIA
OmniPSD: Layered PSD Generation with Diffusion Transformer
Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.
A Clinically Interpretable Deep CNN Framework for Early Chronic Kidney Disease Prediction Using Grad-CAM-Based Explainable AI
Chronic Kidney Disease (CKD) constitutes a major global medical burden, marked by the gradual deterioration of renal function, which results in the impaired clearance of metabolic waste and disturbances in systemic fluid homeostasis. Owing to its substantial contribution to worldwide morbidity and mortality, the development of reliable and efficient diagnostic approaches is critically important to facilitate early detection and prompt clinical management. This study presents a deep convolutional neural network (CNN) for early CKD detection from CT kidney images, complemented by class balancing using Synthetic Minority Over-sampling Technique (SMOTE) and interpretability via Gradient-weighted Class Activation Mapping (Grad-CAM). The model was trained and evaluated on the CT KIDNEY DATASET, which contains 12,446 CT images, including 3,709 cyst, 5,077 normal, 1,377 stone, and 2,283 tumor cases. The proposed deep CNN achieved a remarkable classification performance, attaining 100% accuracy in the early detection of chronic kidney disease (CKD). This significant advancement demonstrates strong potential for addressing critical clinical diagnostic challenges and enhancing early medical intervention strategies.
Efficient Feature Compression for Machines with Global Statistics Preservation
The split-inference paradigm divides an artificial intelligence (AI) model into two parts. This necessitates the transfer of intermediate feature data between the two halves, so effective compression of the feature data becomes vital. In this paper, we employ Z-score normalization to efficiently recover the compressed feature data at the decoder side. To examine its efficacy, the proposed method is integrated into the latest Feature Coding for Machines (FCM) codec standard under development by the Moving Picture Experts Group (MPEG). Our method replaces the existing scaling method used by the current standard under development. It both reduces the overhead bits and improves the end-task accuracy. To further reduce the overhead in certain circumstances, we also propose a simplified method. Experiments show that our proposed method achieves a 17.09% reduction in bitrate on average across different tasks, and up to 65.69% for object tracking, without sacrificing task accuracy.
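One way to read the Z-score idea in the abstract above is that the encoder transmits the global mean and standard deviation of the original feature tensor, and the decoder standardizes the lossy reconstruction and rescales it to those statistics. The sketch below follows that reading; it is an assumption about the mechanism rather than the FCM codec's actual procedure, and `encoder_stats` and `restore` are hypothetical names.

```python
import statistics
from typing import List, Tuple


def encoder_stats(feature: List[float]) -> Tuple[float, float]:
    """Global statistics sent as side information (two numbers per tensor here)."""
    return statistics.fmean(feature), statistics.pstdev(feature)


def restore(decoded: List[float], mean: float, std: float) -> List[float]:
    """Z-score the lossy decoded feature, then map it to the original statistics."""
    d_mean = statistics.fmean(decoded)
    d_std = statistics.pstdev(decoded)
    if d_std == 0.0:
        # A constant tensor carries no shape information; fall back to the mean.
        return [mean for _ in decoded]
    return [(v - d_mean) / d_std * std + mean for v in decoded]


original = [0.0, 2.0, 4.0, 6.0]
# Stand-in for codec distortion: the decoded tensor drifted in offset and scale.
decoded = [0.5 * v + 1.0 for v in original]
mean, std = encoder_stats(original)
restored = restore(decoded, mean, std)
```

In this toy case the distortion is purely affine, so the restoration is exact; real codec noise would only be corrected in its first two moments.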
Semantic Data Augmentation Enhanced Invariant Risk Minimization for Medical Image Domain Generalization
Deep learning has achieved remarkable success in medical image classification. However, its clinical application is often hindered by data heterogeneity caused by variations in scanner vendors, imaging protocols, and operators. Approaches such as invariant risk minimization (IRM) aim to address this challenge of out-of-distribution generalization. For instance, VIRM improves upon IRM by tackling the issue of insufficient feature support overlap, demonstrating promising potential. Nonetheless, these methods face limitations in medical imaging due to the scarcity of annotated data and the inefficiency of augmentation strategies. To address these issues, we propose a novel domain-oriented direction selector to replace the random augmentation strategy used in VIRM. Our method uses inter-domain covariance as a guide for the augmentation direction, steering data augmentation towards the target domain. This approach effectively reduces domain discrepancies and enhances generalization performance. Experiments on a multi-center diabetic retinopathy dataset demonstrate that our method outperforms state-of-the-art approaches, particularly under limited data conditions and significant domain heterogeneity.
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Recent advances in interactive video generation have demonstrated the potential of diffusion models as world models by capturing complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on the fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) a scalable data production pipeline for Unreal Engine and GTA5 environments to effectively produce massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) an action injection module that enables frame-level mouse and keyboard inputs as interactive conditions; (3) a few-step distillation based on the causal architecture for real-time and streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.
comment: Project Page: https://matrix-game-v2.github.io
Seedream 4.0: Toward Next-generation Multimodal Image Generation
We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also considerably reduce the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multimodal post-training that jointly trains T2I and image editing tasks. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and it also supports multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. We further scale our model and data as Seedream 4.5. Seedream 4.0 and Seedream 4.5 are accessible on Volcano Engine https://www.volcengine.com/experience/ark?launch=seedream.
comment: Seedream 4.0/4.5 Technical Report
OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring
The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community's ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.
AugLift: Uncertainty Aware Depth Descriptors for Robust 2D to 3D Pose Lifting
Lifting-based 3D human pose estimators infer 3D joints from 2D keypoints, but often struggle to generalize to real-world settings with noisy 2D detections. We revisit the input to lifting and propose AugLift, a simple augmentation of standard lifting that enriches each 2D keypoint (x, y) with an Uncertainty Aware Depth Descriptor (UADD). We run a single off-the-shelf monocular depth estimator to obtain a depth map, and for every keypoint with detector confidence c we extract depth statistics from its confidence-scaled neighborhood, forming a compact, interpretable UADD (c, d, d_min, d_max) that captures both local geometry and reliability. AugLift is modular, requires no new sensors or architectural changes, and integrates by expanding the input layer of existing lifting models. Across four datasets and four lifting architectures, AugLift boosts cross-dataset (out-of-distribution) performance on unseen data by an average of 10.1 percent, while also improving in-distribution performance by 4.0 percent as measured by MPJPE. A post hoc analysis clarifies when and why it helps: gains are largest on novel poses and significantly occluded joints, where depth statistics resolve front-back ambiguities while confidence calibrates the spatial neighborhoods from which they are drawn. We also study interaction with recent image-feature lifting methods and find the signals are complementary: adding UADD to image-conditioned lifting yields both ID and OOD gains. A learned depth feature extension (AugLiftV2) improves performance further while trading off interpretability. Together, these results indicate that lightweight, confidence-aware depth cues are a powerful plug-in for robust 2D-to-3D pose lifting.
comment: Preprint. Under review
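The UADD tuple (c, d, d_min, d_max) from the AugLift abstract can be sketched as below. The abstract does not give the exact neighborhood rule, so this illustration assumes a square window whose radius grows as detector confidence drops, and takes d as the depth at the keypoint pixel; `uadd` and `max_radius` are hypothetical names.

```python
from typing import List, Tuple


def uadd(depth: List[List[float]], x: int, y: int, c: float,
         max_radius: int = 2) -> Tuple[float, float, float, float]:
    """Return (c, d, d_min, d_max) for one keypoint at pixel (x, y).

    Assumption: radius = round(max_radius * (1 - c)), so a fully confident
    keypoint reads a single pixel and an uncertain one reads a wider window.
    """
    h, w = len(depth), len(depth[0])
    radius = round(max_radius * (1.0 - c))
    rows = range(max(0, y - radius), min(h, y + radius + 1))
    cols = range(max(0, x - radius), min(w, x + radius + 1))
    window = [depth[r][q] for r in rows for q in cols]
    return c, depth[y][x], min(window), max(window)


# Toy 5 x 5 depth map standing in for a monocular depth estimate.
depth = [[float(r * 5 + q) for q in range(5)] for r in range(5)]
confident = uadd(depth, x=2, y=2, c=1.0)
uncertain = uadd(depth, x=2, y=2, c=0.5)
```

A lifting model would then take six values per joint, (x, y, c, d, d_min, d_max), instead of two, which is the input-layer expansion the abstract refers to.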
Human Motion Unlearning
We introduce Human Motion Unlearning and motivate it through the concrete task of preventing violent 3D motion synthesis, an important safety requirement given that popular text-to-motion datasets (HumanML3D and Motion-X) contain from 7% to 15% violent sequences spanning both atomic gestures (e.g., a single punch) and highly compositional actions (e.g., loading and swinging a leg to kick). By focusing on violence unlearning, we demonstrate how removing a challenging, multifaceted concept can serve as a proxy for the broader capability of motion "forgetting." To enable systematic evaluation of Human Motion Unlearning, we establish the first motion unlearning benchmark by automatically filtering HumanML3D and Motion-X datasets to create distinct forget sets (violent motions) and retain sets (safe motions). We introduce evaluation metrics tailored to sequential unlearning, measuring both suppression efficacy and the preservation of realism and smooth transitions. We adapt two state-of-the-art, training-free image unlearning methods (UCE and RECE) to leading text-to-motion architectures (MoMask and BAMM), and propose Latent Code Replacement (LCR), a novel, training-free approach that identifies violent codes in a discrete codebook representation and substitutes them with safe alternatives. Our experiments show that unlearning violent motions is indeed feasible and that acting on latent codes strikes the best trade-off between violence suppression and preserving overall motion quality. This work establishes a foundation for advancing safe motion synthesis across diverse applications. Website: https://www.pinlab.org/hmu.
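The code-substitution idea behind Latent Code Replacement can be sketched as a lookup over discrete token indices. The abstract does not describe how violent codes are identified or how replacements are chosen, so this example assumes a given set of flagged indices and picks the nearest unflagged codebook entry by squared Euclidean distance; both choices and all function names are hypothetical.

```python
from typing import Dict, List, Set


def build_replacement_table(codebook: List[List[float]],
                            flagged: Set[int]) -> Dict[int, int]:
    """Map each flagged code to its nearest unflagged neighbor (assumed rule)."""
    safe = [i for i in range(len(codebook)) if i not in flagged]

    def sq_dist(i: int, j: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(codebook[i], codebook[j]))

    return {f: min(safe, key=lambda i: sq_dist(i, f)) for f in flagged}


def replace_codes(tokens: List[int], table: Dict[int, int]) -> List[int]:
    """Swap flagged tokens in a generated sequence; leave the rest untouched."""
    return [table.get(t, t) for t in tokens]


# Toy 2-D codebook; index 1 plays the role of a flagged ("violent") code.
codebook = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.9, 0.2]]
table = build_replacement_table(codebook, flagged={1})
cleaned = replace_codes([0, 1, 2, 1], table)
```

Because only the token sequence is edited, no model weights change, which matches the training-free framing in the abstract.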
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better match the target data distribution. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as Spectrally Anisotropic Gaussian Diffusion (SAGD). We derive the score relation for anisotropic forward covariances and show that, under full support, the learned score converges to the true data score as $t \to 0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.
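To make the "frequency-diagonal covariance" in the abstract above concrete, the following worked equations show what the forward process and its conditional score look like under an assumed DDPM-style variance-preserving parameterization. The symbols $\bar\alpha_t$, $F$ (the unitary discrete Fourier transform), and $w$ (positive per-frequency weights, symmetric so that the noise is real-valued) are illustrative assumptions, not notation taken from the paper.

```latex
% Forward process with frequency-diagonal noise covariance (assumed form)
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \Sigma),
\qquad \Sigma = F^{\mathsf{H}} \operatorname{diag}(w)\, F .

% The transition kernel stays Gaussian
q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t)\, \Sigma\right).

% Conditional score: the isotropic case is recovered when w = 1 (Sigma = I)
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{\Sigma^{-1}\left(x_t - \sqrt{\bar\alpha_t}\, x_0\right)}{1 - \bar\alpha_t}
  = -\frac{\Sigma^{-1} \epsilon}{\sqrt{1 - \bar\alpha_t}},
\qquad \Sigma^{-1} = F^{\mathsf{H}} \operatorname{diag}(1/w)\, F .
```

Under this reading, a band-pass mask or a power-law weighting is just a particular choice of $w$, and the requirement $w > 0$ corresponds to the full-support condition the abstract mentions.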
RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
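Step (2) of the pipeline above, comparing candidate objects with the query and keeping the most similar ones, can be sketched as a cosine-similarity ranking over region features. This is an illustrative sketch only: the function names, the plain-list feature format, and the `top_k` parameter are assumptions, and the abstract does not state which similarity measure RELOCATE uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(query, regions_per_frame, top_k=1):
    """Rank every region in every frame by similarity to the query feature.

    regions_per_frame[f][r] is the (hypothetical) feature vector of region r
    in frame f. Returns the top_k (score, frame_index, region_index) tuples.
    """
    scored = []
    for frame_idx, regions in enumerate(regions_per_frame):
        for region_idx, feature in enumerate(regions):
            scored.append((cosine_similarity(query, feature), frame_idx, region_idx))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

The selected (frame, region) pairs would then seed the refinement and bidirectional tracking stages, which are not modeled here.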
Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation
Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
comment: This work has been submitted to the IEEE for possible publication
THCRL: Trusted Hierarchical Contrastive Representation Learning for Multi-View Clustering
Multi-View Clustering (MVC) has garnered increasing attention in recent years. It is capable of partitioning data samples into distinct groups by learning a consensus representation. However, a significant challenge remains: the problem of untrustworthy fusion. This problem primarily arises from two key factors: 1) existing methods often ignore the presence of inherent noise within individual views; 2) in traditional MVC methods using Contrastive Learning (CL), similarity computations typically rely on different views of the same instance, while neglecting the structural information from nearest neighbors within the same cluster. Consequently, the fusion of multiple views is misdirected. To address this problem, we present a novel Trusted Hierarchical Contrastive Representation Learning (THCRL). It consists of two key modules. Specifically, we propose the Deep Symmetry Hierarchical Fusion (DSHF) module, which leverages the UNet architecture integrated with multiple denoising mechanisms to achieve trustworthy fusion of multi-view data. Furthermore, we present the Average K-Nearest Neighbors Contrastive Learning (AKCL) module to align the fused representation with the view-specific representation. Unlike conventional strategies, AKCL enhances representation similarity among samples belonging to the same cluster, rather than merely focusing on the same sample across views, thereby reinforcing the confidence of the fused representation. Extensive experiments demonstrate that THCRL achieves state-of-the-art performance in deep MVC tasks.
Foveation Improves Payload Capacity in Steganography SIGGRAPH
Steganography is used in visual media for purposes such as embedding metadata and watermarking. With the support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography.
comment: SIGGRAPH Asia 2025 Posters Proceedings
Tokenizing Motion: A Generative Approach for Scene Dynamics Compression
This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion performance and outperform the state-of-the-art conventional video codec Enhanced Compression Model (ECM) on scene dynamics sequences. The project page can be found at https://github.com/xyzysz/GNVDC.
comment: 5 pages, 5 figures
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
comment: Project Page: https://d4rt-paper.github.io/
MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment AAAI 2026
Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works focus only on single-source audio inputs for image generation, ignoring the multi-source characteristic of natural auditory scenes, thus limiting the performance in generating comprehensive visual content. To bridge this gap, we propose a method called MACS to conduct multi-source audio-to-image generation. To the best of our knowledge, this is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by casting into a common space using the large pre-trained CLAP model. We introduce a ranking loss to consider the contextual significance of the separated audio signals. In the second stage, effective image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and an MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. The experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods in 17 out of the 21 evaluation indexes on all tasks and delivers superior visual quality.
comment: Accepted at AAAI 2026. Code available at https://github.com/alxzzhou/MACS
Exploring possible vector systems for faster training of neural networks with preconfigured latent spaces
The overall neural network (NN) performance is closely related to the properties of its embedding distribution in latent space (LS). It has recently been shown that predefined vector systems, specifically $A_n$ root system vectors, can be used as targets for latent space configurations (LSC) to ensure the desired LS structure. One of the main advantages of LSC is the possibility of training classifier NNs without classification layers, which facilitates training NNs on datasets with extremely large numbers of classes. This paper provides a more general overview of possible vector systems for NN training along with their properties and methods for vector system construction. These systems are used to configure the LS of encoders and visual transformers to significantly speed up LSC training on ImageNet-1K and on datasets with 50k-600k classes. It is also shown that using the minimum number of LS dimensions for a specific number of classes results in faster convergence. The latter has potential advantages for reducing the size of vector databases used to store NN embeddings.
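As a small illustration of one such vector system, the roots of $A_n$ are the n(n+1) vectors e_i - e_j (i ≠ j) in R^(n+1); they all have squared length 2 and lie in the hyperplane where the coordinates sum to zero. The sketch below builds them and, as an assumption about how fixed targets could stand in for a classification layer, assigns an embedding to the target with the largest dot product. The function names and the assignment rule are illustrative and are not taken from the paper.

```python
from itertools import permutations


def a_n_roots(n):
    """Return the n*(n+1) roots e_i - e_j (i != j) of A_n as vectors in R^(n+1)."""
    dim = n + 1
    roots = []
    for i, j in permutations(range(dim), 2):
        vector = [0] * dim
        vector[i] = 1
        vector[j] = -1
        roots.append(vector)
    return roots


def nearest_target(embedding, targets):
    """Index of the target with the largest dot product with the embedding.

    All A_n roots have the same norm, so this also picks the closest target
    in Euclidean distance. Hypothetical helper for illustration only.
    """
    scores = [sum(e * t for e, t in zip(embedding, target)) for target in targets]
    return max(range(len(targets)), key=scores.__getitem__)


if __name__ == "__main__":
    targets = a_n_roots(3)
    print(len(targets))
```

With one target per class, the number of classes that fit in a given latent dimension is fixed by the size of the vector system, which is one way to read the abstract's remark about choosing the minimum number of LS dimensions.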
comment: 9 pages, 5 figures, 1 table, 4 equations
MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.
GeoDM: Geometry-aware Distribution Matching for Dataset Distillation
Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, so they capture only linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should align the distilled data manifold with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called GeoDM, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for the three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that geometry-aware distribution matching in a product space yields a smaller generalization error bound than its Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.
CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above the prior state of the art across both validation and test partitions. Extensive experiments demonstrate a strong positive correlation among the CoT trajectory, the generated heatmap, and the decoded mask, supporting an interpretable alignment between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and in more precise mask prediction. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.
comment: 20 pages, 8 figures, 7 tables
Multi-Scale Direction-Aware Network for Infrared Small Target Detection
Infrared small target detection is hampered by the difficulty of effectively separating the target from the background. Existing deep learning-based methods focus on edge and shape features, but ignore the richer structural differences and detailed information embedded in high-frequency components from different directions, thereby failing to fully exploit the value of high-frequency directional features in target perception. To address this limitation, we propose a multi-scale direction-aware network (MSDA-Net), which is the first attempt to integrate the high-frequency directional features of infrared small targets as domain prior knowledge into neural networks. Specifically, to fully mine the high-frequency directional features, on the one hand, a high-frequency direction injection (HFDI) module without trainable parameters is constructed to inject the high-frequency directional information of the original image into the network. On the other hand, a multi-scale direction-aware (MSDA) module is constructed, which promotes the full extraction of local relations at different scales and the full perception of key features in different directions. In addition, considering the characteristics of infrared small targets, we construct a feature aggregation (FA) structure to address target disappearance in high-level feature maps, and a feature calibration fusion (FCF) module to alleviate feature bias during cross-layer feature fusion. Extensive experimental results show that our MSDA-Net achieves state-of-the-art (SOTA) results on multiple public datasets. The code is available at https://github.com/YuChuang1205/MSDA-Net
comment: Accepted by TGRS 2025
Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
comment: Published in IEEE Open Journal of Vehicular Technology. Final version available at: https://ieeexplore.ieee.org/document/11230063
Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification
Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism with a dual exponential moving average (EMA) module for robust pseudo-label generation. We introduce a theoretically grounded, training-dynamics-based indicator to adaptively trigger GC, which ensures GC's effectiveness by preventing it from being affected by model underfitting or overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings. The code and data will be released at https://github.com/rslab-unitrento/AdaGC.
comment: 14 pages, 7 figures; revised version
Learning to Infer Parameterized Representations of Plants from 3D Scans
Plants frequently contain numerous organs, organized in 3D branching systems defining the plant's architecture. Reconstructing the architecture of plants from unstructured observations is challenging because of self-occlusion and spatial proximity between organs, which are often thin structures. To address this task, we propose an approach that infers a parameterized representation of the plant's architecture from a given 3D scan of a plant. In addition to the plant's branching structure, this representation contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using a procedural model. After training, the network infers a parametric tree-like representation from an input 3D point cloud. Our method is applicable to any plant that can be represented as a binary axial tree. We quantitatively evaluate our approach on Chenopodium album plants on reconstruction, segmentation and skeletonization, which are important problems in plant phenotyping. In addition to carrying out several tasks at once, our method achieves results on par with strong baselines for each task. We apply our method, trained exclusively on synthetic data, to 3D scans and show that it generalizes well.
WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion
Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a weakly-supervised generative network for daily 10 m LST estimation via spatio-temporal fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.05% and improves SSIM by 4.22%. Furthermore, WGAST effectively captures fine-scale thermal patterns, as validated against near-surface air temperature measurements from 33 near-ground sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.
Towards Robust Infrared Small Target Detection: A Feature-Enhanced and Sensitivity-Tunable Framework
Recently, single-frame infrared small target (SIRST) detection technology has attracted widespread attention. Unlike most existing deep learning-based methods, which focus on improving network architectures, we propose a feature-enhanced and sensitivity-tunable (FEST) framework, which is compatible with existing SIRST detection networks and further enhances their detection performance. The FEST framework improves the model's robustness from two aspects: feature enhancement and target confidence regulation. For feature enhancement, we employ a multi-scale fusion strategy to improve the model's perception of multi-scale features of multi-size targets, and design an edge enhancement difficulty mining (EEDM) loss to guide the network to continuously focus on challenging target regions and edge features during training. For target confidence regulation, an adjustable sensitivity (AS) strategy is proposed for network post-processing. This strategy enhances the model's adaptability in complex scenarios and significantly improves the detection rate of infrared small targets while maintaining segmentation accuracy. Extensive experimental results show that our FEST framework can effectively enhance the performance of existing SIRST detection networks. The code is available at https://github.com/YuChuang1205/FEST-Framework
comment: Accepted by Knowledge-Based Systems 2025
CoD: A Diffusion Foundation Model for Image Compression
Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address this, we introduce CoD, the first Compression-oriented Diffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs. It offers several advantages: (1) high compression efficiency: replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); (2) low-cost and reproducible training: 300× faster training than Stable Diffusion (~20 vs. ~6,250 A100 GPU days) on entirely open image-only datasets; (3) new insights: e.g., we find that pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Code will be released.
GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation AAAI'26
Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
comment: Accepted at AAAI'26
Bring Your Dreams to Life: Continual Text-to-Video Customization AAAI2026
Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Besides, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG baselines on both the DreamVideo and Wan 2.1 backbones. The code is available at https://github.com/JiahuaDong/CCVD.
comment: Accepted to AAAI2026
Entropy-Informed Weighting Channel Normalizing Flow for Deep Generative Models
Normalizing Flows (NFs) are widely used in deep generative models for their exact likelihood estimation and efficient sampling. However, they require substantial memory since the latent space matches the input dimension. Multi-scale architectures address this by progressively reducing latent dimensions while preserving reversibility. Existing multi-scale architectures use simple, static channel-wise splitting, limiting expressiveness. To improve this, we introduce a regularized, feature-dependent $\mathtt{Shuffle}$ operation and integrate it into the vanilla multi-scale architecture. This operation adaptively generates channel-wise weights and shuffles latent variables before splitting them. We observe that this operation guides the variables to evolve in the direction of entropy increase; hence we refer to NFs with the $\mathtt{Shuffle}$ operation as Entropy-Informed Weighting Channel Normalizing Flow (EIW-Flow). Extensive experiments on CIFAR-10, CelebA, ImageNet, and LSUN demonstrate that EIW-Flow achieves state-of-the-art density estimation and competitive sample quality for deep generative modeling, with minimal computational overhead.
Two Causal Principles for Improving Visual Dialog CVPR 2020
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to state-of-the-art performance on the leaderboard. Such a major improvement is only due to our careful inspection of the causality behind the model and data, finding that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests: we should remove the direct input of the dialog history to the answer model, otherwise a harmful shortcut bias will be introduced; Principle 2 says: there is an unobserved confounder for history, question, and answer, leading to spurious correlations from training data. In particular, to remove the confounder suggested in Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from the traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable to any VisDial model. The code is available at https://github.com/simpleshinobu/visdial-principles.
comment: Accepted by CVPR 2020
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving
Diffusion-based planners have shown great promise for autonomous driving due to their ability to capture multi-modal driving behaviors. However, guiding these models effectively in reactive, closed-loop environments remains a significant challenge. Simple conditioning often fails to provide sufficient guidance in complex and dynamic driving scenarios. Recent work attempts to use typical expert driving behaviors (i.e., anchors) to guide diffusion models but relies on a truncated schedule, which introduces theoretical inconsistencies and can compromise performance. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach provides a principled diffusion framework that effectively translates anchors into fine-grained trajectory plans, appropriately responding to varying traffic conditions. Our planner is compatible with efficient ODE solvers, a critical factor for real-time autonomous driving deployment. We achieve state-of-the-art performance on the Bench2Drive benchmark, improving the success rate by 7.72% over prior art.
comment: 19 pages, 7 figures, 9 tables
Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs
Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.
Spatial Polarization Multiplexing: Single-Shot Invisible Shape and Reflectance Recovery
We propose spatial polarization multiplexing (SPM) for joint sensing of shape and reflectance of a static or dynamic deformable object, which is also invisible to the naked eye. Past structured-light methods are limited to shape acquisition and cannot recover reflectance as they alter scene appearance. Our key idea is to spatially multiplex a polarization pattern to encode the incident ray and also densely sample the reflected light. We derive a quantized polarized light pattern that can be robustly and uniquely decoded from the reflected Angle of Linear Polarization (AoLP) values. It also enables single-shot disentanglement of polarimetric diffuse and specular reflections for accurate BRDF estimation. We achieve this spatial polarization multiplexing (SPM) with a constrained de Bruijn sequence. We validate this novel invisible single-shot shape and reflectance method with real static and dynamic objects. The results demonstrate the effectiveness of SPM for accurate shape and BRDF measurement which opens new avenues of application for 3D sensing thanks to its invisibility and ability to jointly recover the radiometric properties.
Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance
Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions NeurIPS
In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
comment: Conference on Neural Information Processing Systems (NeurIPS) 2025 (Spotlight)
More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery
The recent SAM 3 and SAM 3D have introduced significant advancements over the predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3D's depth reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while the zero-shot evaluations of SAM 3D on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
comment: Technical Report
RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation
The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning using a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
WeatherDiffusion: Controllable Weather Editing in Intrinsic Space
We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.
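The fine-grained weather control described in the WeatherDiffusion abstract relies on interpolating between prompt embeddings. The following is a minimal sketch of that idea under stated assumptions: it uses plain linear interpolation followed by re-normalisation, the abstract does not say whether the authors interpolate linearly or spherically, and the names `normalize` and `interpolate_prompts` as well as the toy two-dimensional vectors are hypothetical stand-ins for real CLIP text embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def interpolate_prompts(emb_a, emb_b, t):
    """Blend two prompt embeddings.

    t = 0 gives the direction of emb_a, t = 1 the direction of emb_b.
    Linear interpolation is an assumption made for this sketch; it breaks
    down if the two embeddings are exactly opposite, since the blend at
    t = 0.5 would then be the zero vector.
    """
    mixed = [(1.0 - t) * a + t * b for a, b in zip(emb_a, emb_b)]
    return normalize(mixed)


# Toy stand-ins for the embeddings of two weather prompts.
sunny = [1.0, 0.0]
rainy = [0.0, 1.0]
halfway = interpolate_prompts(sunny, rainy, 0.5)
```

Sweeping `t` from 0 to 1 would yield a sequence of conditioning vectors that move gradually from one weather description to the other.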
TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose Real-Time Accuracy (RTA) to jointly capture correctness and responsiveness under tight decision windows, and Memory Persistence Time (MPT) as a forward-looking metric for long-term retention in continuous streams. In this work, we report RTA results for current models and release TeleEgo, together with an MPT evaluation framework, as a realistic and extensible benchmark for future egocentric assistants with stronger streaming memory, enabling systematic study of both real-time behavior and long-horizon memory.
PlayerOne: Egocentric World Simulator
We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the user's real-scene human motion captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Moreover, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its strong generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.
comment: Project page: https://playerone-hku.github.io/
DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing ICCV 2025
Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models. Our code and dataset are available at https://github.com/GrokCV/GrokCSO.
comment: Accepted by ICCV 2025. This updated version fixed the bug in SSIM, while the conclusion remains the same
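For readers unfamiliar with the classical algorithm that DISTA-Net builds on, the following is a minimal sketch of plain iterative shrinkage-thresholding (ISTA) for sparse reconstruction. It is not the authors' network: according to the abstract, DISTA-Net generates its convolution weights and thresholds dynamically, whereas this sketch assumes a fixed matrix `A`, a fixed step size, and a fixed threshold. The function names `soft_threshold` and `ista` and the toy data are illustrative.

```python
def soft_threshold(x, theta):
    """Shrink x toward zero by theta (proximal operator of theta * |x|)."""
    if x > theta:
        return x - theta
    if x < -theta:
        return x + theta
    return 0.0


def ista(A, y, lam, step, n_iter=200):
    """Minimise 0.5 * ||A x - y||^2 + lam * ||x||_1 with a fixed step size.

    A is a list of rows and y a list of measurements. For convergence,
    `step` should not exceed 1 / L, where L is the largest eigenvalue
    of A^T A.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iter):
        # Residual r = A x - y
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the data term: g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by element-wise shrinkage
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x


# Toy problem: with an identity matrix the solution is the
# soft-thresholded measurement vector.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.2, -1.5]
x_hat = ista(A, y, lam=0.5, step=1.0)
```

In the toy problem the small middle measurement is suppressed to zero while the two larger ones are shrunk by the threshold, which is the sparsifying behaviour that unrolled networks of this family learn to adapt.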
ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation
Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from view biases inherent in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.
comment: 13 pages, 14 figures, 3 tables
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
Generalised Medical Phrase Grounding
Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence--anatomy box alignment datasets and fine-tuning on report sentence--human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
comment: 10 pages
ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution
Real world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.
Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning AAAI2026
The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
comment: 14 pages, 8 figures, Association for the Advancement of Artificial Intelligence (AAAI2026, poster)
Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning
Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
comment: 14 pages, 8 figures, 5 tables
Beyond the Failures: Rethinking Foundation Models in Pathology
Despite their successes in vision and language, foundation models have stumbled in pathology, revealing low accuracy, instability, and heavy computational demands. These shortcomings stem not from tuning problems but from deeper conceptual mismatches: dense embeddings cannot represent the combinatorial richness of tissue, and current architectures inherit flaws in self-supervision, patch design, and noise-fragile pretraining. Biological complexity and limited domain innovation further widen the gap. The evidence is clear: pathology requires models explicitly designed for biological images rather than adaptations of large-scale natural-image methods whose assumptions do not hold for tissue.
Adversarially Pretrained Transformers May Be Universally Robust In-Context Learners
Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models -- models that can robustly adapt to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can robustly generalize to unseen classification tasks through in-context learning from clean demonstrations (i.e., without requiring additional adversarial training or examples). This universal robustness stems from the model's ability to adaptively focus on robust features within given tasks. We also identify two open challenges for attaining robustness: the accuracy--robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models. While their training is expensive, the investment would prove worthwhile as downstream tasks can enjoy free adversarial robustness. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.
Financial Fraud Identification and Interpretability Study for Listed Companies Based on Convolutional Neural Network
Since the emergence of joint-stock companies, financial fraud by listed firms has repeatedly undermined capital markets. Fraud is difficult to detect because of covert tactics and the high labor and time costs of audits. Traditional statistical models are interpretable but struggle with nonlinear feature interactions, while machine learning models are powerful but often opaque. In addition, most existing methods judge fraud only for the current year based on current-year data, limiting timeliness. This paper proposes a financial fraud detection framework for Chinese A-share listed companies based on convolutional neural networks (CNNs). We design a feature engineering scheme that transforms firm-year panel data into image-like representations, enabling the CNN to capture cross-sectional and temporal patterns and to predict fraud in advance. Experiments show that the CNN outperforms logistic regression and LightGBM in accuracy, robustness, and early-warning performance, and that proper tuning of the classification threshold is crucial in high-risk settings. To address interpretability, we analyze the model along the dimensions of entity, feature, and time using local explanation techniques. We find that solvency, ratio structure, governance structure, and internal control are general predictors of fraud, while environmental indicators matter mainly in high-pollution industries. Non-fraud firms share stable feature patterns, whereas fraud firms exhibit heterogeneous patterns concentrated in short time windows. A case study of Guanong Shares in 2022 shows that cash flow analysis, social responsibility, governance structure, and per-share indicators are the main drivers of the model's fraud prediction, consistent with the company's documented misconduct.
comment: in Chinese language
C-DIRA: Computationally Efficient Dynamic ROI Routing and Domain-Invariant Adversarial Learning for Lightweight Driver Behavior Recognition
Driver distraction behavior recognition using in-vehicle cameras demands real-time inference on edge devices. However, lightweight models often fail to capture fine-grained behavioral cues, resulting in reduced performance on unseen drivers or under varying conditions. ROI-based methods also increase computational cost, making it difficult to balance efficiency and accuracy. This work addresses the need for a lightweight architecture that overcomes these constraints. We propose Computationally efficient Dynamic region of Interest Routing and domain-invariant Adversarial learning for lightweight driver behavior recognition (C-DIRA). The framework combines saliency-driven Top-K ROI pooling and fused classification for local feature extraction and integration. Dynamic ROI routing enables selective computation by applying ROI inference only to high-difficulty samples. Moreover, pseudo-domain labeling and adversarial learning are used to learn domain-invariant features robust to driver and background variation. Experiments on the State Farm Distracted Driver Detection Dataset show that C-DIRA maintains high accuracy with significantly fewer FLOPs and lower latency than prior lightweight models. It also demonstrates robustness under visual degradation such as blur and low light, and stable performance across unseen domains. These results confirm C-DIRA's effectiveness in achieving compactness, efficiency, and generalization.
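The selective computation behind dynamic ROI routing in the C-DIRA abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: difficulty is measured as one minus the top class probability of a cheap global branch, the threshold of 0.3 is arbitrary, fusion is a simple average, and the names `route_sample` and `roi_branch` are hypothetical.

```python
def route_sample(global_probs, roi_branch, threshold=0.3):
    """Run the costly ROI branch only when the global prediction is uncertain.

    global_probs: class probabilities from a lightweight global branch.
    roi_branch:   zero-argument callable returning ROI-branch probabilities;
                  it is only invoked for samples judged difficult.
    Returns (probabilities, used_roi).
    """
    # Assumed difficulty score: low top-class confidence means a hard sample.
    difficulty = 1.0 - max(global_probs)
    if difficulty <= threshold:
        return list(global_probs), False
    roi_probs = roi_branch()
    # Assumed fusion rule: average the two branches.
    fused = [(g + r) / 2.0 for g, r in zip(global_probs, roi_probs)]
    return fused, True


calls = []


def fake_roi_branch():
    """Stand-in for ROI inference that records how often it is invoked."""
    calls.append(1)
    return [0.2, 0.8]


easy_probs, easy_used = route_sample([0.9, 0.1], fake_roi_branch)
hard_probs, hard_used = route_sample([0.55, 0.45], fake_roi_branch)
```

In this toy run the confident sample skips the ROI branch entirely, so its extra cost is paid only for the uncertain one.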
VFM-ISRefiner: Towards Better Adapting Vision Foundation Models for Interactive Segmentation of Remote Sensing Images
Interactive image segmentation (IIS) plays a critical role in generating precise annotations for remote sensing imagery, where objects often exhibit scale variations, irregular boundaries and complex backgrounds. However, existing IIS methods, primarily designed for natural images, struggle to generalize to remote sensing domains due to limited annotated data and computational overhead. To address these challenges, we propose RS-ISRefiner, a novel click-based IIS framework tailored for remote sensing images. The framework employs an adapter-based tuning strategy that preserves the general representations of Vision Foundation Models while enabling efficient learning of remote sensing-specific spatial and boundary characteristics. A hybrid attention mechanism integrating convolutional local modeling with Transformer-based global reasoning enhances robustness against scale diversity and scene complexity. Furthermore, an improved probability map modulation scheme effectively incorporates historical user interactions, yielding more stable iterative refinement and higher boundary accuracy. Comprehensive experiments on six remote sensing datasets, including iSAID, ISPRS Potsdam, SandBar, NWPU, LoveDA Urban and WHUBuilding, demonstrate that RS-ISRefiner consistently outperforms state-of-the-art IIS methods in terms of segmentation accuracy, efficiency and interaction cost. These results confirm the effectiveness and generalizability of our framework, making it highly suitable for high-quality instance segmentation in practical remote sensing scenarios. The codes are available at https://github.com/wondelyan/VFM-ISRefiner .
TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models ICCV2025
Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that couples the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
comment: ICCV2025,code:https://github.com/Sssssuperior/TAViS
AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation AAAI2026
Reference Audio-Visual Segmentation (Ref-AVS) tasks challenge models to precisely locate sounding objects by integrating visual, auditory, and textual cues. Existing methods often lack genuine semantic understanding, tending to memorize fixed reasoning patterns. Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. We employ a structured Chain-of-Thought (CoT) prompting mechanism to guide the model through a step-by-step reasoning process and introduce a novel segmentation feature distillation loss to effectively integrate these reasoning abilities without sacrificing segmentation performance. To further cultivate the model's genuine reasoning capabilities, we devise a two-stage training strategy: first, a "corrective reflective-style training" stage utilizes self-correction to enhance the quality of reasoning paths, followed by reinforcement learning via Group Reward Policy Optimization (GRPO) to bolster robustness in challenging scenarios. Experiments demonstrate that AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
comment: AAAI2026,code:https://github.com/Sssssuperior/AURORA
Image and Video Processing
PathCo-LatticE: Pathology-Constrained Lattice-Of Experts Framework for Fully-supervised Few-Shot Cardiac MRI Segmentation
Few-shot learning (FSL) mitigates data scarcity in cardiac MRI segmentation but typically relies on semi-supervised techniques sensitive to domain shifts and validation bias, restricting zero-shot generalizability. We propose PathCo-LatticE, a fully supervised FSL framework that replaces unlabeled data with pathology-guided synthetic supervision. First, our Virtual Patient Engine models continuous latent disease trajectories from sparse clinical anchors, using generative modeling to synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) provides a leakage-free protocol that evaluates models online with progressively challenging synthetic samples, eliminating the need for real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks within a pathology-aware topology and activates the most relevant experts per input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning. We evaluated PathCo-LatticE in a strict out-of-distribution (OOD) setting, deriving all anchors and severity statistics from a single-source domain (ACDC) and performing zero-shot testing on the multi-center, multi-vendor M&Ms dataset. PathCo-LatticE outperforms four state-of-the-art FSL methods by 4.2-11% Dice starting from only 7 labeled anchors, and approaches fully supervised performance (within 1% Dice) with only 19 labeled anchors. The method shows superior harmonization across four vendors and generalization to unseen pathologies. [Code will be made publicly available].
LiM-YOLO: Less is More with Pyramid Level Shift and Normalized Auxiliary Branch for Ship Detection in Optical Remote Sensing Imagery
Applying general-purpose object detectors to ship detection in satellite imagery presents significant challenges due to the extreme scale disparity and morphological anisotropy of maritime targets. Standard architectures utilizing stride-32 (P5) layers often fail to resolve narrow vessels, resulting in spatial feature dilution. In this work, we propose LiM-YOLO, a specialized detector designed to resolve these domain-specific conflicts. Based on a statistical analysis of ship scales, we introduce a Pyramid Level Shift Strategy that reconfigures the detection head to P2-P4. This shift ensures compliance with Nyquist sampling criteria for small objects while eliminating the computational redundancy of deep layers. To further enhance training stability on high-resolution inputs, we incorporate a Group Normalized Convolutional Block for Linear Projection (GN-CBLinear), which mitigates gradient volatility in micro-batch settings. Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO demonstrates superior detection accuracy and efficiency compared to state-of-the-art models. The code is available at https://github.com/egshkim/LiM-YOLO.
comment: 16 pages, 8 figures, 9 tables
SynthPix: A lightspeed PIV images generator
We describe SynthPix, a synthetic image generator for Particle Image Velocimetry (PIV) with a focus on performance and parallelism on accelerators, implemented in JAX. SynthPix supports the same configuration parameters as existing tools but achieves a throughput several orders of magnitude higher in image-pair generation per second. SynthPix was developed to enable the training of data-hungry reinforcement learning methods for flow estimation and for reducing the iteration times during the development of fast flow estimation methods used in recent active fluids control studies with real-time PIV feedback. We believe SynthPix to be useful for the fluid dynamics community, and in this paper we describe the main ideas behind this software package.
comment: Code: https://github.com/antonioterpin/synthpix
QSMnet-INR: Single-Orientation Quantitative Susceptibility Mapping via Implicit Neural Representation in k-Space
Quantitative Susceptibility Mapping (QSM) quantifies tissue magnetic susceptibility from magnetic-resonance phase data and plays a crucial role in brain microstructure imaging, iron-deposition assessment, and neurological-disease research. However, single-orientation QSM inversion remains highly ill-posed because the dipole kernel exhibits a cone-null region in the Fourier domain, leading to streaking artifacts and structural loss. To overcome this limitation, we propose QSMnet-INR, a deep, physics-informed framework that integrates an Implicit Neural Representation (INR) into the k-space domain. The INR module continuously models multi-directional dipole responses and explicitly completes the cone-null region, while a frequency-domain residual-weighted Dipole Loss enforces physical consistency. The overall network combines a 3D U-Net-based QSMnet backbone with the INR module through alternating optimization for end-to-end joint training. Experiments on the 2016 QSM Reconstruction Challenge, a multi-orientation GRE dataset, and both in-house and public single-orientation clinical data demonstrate that QSMnet-INR consistently outperforms conventional and recent deep-learning approaches across multiple quantitative metrics. The proposed framework shows notable advantages in structural recovery within cone-null regions and in artifact suppression. Ablation studies further confirm the complementary contributions of the INR module and Dipole Loss to detail preservation and physical stability. Overall, QSMnet-INR effectively alleviates the ill-posedness of single-orientation QSM without requiring multi-orientation acquisition, achieving high accuracy, robustness, and strong cross-scenario generalization, highlighting its potential for clinical translation.
comment: 14 pages, 12 figures; submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Rates and architectures for learning geometrically non-trivial operators
Deep learning methods have proven capable of recovering operators between high-dimensional spaces, such as solution maps of PDEs and similar objects in mathematical physics, from very few training samples. This phenomenon of data-efficiency has been proven for certain classes of elliptic operators with simple geometry, i.e., operators that do not change the domain of the function or propagate singularities. However, scientific machine learning is commonly used for problems that do involve the propagation of singularities in a priori unknown ways, such as waves, advection, and fluid dynamics. In light of this, we expand the learning theory to include double fibration transforms--geometric integral operators that include generalized Radon and geodesic ray transforms. We prove that this class of operators does not suffer from the curse of dimensionality: the error decays superalgebraically, that is, faster than any fixed power of the reciprocal of the number of training samples. Furthermore, we investigate architectures that explicitly encode the geometry of these transforms, demonstrating that an architecture reminiscent of cross-attention based on levelset methods yields a parameterization that is universal, stable, and learns double fibration transforms from very few training examples. Our results contribute to a rapidly-growing line of theoretical work on learning operators for scientific machine learning.
comment: 26 pages, 5 figures
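One common way to write down the superalgebraic decay mentioned in the abstract above is sketched here. The symbols $\varepsilon(N)$ (the learning error after $N$ training samples) and $C_k$ (a constant depending only on $k$) are notation assumed for illustration; the paper's exact error measure and constants are not given in this listing.

```latex
\[
  \forall k \in \mathbb{N} \;\; \exists\, C_k > 0 : \qquad
  \varepsilon(N) \;\le\; C_k \, N^{-k}
  \quad \text{for all } N \ge 1 .
\]
```

In words: whichever fixed power $N^{-k}$ is chosen, the error eventually decays at least that fast, which is the sense in which the rate is "faster than any fixed power of the reciprocal of the number of training samples".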
NOC4SC: A Bandwidth-Efficient Multi-User Semantic Communication Framework for Interference-Resilient Transmission
With the explosive growth of connected devices and emerging applications, current wireless networks are encountering unprecedented demands for massive user access, where inter-user interference has become a critical challenge to maintaining high quality of service (QoS) in multi-user communication systems. To tackle this issue, we propose a bandwidth-efficient semantic communication paradigm termed Non-Orthogonal Codewords for Semantic Communication (NOC4SC), which enables simultaneous same-frequency transmission without spectrum spreading. By leveraging the Swin Transformer, the proposed NOC4SC framework enables each user to independently extract semantic features through a unified encoder-decoder architecture with shared network parameters across all users, which ensures that the user's data remains protected from unauthorized decoding. Furthermore, we introduce an adaptive NOC and SNR Modulation (NSM) block, which employs deep learning to dynamically regulate SNR and generate approximately orthogonal semantic features within distinct feature subspaces, thereby effectively mitigating inter-user interference. Extensive experiments demonstrate that the proposed NOC4SC achieves performance comparable to DeepJSCC-PNOMA and outperforms other multi-user SemCom baseline methods.
SITP: A High-Reliability Semantic Information Transport Protocol Without Retransmission for Semantic Communication
With the evolution of 6G networks, modern communication systems are facing unprecedented demands for high reliability and low latency. However, conventional transport protocols are designed for bit-level reliability, failing to meet the semantic robustness requirements. To address this limitation, this paper proposes a novel Semantic Information Transport Protocol (SITP), which achieves TCP-level reliability and UDP-level latency by verifying only packet headers while retaining potentially corrupted payloads for semantic decoding. Building upon SITP, a cross-layer analytical model is established to quantify packet-loss probability across the physical, data-link, network, transport, and application layers. The model provides a unified probabilistic formulation linking signal-to-noise ratio (SNR) and packet-loss rate, offering a theoretical foundation for end-to-end semantic transmission. Furthermore, a cross-image feature interleaving mechanism is developed to mitigate consecutive burst losses by redistributing semantic features across multiple correlated images, thereby enhancing robustness in burst-fade channels. Extensive experiments show that SITP offers lower latency than TCP with comparable reliability at low SNRs, while matching UDP-level latency and delivering superior reconstruction quality. In addition, the proposed cross-image semantic interleaving mechanism further demonstrates its effectiveness in mitigating degradation caused by bursty packet losses.
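The header-only verification idea in the SITP abstract can be illustrated with a minimal sketch. The packet layout (sequence number, payload length, CRC32 over those two fields) and the function names `build_packet` and `receive` are hypothetical, chosen only for illustration; the abstract does not specify the actual SITP header format or checksum.

```python
import struct
import zlib

# Hypothetical header layout (not from the paper):
# 4-byte sequence number, 2-byte payload length, 4-byte CRC32 of those 6 bytes.
FIELDS_FMT = "!IH"
FIELDS_SIZE = struct.calcsize(FIELDS_FMT)   # 6 bytes
HEADER_SIZE = FIELDS_SIZE + 4               # 10 bytes including the CRC


def build_packet(seq: int, payload: bytes) -> bytes:
    """Prefix the payload with a header whose checksum covers the header only."""
    fields = struct.pack(FIELDS_FMT, seq, len(payload))
    crc = zlib.crc32(fields)
    return fields + struct.pack("!I", crc) + payload


def receive(packet: bytes):
    """Return (seq, payload) when the header is intact, otherwise None.

    The payload is deliberately not checked, so a corrupted payload is
    still handed on (in SITP's setting, to a semantic decoder).
    """
    if len(packet) < HEADER_SIZE:
        return None
    fields = packet[:FIELDS_SIZE]
    (crc,) = struct.unpack("!I", packet[FIELDS_SIZE:HEADER_SIZE])
    if zlib.crc32(fields) != crc:
        return None  # header damaged: the packet cannot be trusted at all
    seq, length = struct.unpack(FIELDS_FMT, fields)
    return seq, packet[HEADER_SIZE:HEADER_SIZE + length]
```

A packet whose payload bytes were flipped in transit is still accepted by `receive`, while a packet with a damaged header is dropped; this is the trade-off the abstract describes, shown here without any of the protocol's real machinery.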
Tokenizing Motion: A Generative Approach for Scene Dynamics Compression
This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion performance and outperform the state-of-the-art conventional video codec Enhanced Compression Model (ECM) on scene-dynamics sequences. The project page can be found at https://github.com/xyzysz/GNVDC.
comment: 5 pages, 5 figures
CardioLive: Empowering Video Streaming with Online Cardiac Monitoring
Online Cardiac Monitoring (OCM) emerges as a compelling enhancement for next-generation video streaming platforms. It enables various applications including remote health, online affective computing, and deepfake detection. Yet the physiological information encapsulated in video streams has long been neglected. In this paper, we present the design and implementation of CardioLive, the first online cardiac monitoring system in video streaming platforms. We leverage the naturally co-existing video and audio streams and devise CardioNet, the first audio-visual network to learn the cardiac series. It incorporates multiple unique designs to extract temporal and spectral features, ensuring robust performance under realistic video streaming conditions. To enable Service-On-Demand online cardiac monitoring, we implement CardioLive as a plug-and-play middleware service and develop systematic solutions to practical issues including changing FPS and unsynchronized streams. Extensive experiments demonstrate the effectiveness of our system. We achieve a Mean Absolute Error (MAE) of 1.79 BPM, outperforming the video-only and audio-only solutions by 69.2% and 81.2%, respectively. Our CardioLive service achieves average throughputs of 115.97 and 98.16 FPS when implemented in Zoom and YouTube. We believe our work opens up new applications for video stream systems. We will release the code soon.
comment: Preprint
Stronger is not better: Better Augmentations in Contrastive Learning for Medical Image Segmentation NeurIPS
Self-supervised contrastive learning is among the recent representation learning methods that have shown performance gains in several downstream tasks including semantic segmentation. This paper evaluates strong data augmentation, one of the most important components for self-supervised contrastive learning's improved performance. Strong data augmentation involves applying the composition of multiple augmentation techniques on images. Surprisingly, we find that the existing data augmentations do not always improve performance for semantic segmentation for medical images. We experiment with other augmentations that provide improved performance.
comment: NeurIPS Black in AI workshop - 2022
Multimedia
ChronusOmni: Improving Time Awareness of Omni Large Language Models
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
comment: Code available at https://github.com/YJCX330/Chronus/
Composing Concepts from Images and Videos via Concept-prompt Binding
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
comment: Project page: https://refkxh.github.io/BiCo_Webpage/
Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video ACM MM 2025
Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
comment: 8 pages, 9 figures, published in ACM MM 2025
CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above the prior state of the art across both validation and test partitions. Extensive experiments demonstrate a strong positive correlation among the CoT trajectory, the generated heatmap, and the decoded mask, supporting an interpretable alignment between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and in more precise mask prediction. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.
comment: 20 pages, 8 figures, 7 tables
Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation AAAI 2026
Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers' emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.
comment: Accepted to AAAI 2026
Computation and Language
Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach
Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
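The "linear combination of LoRA modules" in the abstract above can be sketched as a weighted sum of low-rank updates added to a frozen base matrix. This is a minimal illustration in plain Python: the function names `matmul` and `combine_lora`, the toy matrices, and the mixing weights are all assumptions made here, and the paper's calibration procedure for the weights is not described in this listing.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def combine_lora(base, modules, weights):
    """Return base + sum_i weights[i] * (B_i @ A_i).

    base    : d_out x d_in frozen weight matrix
    modules : list of (B, A) pairs, B is d_out x r and A is r x d_in
    weights : one user-chosen mixing coefficient per module
    """
    merged = [row[:] for row in base]          # leave the base weights untouched
    for (b_mat, a_mat), w in zip(modules, weights):
        delta = matmul(b_mat, a_mat)           # rank-r update of one adapter
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


# Toy example: two rank-1 adapters (say, one per domain) on a 2x2 layer.
base = [[1.0, 0.0], [0.0, 1.0]]
adapter_a = ([[1.0], [0.0]], [[0.0, 2.0]])     # B @ A = [[0, 2], [0, 0]]
adapter_b = ([[0.0], [1.0]], [[4.0, 0.0]])     # B @ A = [[0, 0], [4, 0]]
merged = combine_lora(base, [adapter_a, adapter_b], [0.5, 0.25])
```

Because only the mixing coefficients change between requests, a user can shift the blend of domains or styles without retraining, which is the gate-free behaviour the abstract refers to.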
SCOPE: Language Models as One-Time Teacher for Hierarchical Planning in Text Environments
Long-term planning in complex, text-based environments presents significant challenges due to open-ended action spaces, ambiguous observations, and sparse feedback. Recent research suggests that large language models (LLMs) encode rich semantic knowledge about the world, which can be valuable for guiding agents in high-level reasoning and planning across both embodied and purely textual settings. However, existing approaches often depend heavily on querying LLMs during training and inference, making them computationally expensive and difficult to deploy efficiently. In addition, these methods typically employ a pretrained, unaltered LLM whose parameters remain fixed throughout training, providing no opportunity for adaptation to the target task. To address these limitations, we introduce SCOPE (Subgoal-COnditioned Pretraining for Efficient planning), a one-shot hierarchical planner that leverages LLM-generated subgoals only at initialization to pretrain a lightweight student model. Unlike prior approaches that distill LLM knowledge by repeatedly prompting the model to adaptively generate subgoals during training, our method derives subgoals directly from example trajectories. This design removes the need for repeated LLM queries, significantly improving efficiency, though at the cost of reduced explainability and potentially suboptimal subgoals. Despite their suboptimality, our results on the TextCraft environment show that LLM-generated subgoals can still serve as a strong starting point for hierarchical goal decomposition in text-based planning tasks. Compared to the LLM-based hierarchical agent ADaPT (Prasad et al., 2024), which achieves a 0.52 success rate, our method reaches 0.56 and reduces inference time from 164.4 seconds to just 3.0 seconds.
MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI
Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
comment: Dataset and Code: https://github.com/fengli-wu/MedForget
Mitigating Social Bias in English and Urdu Language Models Using PRM-Guided Candidate Selection and Sequential Refinement
Large language models (LLMs) increasingly mediate human communication, decision support, content creation, and information retrieval. Despite impressive fluency, these systems frequently produce biased or stereotypical content, especially when prompted with socially sensitive language. A growing body of research has demonstrated that such biases disproportionately affect low-resource languages, where training data is limited and culturally unrepresentative. This paper presents a comprehensive study of inference-time bias mitigation, a strategy that avoids retraining or fine-tuning and instead operates directly on model outputs. Building on preference-ranking models (PRMs), we introduce a unified evaluation framework comparing three methods: (1) baseline single-word generation, (2) PRM-Select best-of-N sampling, and (3) PRM-Sequential refinement guided by PRM critiques. We evaluate these techniques across 200 English prompts and their Urdu counterparts, designed to reflect socio-cultural contexts relevant to gender, ethnicity, religion, nationality, disability, profession, age, and socioeconomic categories. Using GPT-3.5 as a candidate generator and GPT-4o-mini as a PRM-based bias and utility scorer, we provide an extensive quantitative analysis of bias reduction, utility preservation, and cross-lingual disparities. Our findings show: (a) substantial gains over the baseline for both languages; (b) consistently lower fairness scores for Urdu across all methods, highlighting structural inequities in multilingual LLM training; and (c) distinct improvement trajectories between PRM-Select and PRM-Sequential. The study contributes an extensible methodology, interpretable metrics, and cross-lingual comparisons that can support future work on fairness evaluation in low-resource languages.
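The best-of-N selection step that the abstract calls PRM-Select can be sketched as follows. The generator and scorer here are toy stand-ins written for illustration, not the GPT-3.5 and GPT-4o-mini calls used in the study, and the names `prm_select`, `toy_generate`, and `toy_score` are hypothetical.

```python
from typing import Callable


def prm_select(prompt: str,
               generate: Callable[[str, int], str],
               score: Callable[[str, str], float],
               n: int = 4) -> str:
    """Draw n candidates and return the one the scorer rates highest."""
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (assumptions): a fixed candidate pool and a scorer that
# penalizes overgeneralizing words instead of calling a real model.
POOL = [
    "They always fail",
    "They have varied skills",
    "They never listen",
    "They work in many fields",
]


def toy_generate(prompt: str, i: int) -> str:
    return POOL[i % len(POOL)]


def toy_score(prompt: str, candidate: str) -> float:
    flagged = {"always", "never"}
    return -float(sum(word in flagged for word in candidate.lower().split()))


best = prm_select("Describe the group.", toy_generate, toy_score, n=4)
```

The sequential variant described in the abstract would instead feed the scorer's critique back to the generator and revise a single answer over several rounds; that loop is not shown here.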
ChronusOmni: Improving Time Awareness of Omni Large Language Models
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on audiovisual temporal grounding tasks. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement, and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
comment: Code available at https://github.com/YJCX330/Chronus/
LLMs in Interpreting Legal Documents
This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.
OnCoCo 1.0: A Public Dataset for Fine-Grained Message Classification in Online Counseling Conversations LREC 2026
This paper presents OnCoCo 1.0, a new public dataset for fine-grained message classification in online counseling. It is based on a new, integrative system of categories, designed to improve the automated analysis of psychosocial online counseling conversations. Existing category systems, predominantly based on Motivational Interviewing (MI), are limited by their narrow focus and dependence on datasets derived mainly from face-to-face counseling. This limits the detailed examination of textual counseling conversations. In response, we developed a comprehensive new coding scheme that differentiates between 38 types of counselor and 28 types of client utterances, and created a labeled dataset consisting of about 2,800 messages from counseling conversations. We fine-tuned several models on our dataset to demonstrate its applicability. The data and models are publicly available to researchers and practitioners. Thus, our work contributes a new type of fine-grained conversational resource to the language resources community, extending existing datasets for social and mental-health dialogue analysis.
comment: Submitted to LREC 2026
DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting
Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede's VSM13 international surveys to understand the cultural alignment of these models. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model's alignment to reflect a specific country, to align flagship LLMs to different cultures. Our results show that DeepSeek-V3, V3.1, and OpenAI's GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.
MOA: Multi-Objective Alignment for Role-Playing Agents
Role-playing agents (RPAs) must simultaneously master many conflicting skills -- following multi-turn instructions, exhibiting domain knowledge, and adopting a consistent linguistic style. Existing work either relies on supervised fine-tuning (SFT) that over-fits surface cues and yields low diversity, or applies reinforcement learning (RL) that fails to learn multiple dimensions for comprehensive RPA optimization. We present MOA (Multi-Objective Alignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. MOA introduces a novel multi-objective optimization strategy that trains simultaneously on multiple fine-grained rubrics to boost optimization performance. Besides, to address the issues of model output diversity and quality, we have also employed thought-augmented rollout with off-policy guidance. Extensive experiments on challenging benchmarks such as PersonaGym and RoleMRC show that MOA enables an 8B model to match or even outperform strong baselines such as GPT-4o and Claude across numerous dimensions. This demonstrates the great potential of MOA in building RPAs that can simultaneously meet the demands of role knowledge, persona style, diverse scenarios, and complex multi-turn conversations.
Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner"). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization. In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1--precisely the opposite of what it was trained to do. Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.
comment: 70 pages, 47 figures
Interpreto: An Explainability Library for Transformers
Interpreto is a Python library for post-hoc explainability of text HuggingFace models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aiming to make explanations accessible to end users. It includes documentation, examples, and tutorials. Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries. The library is open source; install via pip install interpreto. Code and documentation are available at https://github.com/FOR-sight-ai/interpreto.
comment: Equal contribution: Poché and Jourdan
FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text
We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq
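The per-character statistics described above (aggregate counts, year-level counts, Unicode metadata) can be sketched with the Python standard library. This is an illustrative toy, not the FineFreq pipeline: the function name and input layout are assumptions, and the standard `unicodedata` module exposes the general category and character name but not script or block, so those two fields are left out.

```python
import unicodedata
from collections import Counter


def char_frequencies(texts_by_year):
    """Count characters per year and in aggregate.

    texts_by_year: dict mapping a year to an iterable of strings
    (hypothetical input layout). Returns a dict keyed by character.
    """
    total = Counter()
    by_year = {}
    for year, texts in texts_by_year.items():
        counts = Counter()
        for text in texts:
            counts.update(text)  # iterating a str yields its characters
        by_year[year] = counts
        total.update(counts)

    table = {}
    for ch, n in total.items():
        table[ch] = {
            "count": n,
            # keep only the years in which the character actually occurs
            "by_year": {y: c[ch] for y, c in by_year.items() if c[ch]},
            "category": unicodedata.category(ch),
            "name": unicodedata.name(ch, ""),
        }
    return table


table = char_frequencies({2013: ["aab"], 2025: ["ab€"]})
print(table["a"], table["€"])
```

No filtering is applied, which mirrors the abstract's point that symbols and cross-script characters are kept and can be filtered downstream using the metadata.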
d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that d-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
comment: 16 pages, 5 figures, 3 tables
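To make the idea of bottom-up advantage computation on a rollout tree concrete, here is a small sketch. It assumes one common convention: a leaf's value is its verifiable outcome reward, an internal node's value is the mean of its children's values, and a child's advantage is its value minus its parent's. The abstract does not state the exact definitions used in d-TreeRPO, so the `Node` class and both functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    reward: Optional[float] = None  # set only on leaves (outcome reward)
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0


def backup(node: Node) -> float:
    """Bottom-up pass: leaf value = reward, internal value = mean of children."""
    if not node.children:
        if node.reward is None:
            raise ValueError("leaf without a reward")
        node.value = node.reward
    else:
        node.value = sum(backup(c) for c in node.children) / len(node.children)
    return node.value


def assign_advantages(node: Node) -> None:
    """Top-down pass: each child's advantage is its value minus the parent's."""
    for child in node.children:
        child.advantage = child.value - node.value
        assign_advantages(child)


# Toy tree: branch A has one correct and one wrong leaf, branch B has two wrong leaves.
a = Node(children=[Node(reward=1.0), Node(reward=0.0)])
b = Node(children=[Node(reward=0.0), Node(reward=0.0)])
root = Node(children=[a, b])

backup(root)
assign_advantages(root)
print(root.value, a.advantage, b.advantage)
```

Under this convention every step on the path to a correct leaf receives its own signal, which is one way to read "fine-grained and verifiable step-wise reward signals".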
Neurosymbolic Information Extraction from Transactional Documents ICDAR 2025
This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
comment: 20 pages, 2 figures, accepted to IJDAR (ICDAR 2025)
Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection
Hate speech spreads widely online, harming individuals and communities. Automatic detection is therefore essential for large-scale moderation, yet it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's $κ$, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under a fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably reflect performance trends across classification models, correlating with human evaluations. We test this by examining whether LLM-generated annotations preserve the relative ordering of model performance derived from human evaluation (i.e., whether models ranked as more reliable by human annotators preserve the same order when evaluated with LLM-generated labels). Our results show that, although LLMs differ from humans at the instance level, they reproduce similar ranking and classification patterns, suggesting their potential as proxy evaluators. While not a substitute for human annotators, they might serve as a scalable proxy for evaluation in subjective NLP tasks.
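Whether LLM-generated labels preserve the model ordering obtained from human labels can be checked with a rank correlation. The sketch below uses Kendall's tau-a as one possible statistic; the abstract does not say which measure the authors use, and the scores are made-up numbers for illustration only.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four classifiers, evaluated against
# human labels and against LLM-generated labels.
human_f1 = [0.81, 0.74, 0.69, 0.77]
llm_f1 = [0.78, 0.70, 0.71, 0.75]

tau = kendall_tau(human_f1, llm_f1)
print(round(tau, 3))
```

A tau close to 1 means the two label sources rank the classifiers almost identically, even if the individual labels disagree.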
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment
Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and performs strongly on reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.
Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale LREC 2026
This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment of automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores at the opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, with the inter-annotator correlation improving. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment on annotation automation. These scores were similar to those of human annotators; however, several differences emerged, suggesting that while LLM-based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.
comment: 9 pages, 5 figures, 2 appendixes, submitted to LREC 2026
Rethinking Chain-of-Thought Reasoning for Videos
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
comment: Technical report
System Report for CCL25-Eval Task 10: Prompt-Driven Large Language Model Merge for Fine-Grained Chinese Hate Speech Detection CCL 2025
The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework's effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.
comment: Accepted at CCL 2025
Systematic Framework of Application Methods for Large Language Models in Language Sciences
Large Language Models (LLMs) are transforming language sciences. However, their widespread deployment currently suffers from methodological fragmentation and a lack of systematic soundness. This study proposes two comprehensive methodological frameworks designed to guide the strategic and responsible application of LLMs in language sciences. The first method-selection framework defines and systematizes three distinct, complementary approaches, each linked to a specific research goal: (1) prompt-based interaction with general-use models for exploratory analysis and hypothesis generation; (2) fine-tuning of open-source models for confirmatory, theory-driven investigation and high-quality data generation; and (3) extraction of contextualized embeddings for further quantitative analysis and probing of model internal mechanisms. We detail the technical implementation and inherent trade-offs of each method, supported by empirical case studies. Based on the method-selection framework, the second systematic framework proposed provides constructed configurations that guide the practical implementation of multi-stage research pipelines based on these approaches. We then conducted a series of empirical experiments to validate our proposed framework, employing retrospective analysis, prospective application, and an expert evaluation survey. By enforcing the strategic alignment of research questions with the appropriate LLM methodology, the frameworks enable a critical paradigm shift in language science research. We believe that this system is fundamental for ensuring reproducibility, facilitating the critical evaluation of LLM mechanisms, and providing the structure necessary to move traditional linguistics from ad-hoc utility to verifiable, robust science.
Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
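The contrast between sampled and beam-search candidates can be sketched with a simple agreement-based score. The estimator below (one minus the share of the most common answer, optionally weighted by normalized sequence probability) is an assumed, simplified member of the consistency-based family; the abstract does not specify the authors' estimators, and all function names and numbers are illustrative.

```python
import math
from collections import Counter


def consistency_uncertainty(answers, weights=None):
    """1 - (weighted) share of the most common normalized answer."""
    if not answers:
        raise ValueError("need at least one answer")
    if weights is None:
        weights = [1.0] * len(answers)  # plain sampling: every draw counts equally
    mass = Counter()
    for answer, w in zip(answers, weights):
        mass[answer.strip().lower()] += w
    return 1.0 - max(mass.values()) / sum(mass.values())


def beam_weights(logprobs):
    """Normalize sequence log-probabilities over the beam into weights."""
    m = max(logprobs)
    exps = [math.exp(lp - m) for lp in logprobs]
    z = sum(exps)
    return [e / z for e in exps]


# Multinomial sampling tends to return duplicates on short-form QA.
sampled = ["Paris", "Paris", "paris", "Lyon"]
u_sampled = consistency_uncertainty(sampled)

# Beam search returns distinct candidates, so they are weighted by probability
# (hypothetical log-probabilities).
beams = ["Paris", "Lyon", "Marseille"]
u_beam = consistency_uncertainty(beams, beam_weights([-0.2, -2.0, -3.0]))
print(u_sampled, u_beam)
```

Because beam search is deterministic for a fixed model and prompt, the second estimate does not vary across runs, which is the variance argument made in the abstract.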
RouteRAG: Efficient Retrieval-Augmented Generation from Text and Graph via Reinforcement Learning
Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce RouteRAG, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. RouteRAG jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that RouteRAG significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.
Source Coverage and Citation Bias in LLM-based vs. Traditional Search Engines
LLM-based Search Engines (LLM-SEs) introduce a new paradigm for information seeking. Unlike Traditional Search Engines (TSEs) (e.g., Google), these systems summarize results, often providing limited citation transparency. The implications of this shift remain largely unexplored, yet it raises key questions regarding trust and transparency. In this paper, we present a large-scale empirical study of LLM-SEs, analyzing 55,936 queries and the corresponding search results across six LLM-SEs and two TSEs. We confirm that LLM-SEs cite domain resources with greater diversity than TSEs. Indeed, 37% of domains are unique to LLM-SEs. However, certain risks still persist: LLM-SEs do not outperform TSEs in credibility, political neutrality and safety metrics. Finally, to understand the selection criteria of LLM-SEs, we perform a feature-based analysis to identify key factors influencing source choice. Our findings provide actionable insights for end users, website owners, and developers.
Advancing Text Classification with Large Language Models and Neural Attention Mechanisms
This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance. The framework includes text encoding, contextual representation modeling, attention-based enhancement, feature aggregation, and classification prediction. In the representation stage, deep semantic embeddings are obtained through large-scale pretrained language models, and attention mechanisms are applied to enhance the selective representation of key features. In the aggregation stage, global and weighted strategies are combined to generate robust text-level vectors. In the classification stage, a fully connected layer and Softmax output are used to predict class distributions, and cross-entropy loss is employed to optimize model parameters. Comparative experiments introduce multiple baseline models, including recurrent neural networks, graph neural networks, and Transformers, and evaluate them on Precision, Recall, F1-Score, and AUC. Results show that the proposed method outperforms existing models on all metrics, with especially strong improvements in Recall and AUC. In addition, sensitivity experiments are conducted on hyperparameters and data conditions, covering the impact of hidden dimensions on AUC and the impact of class imbalance ratios on Recall. The findings demonstrate that proper model configuration has a significant effect on performance and reveal the adaptability and stability of the model under different conditions. Overall, the proposed text classification method not only achieves effective performance improvement but also verifies its robustness and applicability in complex data environments through systematic analysis.
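The aggregation and classification stages described above (attention-weighted pooling, a fully connected layer with Softmax, cross-entropy loss) can be sketched in plain Python. Shapes, the single learned query vector, and all numbers are assumptions chosen for illustration; the abstract does not give the actual architecture details.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(token_vecs, query):
    """Weight each token vector by softmax(query . token) and sum them."""
    scores = [sum(q * t for q, t in zip(query, vec)) for vec in token_vecs]
    weights = softmax(scores)
    dim = len(token_vecs[0])
    return [sum(w * vec[d] for w, vec in zip(weights, token_vecs)) for d in range(dim)]


def classify(pooled, W, b):
    """Fully connected layer followed by softmax over classes."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


# Toy example: three 2-d token embeddings, two classes (all values hypothetical).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
pooled = attention_pool(tokens, query)
probs = classify(pooled, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
loss = cross_entropy(probs, label=0)
print(pooled, probs, loss)
```

In a real system the token vectors would come from a pretrained language model and the parameters would be trained by minimizing the loss; only the forward computation is shown here.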
Knowledge-Augmented Large Language Model Agents for Explainable Financial Decision-Making
This study investigates an explainable reasoning method for financial decision-making based on knowledge-enhanced large language model agents. To address the limitations of traditional financial decision methods that rely on parameterized knowledge, lack factual consistency, and miss reasoning chains, an integrated framework is proposed that combines external knowledge retrieval, semantic representation, and reasoning generation. The method first encodes financial texts and structured data to obtain semantic representations, and then retrieves task-related information from external knowledge bases using similarity computation. Internal representations and external knowledge are combined through weighted fusion, which ensures fluency while improving factual accuracy and completeness of generated content. In the reasoning stage, a multi-head attention mechanism is introduced to construct logical chains, allowing the model to present transparent causal relationships and traceability during generation. Finally, the model jointly optimizes task objectives and explanation consistency objectives, which enhances predictive performance and reasoning interpretability. Experiments on financial text processing and decision tasks show that the method outperforms baseline approaches in accuracy, text generation quality, and factual support, verifying the effectiveness of knowledge enhancement and explainable reasoning. Overall, the proposed approach overcomes the limitations of traditional models in semantic coverage and reasoning transparency, and demonstrates strong practical value in complex financial scenarios.
CourtPressGER: A German Court Decision to Press Release Summarization Dataset
Official court press releases from Germany's highest courts present and explain judicial rulings to the public, as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a 6.4k dataset of triples: rulings, human-drafted press releases, and synthetic prompts for LLMs to generate comparable releases. This benchmark trains and evaluates LLMs in generating accurate, readable summaries from long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking. Large LLMs produce high-quality drafts with minimal hierarchical performance loss; smaller models require hierarchical setups for long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.
comment: Preprint - This contribution was accepted at JURIX AI4A2J Workshop 2025
Language models as tools for investigating the distinction between possible and impossible natural languages
We argue that language models (LMs) have strong potential as investigative tools for probing the distinction between possible and impossible natural languages and thus uncovering the inductive biases that support human language learning. We outline a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages, supporting linking hypotheses to human cognition.
CONCUR: A Framework for Continual Constrained and Unconstrained Routing
AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
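The modular design described above, with one predictor per strategy and optional budget constraints, can be sketched as follows. This is a simplified per-task illustration under stated assumptions: each predictor returns a predicted accuracy for a task, each strategy has a fixed cost, and the budget applies to a single task. The names and numbers are hypothetical and not taken from CONCUR.

```python
def route(task, predictors, costs, budget=None):
    """Pick the strategy with the highest predicted accuracy.

    predictors: dict strategy -> callable(task) -> predicted accuracy
    costs:      dict strategy -> cost of running that strategy
    budget:     None for unconstrained routing, else a per-task cost cap
    Ties in predicted accuracy are broken in favor of the cheaper strategy.
    """
    feasible = [s for s in predictors if budget is None or costs[s] <= budget]
    if not feasible:
        raise ValueError("no strategy fits the budget")
    return max(feasible, key=lambda s: (predictors[s](task), -costs[s]))


# Hypothetical per-strategy predictors (constant functions for brevity).
predictors = {
    "small-model-greedy": lambda task: 0.60,
    "large-model-cot": lambda task: 0.90,
}
costs = {"small-model-greedy": 1.0, "large-model-cot": 10.0}

unconstrained = route("What is 17 * 23?", predictors, costs)
constrained = route("What is 17 * 23?", predictors, costs, budget=5.0)

# Continual setting: a new strategy is added by registering one more
# predictor; the existing predictors are left untouched.
predictors["medium-model-sampling"] = lambda task: 0.80
costs["medium-model-sampling"] = 4.0
after_update = route("What is 17 * 23?", predictors, costs, budget=5.0)
print(unconstrained, constrained, after_update)
```

Keeping predictors separate is what makes the update a one-line change here; a single jointly trained router would need retraining to cover the new strategy.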
Are Hypervectors Enough? Single-Call LLM Reasoning over Knowledge Graphs
Recent advances in large language models (LLMs) have enabled strong reasoning over both structured and unstructured knowledge. When grounded on knowledge graphs (KGs), however, prevailing pipelines rely on heavy neural encoders to embed and score symbolic paths or on repeated LLM calls to rank candidates, leading to high latency, GPU cost, and opaque decisions that hinder faithful, scalable deployment. We propose PathHD, a lightweight and encoder-free KG reasoning framework that replaces neural path scoring with hyperdimensional computing (HDC) and uses only a single LLM call per query. PathHD encodes relation paths into block-diagonal GHRR hypervectors, ranks candidates with blockwise cosine similarity and Top-K pruning, and then performs a one-shot LLM adjudication to produce the final answer together with cited supporting paths. Technically, PathHD is built on three ingredients: (i) an order-aware, non-commutative binding operator for path composition, (ii) a calibrated similarity for robust hypervector-based retrieval, and (iii) a one-shot adjudication step that preserves interpretability while eliminating per-path LLM scoring. On WebQSP, CWQ, and the GrailQA split, PathHD (i) attains comparable or better Hits@1 than strong neural baselines while using one LLM call per query; (ii) reduces end-to-end latency by $40-60\%$ and GPU memory by $3-5\times$ thanks to encoder-free retrieval; and (iii) delivers faithful, path-grounded rationales that improve error diagnosis and controllability. These results indicate that carefully designed HDC representations provide a practical substrate for efficient KG-LLM reasoning, offering a favorable accuracy-efficiency-interpretability trade-off.
Identifying Bias in Machine-generated Text Detection
The meteoric rise in text generation capability has been accompanied by parallel growth in interest in machine-generated text detection: the capability to identify whether a given text was generated using a model or written by a person. While detection models show strong performance, they have the capacity to cause significant negative impacts. We explore potential biases in English machine-generated text detection systems. We curate a dataset of student essays and assess 16 different detection systems for bias across four attributes: gender, race/ethnicity, English-language learner (ELL) status, and economic status. We evaluate these attributes using regression-based models to determine the significance and power of the effects, as well as performing subgroup analysis. We find that while biases are generally inconsistent across systems, there are several key issues: several models tend to classify disadvantaged groups as machine-generated, ELL essays are more likely to be classified as machine-generated, economically disadvantaged students' essays are less likely to be classified as machine-generated, and non-White ELL essays are disproportionately classified as machine-generated relative to their White counterparts. Finally, we perform human annotation and find that while humans perform generally poorly at the detection task, they show no significant biases on the studied attributes.
comment: 13 pages, 2 figures, 7 tables
Training-free Context-adaptive Attention for Efficient Long Context Modeling
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. These capabilities stem primarily from the self-attention mechanism, which enables modeling of long-range dependencies. However, the quadratic complexity of self-attention with respect to sequence length poses significant computational and memory challenges, especially as sequence length extends to extremes. While various sparse attention and KV cache compression methods have been proposed to improve efficiency, they often suffer from limitations such as reliance on fixed patterns, inability to handle both prefilling and decoding stages, or the requirement for additional training. In this paper, we propose Training-free Context-adaptive Attention (TCA-Attention), a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. Our method consists of two lightweight phases: i) an offline calibration phase that determines head-specific sparsity budgets via a single forward pass, and ii) an online token selection phase that adaptively retains core context tokens using a lightweight redundancy metric. TCA-Attention provides a unified solution that accelerates both prefilling and decoding while reducing KV cache memory footprint, without requiring parameter updates or architectural changes. Theoretical analysis shows that our approach maintains bounded approximation error. Extensive experiments demonstrate that TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention across various benchmarks, offering a practical plug-and-play solution for efficient long-context inference.
CORE: A Conceptual Reasoning Layer for Large Language Models
Large language models handle single-turn generation well, but multi-turn interactions still require the model to reconstruct user intent and task state from an expanding token history because internal representations do not persist across turns. This token-first paradigm leads to drift, inconsistent reasoning modes, and growing prompts as conversations deepen. We propose CORE, a concept-first interaction layer that improves multi-turn stability without modifying model weights. CORE combines a small library of universal cognitive operators with a persistent Local Concept - a compact semantic state capturing the task, constraints, preferences, and intermediate results. Each model call receives only this concept state, the user's latest instruction, and the selected operator, eliminating the need to replay full history. A preliminary prototype simulating CORE's behavior shows about 42% reduction in cumulative prompt tokens, though this number reflects prototype conditions and should not be interpreted as a real-world performance estimate. CORE offers a model-agnostic mechanism that separates conceptual reasoning from language generation, suggesting a scalable direction for more stable multi-turn systems.
comment: Independent system-level architectural proposal with accompanying proof-of-concept
Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment
Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts, cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially susceptible to misalignment. To this end, we propose two complementary metrics for identifying these conflicts: a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-Tau Distance measure. Building on this insight, we design an algorithm named Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) that targets high-conflict QA pairs for additional feedback, refining both the reward model and policy efficiently. Experiments on two alignment tasks demonstrate that our approach enhances general alignment performance, even when trained with a biased proxy reward. Our work provides a new lens for interpreting alignment failures and offers a principled pathway for targeted refinement in LLM training.
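For readers unfamiliar with the Kendall-Tau distance named above, a small sketch follows. It computes the standard normalized Kendall tau distance between two score lists; how the paper applies it (which scores are compared, and how ties are treated) is not stated in the abstract, so the pairing of reward-model scores with policy scores and the tie handling here are assumptions.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized Kendall tau distance between two score lists.

    Returns the fraction of item pairs that the two lists order in
    opposite directions: 0.0 for identical orderings, 1.0 for fully
    reversed ones. Pairs tied in either list count as agreeing
    (a simplification chosen for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("score lists must have the same length")
    n = len(a)
    if n < 2:
        return 0.0
    discordant = sum(
        1 for i, j in combinations(range(n), 2)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )
    return discordant / (n * (n - 1) / 2)


# Hypothetical scores for four responses to the same prompt.
reward_model_scores = [0.9, 0.2, 0.5, 0.7]
policy_scores = [0.8, 0.3, 0.6, 0.1]
print(kendall_tau_distance(reward_model_scores, policy_scores))
```

A larger distance indicates that the two scorers rank the same responses differently, which is the kind of global disagreement a conflict-aware sampler could use to prioritize prompts.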
Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
PARAN: Persona-Augmented Review ANswering system on Food Delivery Review Dataset
Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.
Workflow is All You Need: Escaping the "Statistical Smoothing Trap" via High-Entropy Information Foraging and Adversarial Pacing
Central to long-form text generation in vertical domains is the "impossible trinity" confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text. Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system, built on a previous-generation model (DeepSeek-V3-0324), achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).
comment: 22 pages, 8 figures. Includes an ecological validity blind test where the Agentic Workflow achieved a 25% acceptance rate in top-tier media, decisively outperforming the SOTA Zero-shot baseline (0%). Features the DNFO-v5 ontology
Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models
We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.
comment: Accepted as a full research paper for the 16th International Conference on Learning Analytics and Knowledge (LAK'26)
What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models
This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.
Parallel Decoder Transformer: Model-Internal Parallel Decoding with Speculative Invariance via Note Conditioning
Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While "Decomposition-and-Fill" methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from coherence drift due to the lack of cross-stream communication. In this work, we introduce the Parallel Decoder Transformer (PDT), a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. Instead of retraining the base model, PDT injects lightweight Speculative Note Conditioning (SNC) adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a speculative consensus problem, where sibling streams broadcast semantic "notes" to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching 77.8% precision in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.
Diffusion Is Your Friend in Show, Suggest and Tell
Diffusion Denoising models have demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing it. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points, respectively. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.
Exploring LLMs for Scientific Information Extraction Using The SciEx Framework
Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.
BAMBO: Construct Ability and Efficiency LLM Pareto Set via Bayesian Adaptive Multi-objective Block-wise Optimization
Constructing a Pareto set is pivotal for navigating the capability-efficiency trade-offs in Large Language Models (LLMs); however, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the "curse of dimensionality," rendering the search space computationally intractable. To resolve this dichotomy, we propose BAMBO (Bayesian Adaptive Multi-objective Block-wise Optimization), a novel framework that automatically constructs the LLM Pareto set. BAMBO renders the search tractable by introducing a Hybrid Optimal Block Partitioning strategy. Formulated as a 1D clustering problem, this strategy leverages a dynamic programming approach to optimally balance intra-block homogeneity and inter-block information distribution, thereby dramatically reducing dimensionality without sacrificing critical granularity. The entire process is automated within an evolutionary loop driven by the q-Expected Hypervolume Improvement (qEHVI) acquisition function. Experiments demonstrate that BAMBO discovers a superior and more comprehensive Pareto frontier than baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: https://github.com/xin8coder/BAMBO.
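To make the notion of a Pareto set concrete, here is a minimal sketch of non-dominated filtering over two objectives. It is not the BAMBO search itself (no block partitioning, no qEHVI acquisition); the `(capability, efficiency)` tuple layout, the higher-is-better convention, and the candidate values are assumptions chosen for the example.

```python
from typing import List, Tuple

# (capability, efficiency); both are treated as higher-is-better here.
Point = Tuple[float, float]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if it is no worse on both objectives and better on one."""
    return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])


def pareto_set(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical merged-model candidates.
candidates = [
    (0.80, 0.20),
    (0.70, 0.50),
    (0.65, 0.45),  # dominated by (0.70, 0.50)
    (0.50, 0.90),
    (0.50, 0.60),  # dominated by (0.50, 0.90)
]
print(pareto_set(candidates))
```

The filter is quadratic in the number of candidates, which is fine for illustration; the hard part the abstract addresses is generating good candidates in a tractable search space, not filtering them.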
Neural Diversity Regularizes Hallucinations in Language Models
Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. While existing mitigation strategies largely target accuracy, we provide the first formal tail bounds for hallucination probability in ensembled language models, reframing it as a second-moment reliability problem and explaining 94.3% of empirical reliability variation seen across parallel configurations. We introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and reduce hallucinations by up to 25.6% (and 14.6% on average) while preserving general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational studies indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different optimal amounts of neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B
The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India's most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
comment: Work in progress
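For reference, the similarity metric named in the TRepLiNa abstract can be written out. The abstract does not say which kernel is used, so the linear form below is an assumption; it is the standard linear CKA between two representation matrices $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$ computed on the same $n$ inputs (for example, parallel sentences in two languages at one layer).

```latex
\[
\mathrm{CKA}(X, Y)
  = \frac{\lVert \tilde{Y}^{\top} \tilde{X} \rVert_F^{2}}
         {\lVert \tilde{X}^{\top} \tilde{X} \rVert_F \,
          \lVert \tilde{Y}^{\top} \tilde{Y} \rVert_F},
\qquad
\tilde{X} = X - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} X,
\quad
\tilde{Y} = Y - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} Y
\]
```

Here $\mathbf{1}$ is the all-ones vector of length $n$, so $\tilde{X}$ and $\tilde{Y}$ are the column-centered representations. The value lies in $[0, 1]$, and a training objective that encourages alignment would typically reward values closer to 1.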
Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses
How related are the representations learned by neural language models, translation models, and language tagging tasks? We answer this question by adapting an encoder-decoder transfer learning method from computer vision to investigate the structure among 100 different feature spaces extracted from hidden representations of various networks trained on language tasks. This method reveals a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings. We call this low-dimensional structure a language representation embedding because it encodes the relationships between representations needed to process language for a variety of NLP tasks. We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI. Additionally, we find that the principal dimension of this structure can be used to create a metric which highlights the brain's natural language processing hierarchy. This suggests that the embedding captures some part of the brain's natural language representation structure.
comment: Accepted to Advances in Neural Information Processing Systems 34 (2021). Revised to include voxel selection details
Constrained Discrete Diffusion NeurIPS 2025
Discrete diffusion models are a class of generative models that construct sequences by progressively denoising samples from a categorical noise distribution. Beyond their rapidly growing ability to generate coherent natural language, these models present a new and important opportunity to enforce sequence-level constraints, a capability that current autoregressive models cannot natively provide. This paper capitalizes on this opportunity by introducing Constrained Discrete Diffusion (CDD), a novel integration of differentiable constraint optimization within the diffusion process to ensure adherence to constraints, logic rules, or safety requirements for generated sequences. Unlike conventional text generators that often rely on post-hoc filtering or model retraining for controllable generation, CDD directly imposes constraints into the discrete diffusion sampling process, resulting in a training-free and effective approach. Experiments in toxicity-controlled text generation, property-constrained molecule design, and instruction-constrained text completion demonstrate that CDD achieves zero constraint violations in a diverse array of tasks while preserving fluency, novelty, and coherence, and that it outperforms autoregressive and existing discrete diffusion approaches.
comment: Published at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn't Matter (Much) AACL 2025
Knowledge distillation (KD) is a popular method of transferring knowledge from a large "teacher" model to a small "student" model. Previous work has explored various layer-selection strategies (e.g., forward matching and in-order random matching) for intermediate-layer matching in KD, where a student layer is forced to resemble a certain teacher layer. In this work, we revisit such layer-selection strategies and observe an intriguing phenomenon that layer-selection strategy does not matter (much) in intermediate-layer matching -- even seemingly nonsensical matching strategies such as reverse matching still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student's perspective. Our work sheds light on KD practice, as layer-selection strategies may not be the main focus of KD system design, and vanilla forward matching works well in most setups.
comment: Accepted at IJCNLP-AACL 2025
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation
Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including a dedicated multilingual track. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
comment: Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard ; Code: https://github.com/huggingface/open_asr_leaderboard
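The two metrics reported by the leaderboard can be illustrated with a short sketch. This is not the leaderboard's own evaluation code: it assumes text normalization has already been applied, splits words on whitespace, and uses hypothetical function names.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds: float,
                             compute_seconds: float) -> float:
    """RTFx = audio duration / processing time; higher means faster."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(inverse_real_time_factor(audio_seconds=600.0, compute_seconds=3.0))
```

Reporting both numbers side by side is what allows the accuracy-efficiency comparison the abstract describes: a system with slightly higher WER may still be preferable for long-form or offline use if its RTFx is much larger.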
SAFT: Structure-Aware Fine-Tuning of LLMs for AMR-to-Text Generation KDD2025
Large Language Models (LLMs) are increasingly applied to tasks involving structured inputs such as graphs. Abstract Meaning Representations (AMRs), which encode rich semantics as directed graphs, offer a rigorous testbed for evaluating LLMs on text generation from such structures. Yet, current methods often arbitrarily linearize AMRs, discarding key structural cues, or rely on architectures incompatible with standard LLMs. We introduce SAFT, a structure-aware fine-tuning approach that injects graph topology into pretrained LLMs without architectural changes. We compute direction-sensitive positional encodings from the magnetic Laplacian of transformed AMRs and project them into the embedding space of the LLM. While possibly applicable to any graph-structured inputs, we focus on AMR-to-text generation as a representative and challenging benchmark. SAFT sets a new state-of-the-art on AMR 3.0 with a 3.5 BLEU improvement over baselines. Gains scale with graph complexity, highlighting the value of structure-aware representations in enhancing LLM performance. SAFT offers a general and effective pathway for bridging structured data and language models.
comment: Accepted at the KDD2025 Workshop on Structured Knowledge for LLMs
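As a rough sketch of direction-sensitive positional encodings, the snippet below builds one common form of the magnetic Laplacian and uses its lowest eigenvectors as node features. The symmetrization, the potential `q`, the number of eigenvectors `k`, and the function names are assumptions made for illustration. The AMR transformation and the projection into the LLM embedding space described in the abstract are not reproduced here.

```python
import numpy as np


def magnetic_laplacian(adj: np.ndarray, q: float = 0.25) -> np.ndarray:
    """Return L = D_s - A_s * exp(i * theta) for a directed adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    sym = (adj + adj.T) / 2.0                # symmetrized edge weights A_s
    theta = 2.0 * np.pi * q * (adj - adj.T)  # antisymmetric phase encodes direction
    degree = np.diag(sym.sum(axis=1))        # degree matrix D_s of A_s
    return degree - sym * np.exp(1j * theta)


def positional_encodings(adj: np.ndarray, q: float = 0.25, k: int = 2):
    """Stack real and imaginary parts of the k lowest eigenvectors per node."""
    lap = magnetic_laplacian(adj, q)
    eigvals, eigvecs = np.linalg.eigh(lap)   # Hermitian input, ascending eigenvalues
    vecs = eigvecs[:, :k]
    return eigvals[:k], np.concatenate([vecs.real, vecs.imag], axis=1)
```

Reversing every edge conjugates the Laplacian, which is how edge direction enters the encoding. Eigenvectors are defined only up to a complex phase, so a practical pipeline would need some normalization convention before feeding them to a model.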
Revealing economic facts: LLMs know more than they say
We investigate whether the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models' text outputs. This suggests that hidden states capture richer economic information than the responses of the LLMs reveal directly. A learning curve analysis indicates that only a few dozen labelled examples are sufficient for training. We also propose a transfer learning method that improves estimation accuracy without requiring any labelled data for the target variable. Finally, we demonstrate the practical utility of hidden-state representations in super-resolution and data imputation tasks.
comment: 34 pages, 17 figures
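The "simple linear model trained on the hidden states" could look like the ridge-regression probe below. This is a sketch under assumptions: the hidden states are taken to be a precomputed array with one row per entity (for example, one county), and the ridge penalty `l2` and the function names are illustrative choices that the abstract does not specify.

```python
import numpy as np


def fit_linear_probe(hidden_states: np.ndarray, targets: np.ndarray, l2: float = 1.0) -> np.ndarray:
    """Fit ridge regression with a bias term; returns weights with the bias last."""
    n_samples = hidden_states.shape[0]
    design = np.hstack([hidden_states, np.ones((n_samples, 1))])
    penalty = l2 * np.eye(design.shape[1])
    penalty[-1, -1] = 0.0  # leave the bias term unregularized
    return np.linalg.solve(design.T @ design + penalty, design.T @ targets)


def predict(weights: np.ndarray, hidden_states: np.ndarray) -> np.ndarray:
    """Apply the fitted probe to new hidden states."""
    return hidden_states @ weights[:-1] + weights[-1]
```

With only a few dozen labelled examples and hidden states of several thousand dimensions, the regularization strength matters, so `l2` would normally be chosen by cross-validation.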
Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training
Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
comment: 7 pages, 5 figures
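For readers unfamiliar with DPO, the per-pair objective that consumes such rejected-preferred pairs can be sketched as follows. This is the standard DPO loss and not code from the paper. The inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model, and `beta=0.1` is an illustrative default.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (preferred, rejected) pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss equals log 2 when the policy matches the reference and decreases as the policy raises the preferred answer's likelihood relative to the rejected one.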
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.67% on the public LoCoMo benchmark, a nearly 3% improvement over LangMem, the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement over A-Mem, the previous state-of-the-art. O-Mem also improves token efficiency and interaction response time compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
Leveraging Machine Learning to Identify Gendered Stereotypes and Body Image Concerns on Diet and Fitness Online Forums
The pervasive expectations about ideal body types in Western society can lead to body image concerns, dissatisfaction, and in extreme cases, eating disorders and other psychopathologies related to body image. While previous research has focused on online pro-anorexia communities glorifying the "thin ideal," less attention has been given to the broader spectrum of body image concerns or how emerging disorders like muscle dysmorphia ("bigorexia") present on online platforms. To address this gap, we analyze 46 Reddit forums related to diet, fitness, and mental health. We map these communities along gender and body ideal dimensions, revealing distinct patterns of emotional expression and community support. Feminine-oriented communities, especially those endorsing the thin ideal, express higher levels of negative emotions and receive caring comments in response. In contrast, muscular ideal communities display less negativity, regardless of gender orientation, but receive aggressive compliments in response, marked by admiration and toxicity. Mental health discussions align more with thin ideal, feminine-leaning spaces. By uncovering these gendered emotional dynamics, our findings can inform the development of moderation strategies that foster supportive interactions while reducing exposure to harmful content.
Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support AAMAS 2026
LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. Taken together, these directions shift MAS research from building oracle-like answer engines to cultivating AI teammates that co-reason with their human partners over the causal structure of shared decisions, advancing the design of effective human-AI teams.
comment: Under review at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Blue Sky Track
Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations
Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.
Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation
Topic relevance between a query and a document is a key component of social search, as it measures how well a document matches the user's need. In most social search scenarios, such as Dianping, modeling search relevance faces two challenges. First, many documents in social search are very long and contain much redundant information. Second, training data for search relevance models is difficult to obtain, especially for multi-class relevance models. To tackle these two problems, we first take the query concatenated with the query-based summary and the document summary without the query as the input of the topic relevance model, which helps the model learn the degree of relevance between the query and the core topic of the document. Then, we use the language understanding and generation abilities of a large language model (LLM) to rewrite and generate queries from the queries and documents in existing training data, which yields new query-document pairs as training data. Extensive offline experiments and online A/B tests show that the proposed approaches effectively improve the performance of relevance modeling.
An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation
Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have increasingly been used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for offline, smartphone-based solutions. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed to provide mental health and emotional support. EmoSApp leverages a language model, specifically LLaMA-3.2-1B-Instruct, which is fine-tuned and quantized on a custom-curated "Knowledge Dataset" comprising 14,582 mental health QA pairs along with multi-turn conversational data, enabling robust domain expertise and fully on-device inference on resource-constrained smartphones. Through qualitative evaluation with students and mental health professionals, we demonstrate that EmoSApp can respond coherently and empathetically, provide relevant suggestions for users' mental health problems, and maintain interactive dialogue. Additionally, quantitative evaluations on nine commonsense and reasoning benchmarks, along with two mental-health-specific datasets, demonstrate EmoSApp's effectiveness in low-resource settings. By prioritizing on-device deployment and specialized domain-specific adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health support.
SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation
Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a theoretical analysis of the challenges inherent in end-to-end speech retrieval and introduce architectural principles for effective speech-to-document matching. Extensive experiments demonstrate the robustness of our approach across diverse acoustic conditions and speaker variations, paving the way for a new paradigm in multimodal SLLM retrieval systems.
Attention Sinks in Diffusion Language Models
Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
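One simple way to flag attention sinks, sketched below, is to mark key positions whose average received attention exceeds a threshold. Both the criterion and the `threshold=0.3` default are assumptions for illustration; the paper's own detection rule may differ. For a DLM, the same function could be applied to the attention map of each denoising step to observe how sink positions shift during generation.

```python
import numpy as np


def find_attention_sinks(attention: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Return key positions whose mean received attention exceeds `threshold`.

    `attention` is a (num_queries, num_keys) matrix whose rows sum to 1,
    e.g. one head's softmax output, or an average over heads.
    """
    attention = np.asarray(attention, dtype=float)
    received = attention.mean(axis=0)  # average over query positions
    return [int(i) for i in np.flatnonzero(received > threshold)]
```

Masking a detected sink, as in the robustness experiment the abstract describes, would amount to setting that key's attention logits to negative infinity before the softmax.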
ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents AAAI2026
Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.
comment: Submitted to AAAI2026
Two Causal Principles for Improving Visual Dialog CVPR 2020
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to the state-of-the-art performance on the leader-board. Such a major improvement is only due to our careful inspection on the causality behind the model and data, finding that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests: we should remove the direct input of the dialog history to the answer model, otherwise a harmful shortcut bias will be introduced; Principle 2 says: there is an unobserved confounder for history, question, and answer, leading to spurious correlations from training data. In particular, to remove the confounder suggested in Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from the traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable in any VisDial model. The code is available at https://github.com/simpleshinobu/visdial-principles.
comment: Accepted by CVPR 2020
Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation
Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a unified music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. In particular, during the speech-singing joint training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance Vevo2's ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2's effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are available at https://versasinger.github.io/.
comment: We will release code and model checkpoints at https://github.com/open-mmlab/Amphion
Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling
State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences
Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users' interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks -- including Hofstede's cultural dimensions, Schwartz's basic values, the World Values Survey, and Big Five OCEAN traits -- GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.
Forgetting-MarI: LLM Unlearning via Marginal Information Regularization
As AI models are trained on ever-expanding datasets, the ability to remove the influence of specific data from trained models has become essential for privacy protection and regulatory compliance. Unlearning addresses this challenge by selectively removing parametric knowledge from the trained models without retraining from scratch, which is critical for resource-intensive models such as Large Language Models (LLMs). Existing unlearning methods often degrade model performance by removing more information than necessary when attempting to "forget" specific data. We introduce Forgetting-MarI, an LLM unlearning framework that provably removes only the additional (marginal) information contributed by the data to be unlearned, while preserving the information supported by the data to be retained. By penalizing marginal information, our method yields an explicit upper bound on the unlearn dataset's residual influence in the trained models, providing provable undetectability. Extensive experiments confirm that our approach outperforms current state-of-the-art unlearning methods, delivering reliable forgetting and better preserved general model performance across diverse benchmarks. This advancement represents an important step toward making AI systems more controllable and compliant with privacy and copyright regulations without compromising their effectiveness.
Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification
The need for interpretability in deep learning has driven interest in counterfactual explanations, which identify minimal changes to an instance that change a model's prediction. Current counterfactual (CF) generation methods require task-specific fine-tuning and produce low-quality text. Large Language Models (LLMs), though effective for high-quality text generation, struggle with label-flipping counterfactuals (i.e., counterfactuals that change the prediction) without fine-tuning. We introduce two simple classifier-guided approaches to support counterfactual generation by LLMs, eliminating the need for fine-tuning while preserving the strengths of LLMs. Despite their simplicity, our methods outperform state-of-the-art counterfactual generation methods and are effective across different LLMs, highlighting the benefits of guiding counterfactual generation by LLMs with classifier information. We further show that data augmentation by our generated CFs can improve a classifier's robustness. Our analysis reveals a critical issue in counterfactual generation by LLMs: LLMs rely on parametric knowledge rather than faithfully following the classifier.
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
Studying the Effects of Collaboration in Interactive Theme Discovery Systems
NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.
comment: Added author in pre-print
Generalised Medical Phrase Grounding
Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence--anatomy box alignment datasets and fine-tuning on report sentence--human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
comment: 10 pages
Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning
Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
comment: 14 pages, 8 figures, 5 tables
Understanding World or Predicting Future? A Comprehensive Survey of World Models
The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including generative games, autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in https://github.com/tsinghua-fib-lab/World-Model.
comment: Extended version of the original ACM CSUR paper, 49 pages, 6 figures, 8 tables
TCNN: Triple Convolutional Neural Network Models for Retrieval-based Question Answering System in E-commerce
Automatic question-answering (QA) systems have boomed during the last few years, and commonly used techniques can be roughly categorized into Information Retrieval (IR)-based and generation-based. A key approach in IR-based models is to retrieve the knowledge entries most similar to a given query from a QA knowledge base, and then rerank those knowledge entries with semantic matching models. In this paper, we aim to improve an IR-based e-commerce QA system, AliMe, with the proposed text matching models, including a basic Triple Convolutional Neural Network (TCNN) model and two Attention-based TCNN (ATCNN) models. Experimental results show their effect.
comment: 2 pages
ARE: Scaling Up Agent Environments and Evaluations
We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions to build complex and diverse environments, each with their own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguities and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their domains. In AI's second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.
comment: Updated authors order and acknowledgement
Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning
Maintaining consistent model performance across domains is a fundamental challenge in machine learning. While recent work has explored using LLM-generated data for fine-tuning, its impact on cross-domain generalization remains poorly understood. This paper presents a systematic analysis revealing that fine-tuning with LLM-generated data not only improves target task performance but also reduces non-target task degradation compared to fine-tuning with ground truth data. By analyzing the data sequences in tasks from various domains, we demonstrate that this enhancement of non-target task robustness stems from the reduction of high-perplexity tokens in LLM-generated sequences. Following these findings, we show that masking high-perplexity tokens in ground truth training data preserves non-target task performance to a degree comparable to using LLM-generated data. Extensive experiments across different model families and scales, including Gemma 2 IT 2B, Llama 3 8B Instruct, and three additional models, agree with our findings. To the best of our knowledge, this is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning, offering valuable insights for developing more robust fine-tuning strategies.
comment: The Thirty-ninth Annual Conference on Neural Information Processing Systems
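The masking idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold value, and the choice of which model supplies the token log-probabilities are all assumptions made here for the example.

```python
import math

def token_perplexities(token_logprobs):
    """Per-token perplexity: exp of the negative natural-log probability."""
    return [math.exp(-lp) for lp in token_logprobs]

def loss_mask(token_logprobs, threshold):
    """Return 1 for tokens kept in the loss and 0 for high-perplexity tokens.

    `threshold` is a hypothetical hyperparameter; the abstract does not state one.
    """
    return [1 if ppl <= threshold else 0 for ppl in token_perplexities(token_logprobs)]

def masked_nll(token_logprobs, mask):
    """Mean negative log-likelihood over the unmasked tokens only."""
    kept = [-lp for lp, m in zip(token_logprobs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

# Toy sequence: the third token is very unlikely under the scoring model.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.01), math.log(0.8)]
mask = loss_mask(logprobs, threshold=10.0)
loss = masked_nll(logprobs, mask)
```

In this toy sequence only the third token (perplexity near 100) falls above the assumed threshold, so it is excluded from the fine-tuning loss while the remaining tokens contribute as usual.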
Forensic deepfake audio detection using segmental speech features
This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. These features are highly interpretable because of their close relationship with human articulatory processes and are expected to be more difficult for deepfake models to replicate. The results demonstrate that certain segmental features commonly used in forensic voice comparison (FVC) are effective in identifying deepfakes, whereas some global features provide little value. These findings underscore the need to approach audio deepfake detection using methods that are distinct from those employed in traditional FVC, and offer a new perspective on leveraging segmental features for this purpose. In addition, the present study proposes a speaker-specific framework for deepfake detection, which differs fundamentally from the speaker-independent systems that dominate current benchmarks. While speaker-independent frameworks aim at broad generalization, the speaker-specific approach offers advantages in forensic contexts where case-by-case interpretability and sensitivity to individual phonetic realization are essential.
comment: Accepted for publication in Forensic Science International
UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Computer-use agents face a fundamental limitation. They rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves a 22% relative improvement while executing 11% faster than existing approaches on average. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.
Reparameterized LLM Training via Orthogonal Equivalence Transformation NeurIPS 2025
While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
comment: NeurIPS 2025 (40 pages, 26 figures, project page: https://spherelab.ai/poet/, v4: added experiments of finetuning and larger-scale pretraining)
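The spectral-preservation claim in the abstract above can be checked numerically on a toy case. The sketch below assumes the reparameterization takes the product form W = R W0 P, with R and P orthogonal and W0 fixed; this form, the 2x2 size, and all helper names are assumptions chosen for illustration, not details confirmed by the abstract.

```python
import math

def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def singular_values_2x2(w):
    """Singular values from the squared Frobenius norm and |det|.

    The squared singular values are the roots of x^2 - ||W||_F^2 x + det(W)^2 = 0.
    """
    fro2 = sum(x * x for row in w for x in row)
    det = abs(w[0][0] * w[1][1] - w[0][1] * w[1][0])
    disc = math.sqrt(max(fro2 * fro2 - 4 * det * det, 0.0))
    return math.sqrt((fro2 + disc) / 2), math.sqrt(max((fro2 - disc) / 2, 0.0))

w0 = [[2.0, 1.0], [0.0, 1.0]]           # fixed weight matrix (toy values)
r, p = rotation(0.7), rotation(-1.3)    # stand-ins for the learnable orthogonal factors
w = matmul(matmul(r, w0), p)            # assumed form W = R W0 P

before = singular_values_2x2(w0)
after = singular_values_2x2(w)
```

Here `before` and `after` agree up to floating-point error, because multiplying by orthogonal matrices on either side leaves the singular values of W0 unchanged.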
Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation AAAI-26
Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple raters for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, and simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that accounting for human disagreement may come with $N \times K$ at no more than 1000 (and often much lower) for every dataset tested on at least one metric. Moreover, this minimal $N \times K$ almost always occurred for $K > 10$. Furthermore, the nature of the trade-off between $K$ and $N$, or whether one exists at all, depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher levels of $K$. Our methods can be used to help ML practitioners get more effective test data by finding the optimal metrics and number of items and annotations per item to collect to get the most reliability for their budget.
comment: Accepted at AAAI-26
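To make the fixed-budget constraint in the abstract above concrete, the sketch below enumerates the $(N, K)$ pairs that spend a budget of $N \times K$ exactly. The function name and the `min_k` parameter are hypothetical, and the sketch only lists candidate configurations; it does not reproduce the paper's reliability analysis.

```python
def budget_configs(budget, min_k=1):
    """All (N, K) pairs with N * K == budget exactly and K >= min_k."""
    return [(budget // k, k)
            for k in range(min_k, budget + 1)
            if budget % k == 0]

# Candidate designs for a budget of 1000 annotations with more than 10 raters per item.
configs = budget_configs(1000, min_k=11)
```

Each pair trades breadth (more items) against depth (more responses per item); a practitioner would then score each candidate with the evaluation metric of interest.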
Advancing AI Research Assistants with Expert-Involved Learning
Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe two capabilities: full-length article summarization and fine-grained figure interpretation. Using uniform protocols and blinded PhD-level evaluation, we find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning. We later observe that prompt engineering and lightweight fine-tuning substantially improve textual coverage, and a compute-scaled inference strategy enhances visual question answering. We build an ARIEL agent that integrates textual and visual cues, and we show it can propose testable mechanistic hypotheses. ARIEL delineates current strengths and limitations of foundation models, and provides a reproducible platform for advancing trustworthy AI in biomedicine.
comment: 36 pages, 7 figures
Human-Computer Interaction
Building a Data Dashboard for Magic: The Gathering: Initial Design Considerations
This paper presents the initial stages of a design study aimed at developing a dashboard to visualize gameplay data of the Commander format from Magic: The Gathering. We conducted a user-task analysis to identify requirements for a data visualization dashboard tailored to the Commander format. Afterwards, we proposed a design for the dashboard leveraging visualizations to address players' needs and pain points for typical data analysis tasks in the context domain. Then, we followed up with a structured user test to evaluate players' comprehension of and preferences for data visualizations. Results show that players prioritize contextually relevant, outcome-driven metrics over peripheral ones, and that canonical charts like heatmaps and line charts support higher comprehension than complex ones such as scatterplots or icicle plots. Our findings also highlight the importance of localized views, user customization, and progressive disclosure, emphasizing that adaptability and contextual relevance are as essential as accuracy in effective dashboard design. Our study contributes practical design guidelines for data visualization in gaming contexts and highlights broader implications for engagement-driven dashboards.
Smart, simple, sincere - Why and how we should rethink connected things in our smart homes
More and more smart connected things and services turn our homes into smart environments. They promise comfort, efficiency and security. These devices often integrate simple sensors, e.g. for temperature, light or humidity. However, these smart yet simple sensors can pose a sincere privacy risk. The sensor data enables sense-making of home attendance, domestic activities and even health conditions, a fact that users and developers are often unaware of or do not know how to address. Nevertheless, not all is lost or evil. This article makes a plea for how we, the ThingsCon community, might rethink smart connected things and services in our homes. We illustrate this with approaches and research projects that we initiated.
comment: State of Responsible Technology 2025 - Generative Things. pp 24-30. Stichting ThingsCon Amsterdam
Adaptive Optimal Control for Avatar-Guided Motor Rehabilitation in Virtual Reality
A control-theoretic framework for autonomous avatar-guided rehabilitation in virtual reality, based on interpretable, adaptive motor guidance through optimal control, is presented. The framework addresses critical challenges in motor rehabilitation related to accessibility, cost, and continuity of care, with over 50% of patients unable to attend regular clinic sessions. The system enables post-stroke patients to undergo personalized therapy in immersive virtual reality at home, while being monitored by clinicians. The core is a nonlinear, human-in-the-loop control strategy, where the avatar adapts in real time to the patient's performance. Balance between following the patient's movements and guiding them to ideal kinematic profiles based on the Hogan minimum-jerk model is achieved through multi-objective optimal control. A data-driven "ability index" uses smoothness metrics to dynamically adjust control gains according to the patient's progress. The system was validated through simulations and preliminary trials, and shows potential for delivering adaptive, engaging and scalable remote physiotherapy guided by interpretable control-theoretic principles.
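The minimum-jerk reference profile mentioned in the abstract above has a well-known closed form for a point-to-point movement that starts and ends at rest. The sketch below evaluates that polynomial in one dimension; the function name and the clamping of time to the movement interval are choices made here for illustration, and the paper's controller, cost weights, and ability index are not modelled.

```python
def minimum_jerk(x0, xf, t, duration):
    """Position at time t on a minimum-jerk path from x0 to xf.

    Uses the standard quintic s(tau) = 10 tau^3 - 15 tau^4 + 6 tau^5,
    with normalized time tau = t / duration clamped to [0, 1].
    """
    tau = min(max(t / duration, 0.0), 1.0)
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return x0 + (xf - x0) * s

# Sample a 2-second reach from 0.0 m to 0.3 m at five evenly spaced times.
samples = [minimum_jerk(0.0, 0.3, 0.5 * i, 2.0) for i in range(5)]
```

The profile is symmetric about the midpoint and has zero velocity and acceleration at both ends, which is what makes it a natural "ideal" target for smooth reaching movements.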
ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation
People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.
comment: 24 pages, 10 figures
Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation
We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.
Exploring Community-Powered Conversational Agent for Health Knowledge Acquisition: A Case Study in Colorectal Cancer
Online communities have become key platforms where young adults actively seek and share information, including health knowledge. However, these users often face challenges when browsing these communities, such as fragmented content, varying information quality and unfamiliar terminology. Based on a survey with 56 participants and follow-up interviews, we identify common challenges and expected features for learning health knowledge. In this paper, we develop a computational workflow that integrates community content into a conversational agent named CanAnswer to facilitate health knowledge acquisition. Using colorectal cancer as a case study, we evaluate CanAnswer through a lab study with 24 participants and interviews with six medical experts. Results show that CanAnswer improves recall of the knowledge gained and reduces the task workload of the learning session. Our expert interviews (N=6) further confirm the reliability and usefulness of CanAnswer. We discuss the generality of CanAnswer and provide design considerations for enhancing the usefulness and credibility of community-powered learning tools.
An Efficient Interaction Human-AI Synergy System Bridging Visual Awareness and Large Language Model for Intensive Care Units
Intensive Care Units (ICUs) are critical environments characterized by high-stakes monitoring and complex data management. However, current practices often rely on manual data transcription and fragmented information systems, introducing potential risks to patient safety and operational efficiency. To address these issues, we propose a human-AI synergy system based on a cloud-edge-end architecture, which integrates visual-aware data extraction and semantic interaction mechanisms. Specifically, a visual-aware edge module non-invasively captures real-time physiological data from bedside monitors, reducing manual entry errors. To improve accessibility to fragmented data sources, a semantic interaction module, powered by a Large Language Model (LLM), enables physicians to perform efficient and intuitive voice-based queries over structured patient data. The hierarchical cloud-edge-end deployment ensures low-latency communication and scalable system performance. Our system reduces the cognitive burden on ICU nurses and physicians and demonstrates promising potential for broader applications in intelligent healthcare systems.
comment: This paper has been accepted by the Late Breaking Papers of the 2025 International Conference on Human Computer Interaction (HCII 2025)
Advancing Research via Human-AI Interactive Theorem Proving
We investigate how large language models can be used as research tools in scientific computing while preserving mathematical rigor. We propose a human-in-the-loop workflow for interactive theorem proving and discovery with LLMs. Human experts retain control over problem formulation and admissible assumptions, while the model searches for proofs or contradictions, proposes candidate properties and theorems, and helps construct structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts treat these outputs as raw material, further refine them, and organize the results into precise statements and rigorous proofs. We instantiate this workflow in a case study on the connection between manifold optimization and Grover's quantum search algorithm, where the pipeline helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method. The framework provides a practical template for integrating large language models into frontier mathematical research, enabling faster exploration of proof space and algorithm design while maintaining transparent reasoning responsibilities. Although illustrated on manifold optimization problems in quantum computing, the principles extend to other core areas of scientific computing.
Enhancing Large Language Models for End-to-End Circuit Analysis Problem Solving
Large language models (LLMs) have shown strong performance in data-rich domains such as programming, but their reliability in engineering tasks remains limited. Circuit analysis -- requiring multimodal understanding and precise mathematical reasoning -- highlights these challenges. Although Gemini 2.5 Pro improves diagram interpretation and analog-circuit reasoning, it still struggles to consistently produce correct solutions when given both text and circuit diagrams. At the same time, engineering education needs scalable AI tools capable of generating accurate solutions for tasks such as automated homework feedback and question-answering. This paper presents an enhanced, end-to-end circuit problem solver built on Gemini 2.5 Pro. We first benchmark Gemini on a representative set of undergraduate circuit problems and identify two major failure modes: 1) circuit-recognition hallucinations, particularly incorrect source polarity detection, and 2) reasoning-process hallucinations, such as incorrect current directions. To address recognition errors, we integrate a fine-tuned YOLO detector and OpenCV processing to isolate voltage and current sources, enabling Gemini to re-identify source polarities from cropped images with near-perfect accuracy. To reduce reasoning errors, we introduce an ngspice-based verification loop in which Gemini generates a .cir file, ngspice simulates the circuit, and discrepancies trigger iterative regeneration with optional human-in-the-loop review. Across 83 problems, the proposed pipeline achieves a 97.59% success rate (81 correct solutions), substantially outperforming Gemini 2.5 Pro's original 79.52% accuracy. This system extends LLM capabilities for multimodal engineering problem-solving and supports the creation of high-quality educational datasets and AI-powered instructional tools.
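The headline percentages in the abstract above can be sanity-checked with a few lines of arithmetic. The count of 81 correct solutions out of 83 is stated in the abstract; the baseline count of 66 is inferred here from the reported 79.52% and is an assumption, not a figure given by the authors.

```python
TOTAL_PROBLEMS = 83
PIPELINE_CORRECT = 81          # stated in the abstract
BASELINE_ACCURACY = 79.52      # percent, stated in the abstract

# Success rate of the proposed pipeline, rounded to two decimals.
pipeline_rate = round(100 * PIPELINE_CORRECT / TOTAL_PROBLEMS, 2)

# Inferred (not stated) number of problems the baseline solved.
baseline_correct = round(BASELINE_ACCURACY / 100 * TOTAL_PROBLEMS)
baseline_rate = round(100 * baseline_correct / TOTAL_PROBLEMS, 2)

# Additional problems solved by the pipeline under that inference.
extra_solved = PIPELINE_CORRECT - baseline_correct
```

Under this reading, the pipeline solves 15 more of the 83 problems than the unmodified model, and both reported percentages are consistent with whole-number counts.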
Dark Personality Traits and Online Toxicity: Linking Self-Reports to Reddit Activity
Dark personality traits have been linked to online misbehavior such as trolling, incivility, and toxic speech. Yet the relationship between these traits and actual online conduct remains understudied. Here we investigate the associations between dark traits, online toxicity, and the socio-linguistic characteristics of online user activity. To explore this relationship, we developed a Web application that links validated psychological questionnaires completed by Amazon Mechanical Turk users to their Reddit activity data. This allowed us to collect nearly 57K Reddit comments, comprising 2.2M tokens and 152.7K sentences from 114 users, which we systematically represent through 224 linguistic and behavioral features. We then examined their relationship to questionnaire-based trait measures via multiple correlation analyses. Among our findings is that dark traits primarily influence the production rather than the perception of online incivility. Sadistic and psychopathic tendencies are most strongly associated with overtly toxic language, whereas other dark dispositions manifest more subtly, often eluding simple textual proxies. Self-reported engagement in hostile behavior mirrors actual online activity, while existing hand-crafted textual proxies for dark triad traits show limited correspondence with our validated measures. Finally, bright and dark traits interact in nuanced ways, with extraversion reducing trolling tendencies and conscientiousness showing modest associations with entitlement and callousness. These findings deepen understanding of how personality shapes toxic online behavior and highlight both opportunities and challenges for developing reliable computational tools and targeted, effective moderation strategies.
Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models
We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.
comment: Accepted as a full research paper for the 16th International Conference on Learning Analytics and Knowledge (LAK'26)
Linear socio-demographic representations emerge in Large Language Models from indirect cues
We investigate how LLMs encode sociodemographic attributes of human conversational partners inferred from indirect cues such as names and occupations. We show that LLMs develop linear representations of user demographics within activation space, wherein stereotypically associated attributes are encoded along interpretable geometric directions. We first probe residual streams across layers of four open transformer-based LLMs (Magistral 24B, Qwen3 14B, GPT-OSS 20B, OLMo2-1B) prompted with explicit demographic disclosure. We show that the same probes predict demographics from implicit cues: names activate census-aligned gender and race representations, while occupations trigger representations correlated with real-world workforce statistics. These linear representations allow us to explain demographic inferences implicitly formed by LLMs during conversation. We demonstrate that these implicit demographic representations actively shape downstream behavior, such as career recommendations. Our study further highlights that models that pass bias benchmark tests may still harbor and leverage implicit biases, with implications for fairness when applied at scale.
Mind the Gap! Pathways Towards Unifying AI Safety and Ethics Research
While much research in artificial intelligence (AI) has focused on scaling capabilities, the accelerating pace of development makes countervailing work on producing harmless, "aligned" systems increasingly urgent. Yet research on alignment has diverged along two largely parallel tracks: safety--centered on scaled intelligence, deceptive or scheming behaviors, and existential risk--and ethics--focused on present harms, the reproduction of social bias, and flaws in production pipelines. Although both communities warn of insufficient investment in alignment, they disagree on what alignment means or ought to mean. As a result, their efforts have evolved in relative isolation, shaped by distinct methodologies, institutional homes, and disciplinary genealogies. We present a large-scale, quantitative study showing the structural split between AI safety and AI ethics. Using a bibliometric and co-authorship network analysis of 6,442 papers from twelve major ML and NLP conferences (2020-2025), we find that over 80% of collaborations occur within either the safety or ethics communities, and cross-field connectivity is highly concentrated: roughly 5% of papers account for more than 85% of bridging links. Removing a small number of these brokers sharply increases segregation, indicating that cross-disciplinary exchange depends on a handful of actors rather than broad, distributed collaboration. These results show that the safety-ethics divide is not only conceptual but institutional, with implications for research agendas, policy, and venues. We argue that integrating technical safety work with normative ethics--via shared benchmarks, cross-institutional venues, and mixed-method methodologies--is essential for building AI systems that are both robust and just.
comment: Accepted for presentation at IASEAI 2026
A survey on the impacts of recommender systems on users, items, and human-AI ecosystems
Recommendation systems and assistants (in short, recommenders) influence most actions of our daily lives through online platforms, suggesting items or providing solutions based on users' preferences or requests. This survey systematically reviews, categorises, and discusses the impact of recommenders in four human-AI ecosystems -- social media, online retail, urban mapping and generative AI ecosystems. Its scope is to systematise a fast-growing field in which terminologies employed to classify methodologies and outcomes are fragmented and unsystematic. This is a crucial contribution to the literature because terminologies vary substantially across disciplines and ecosystems, hindering comparison and accumulation of knowledge in the field. We follow the customary steps of qualitative systematic review, gathering 154 articles from different disciplines to develop a parsimonious taxonomy of methodologies employed (empirical, simulation, observational, controlled), outcomes observed (concentration, content degradation, discrimination, diversity, echo chamber, filter bubble, homogenisation, polarisation, radicalisation, volume), and their level of analysis (individual, item, and ecosystem). We systematically discuss substantive and methodological commonalities across ecosystems, and highlight potential avenues for future research. The survey is addressed to scholars and practitioners interested in different human-AI ecosystems, policymakers and institutional stakeholders who want to understand better the measurable outcomes of recommenders, and tech companies who wish to obtain a systematic view of the impact of their recommenders.
Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support AAMAS 2026
LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. Taken together, these directions shift MAS research from building oracle-like answer engines to cultivating AI teammates that co-reason with their human partners over the causal structure of shared decisions, advancing the design of effective human-AI teams.
comment: Under review at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Blue Sky Track
Perceptually Uniform Construction of Illustrative Textures
Illustrative textures, such as stippling or hatching, were predominantly used as an alternative to conventional Phong rendering. Recently, the potential of encoding information on surfaces or maps using different densities has also been recognized. This has the significant advantage that additional color can be used as another visual channel and the illustrative textures can then be overlaid. Effectively, it is thus possible to display multiple pieces of information, such as two different scalar fields, on surfaces simultaneously. In previous work, these textures were manually generated and the choice of density was not determined empirically. Here, we first want to determine and understand the perceptual space of illustrative textures. We chose a succession of simplices with increasing dimensions as primitives for our textures: dots, lines, and triangles. Thus, we explore the texture types of stippling, hatching, and triangles. We create a range of textures by sampling the density space uniformly. Then, we conduct three perceptual studies in which the participants performed pairwise comparisons for each texture type. We use multidimensional scaling (MDS) to analyze the perceptual spaces per category. The perception of stippling and triangles seems relatively similar. Both are adequately described by a 1D manifold in 2D space. The perceptual space of hatching consists of two main clusters: crosshatched textures, and textures with only one hatching direction. However, the perception of hatching textures with only one hatching direction is similar to the perception of stippling and triangles. Based on our findings, we construct perceptually uniform illustrative textures. Afterwards, we provide concrete application examples for the constructed textures.
comment: 11 pages, 15 figures, accepted to IEEE TVCG. This version corrects axis labels in Figs 12 & 13 and the description of the inverse sigmoid function application. The supplementary material is available at https://www.doi.org/10.17605/OSF.IO/43C6V
An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation
Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have increasingly been used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for offline, smartphone-based solutions. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed to provide mental health and emotional support. EmoSApp leverages a language model, specifically LLaMA-3.2-1B-Instruct, which is fine-tuned and quantized on a custom-curated "Knowledge Dataset" comprising 14,582 mental health QA pairs along with multi-turn conversational data, enabling robust domain expertise and fully on-device inference on resource-constrained smartphones. Through qualitative evaluation with students and mental health professionals, we demonstrate that EmoSApp has the ability to respond coherently and empathetically, provide relevant suggestions for users' mental health problems, and maintain interactive dialogue. Additionally, quantitative evaluations on nine commonsense and reasoning benchmarks, along with two mental-health-specific datasets, demonstrate EmoSApp's effectiveness in low-resource settings. By prioritizing on-device deployment and specialized domain-specific adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health support.
SensHRPS: Sensing Comfortable Human-Robot Proxemics and Personal Space With Eye-Tracking
Social robots must adjust to human proxemic norms to ensure user comfort and engagement. While prior research demonstrates that eye-tracking features reliably estimate comfort in human-human interactions, their applicability to interactions with humanoid robots remains unexplored. In this study, we investigate user comfort with the robot "Ameca" across four experimentally controlled distances (0.5 m to 2.0 m) using mobile eye-tracking and subjective reporting (N=19). We evaluate multiple machine learning and deep learning models to estimate comfort based on gaze features. Contrary to previous human-human studies where Transformer models excelled, a Decision Tree classifier achieved the highest performance (F1-score = 0.73), with minimum pupil diameter identified as the most critical predictor. These findings suggest that physiological comfort thresholds in human-robot interaction differ from human-human dynamics and can be effectively modeled using interpretable logic.
HEDN: A Hard-Easy Dual Network with Source Reliability Assessment for Cross-Subject EEG Emotion Recognition
Cross-subject electroencephalography (EEG) emotion recognition remains a major challenge in brain-computer interfaces (BCIs) due to substantial inter-subject variability. Multi-Source Domain Adaptation (MSDA) offers a potential solution, but existing MSDA frameworks typically assume equal source quality, leading to negative transfer from low-reliability domains and prohibitive computational overhead due to multi-branch model designs. To address these limitations, we propose the Hard-Easy Dual Network (HEDN), a lightweight reliability-aware MSDA framework. HEDN introduces a novel Source Reliability Assessment (SRA) mechanism that dynamically evaluates the structural integrity of each source domain during training. Based on this assessment, sources are routed to two specialized branches: an Easy Network that exploits high-quality sources to construct fine-grained, structure-aware prototypes for reliable pseudo-label generation, and a Hard Network that utilizes adversarial training to refine and align low-quality sources. Furthermore, a cross-network consistency loss aligns predictions between branches to preserve semantic coherence. Extensive experiments conducted on SEED, SEED-IV, and DEAP datasets demonstrate that HEDN achieves state-of-the-art performance across both cross-subject and cross-dataset evaluation protocols while reducing adaptation complexity.
comment: This work has been submitted to the IEEE for possible publication
CardioLive: Empowering Video Streaming with Online Cardiac Monitoring
Online Cardiac Monitoring (OCM) emerges as a compelling enhancement for next-generation video streaming platforms. It enables various applications including remote health, online affective computing, and deepfake detection. Yet the physiological information encapsulated in video streams has long been neglected. In this paper, we present the design and implementation of CardioLive, the first online cardiac monitoring system in video streaming platforms. We leverage the naturally co-existing video and audio streams and devise CardioNet, the first audio-visual network to learn the cardiac series. It incorporates multiple unique designs to extract temporal and spectral features, ensuring robust performance under realistic video streaming conditions. To enable Service-On-Demand online cardiac monitoring, we implement CardioLive as a plug-and-play middleware service and develop systematic solutions to practical issues including changing FPS and unsynchronized streams. Extensive experiments demonstrate the effectiveness of our system. We achieve a Mean Absolute Error (MAE) of 1.79 BPM, outperforming the video-only and audio-only solutions by 69.2% and 81.2%, respectively. Our CardioLive service achieves average throughputs of 115.97 and 98.16 FPS when implemented in Zoom and YouTube. We believe our work opens up new applications for video stream systems. We will release the code soon.
comment: Preprint
Persona-based Multi-Agent Collaboration for Brainstorming
We demonstrate the importance of persona-based multi-agent brainstorming for both diverse topics and subject matter ideation. Prior work has shown that generalized multi-agent collaboration often provides better reasoning than a single agent alone. In this paper, we propose and develop a framework for persona-based agent selection, showing how persona domain curation can improve brainstorming outcomes. Using multiple experimental setups, we evaluate brainstorming outputs across different persona pairings (e.g., Doctor vs VR Engineer) and A2A (agent-to-agent) dynamics (separate, together, separate-then-together). Our results show that (1) persona choice shapes idea domains, (2) collaboration mode shifts diversity of idea generation, and (3) multi-agent persona-driven brainstorming produces idea depth and cross-domain coverage.
comment: 12 pages, 8 figures
Studying the Effects of Collaboration in Interactive Theme Discovery Systems
NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.
comment: Added author in pre-print
HOLE: Homological Observation of Latent Embeddings for Neural Network Interpretability
Deep learning models have achieved remarkable success across various domains, yet their learned representations and decision-making processes remain largely opaque and hard to interpret. This work introduces HOLE (Homological Observation of Latent Embeddings), a method for analyzing and interpreting deep neural networks through persistent homology. HOLE extracts topological features from neural activations and presents them using a suite of visualization techniques, including Sankey diagrams, heatmaps, dendrograms, and blob graphs. These tools facilitate the examination of representation structure and quality across layers. We evaluate HOLE on standard datasets using a range of discriminative models, focusing on representation quality, interpretability across layers, and robustness to input perturbations and model compression. The results indicate that topological analysis reveals patterns associated with class separation, feature disentanglement, and model robustness, providing a complementary perspective for understanding and improving deep learning systems.
JEEVHITAA -- An End-to-End HCAI System to Support Collective Care
Current mobile health platforms are predominantly individual-centric and lack the necessary primitives for coordinated, auditable, multi-actor workflows. However, in many settings worldwide, health decisions are enacted by multi-actor care networks rather than single users. We present JEEVHITAA, an Android/Flutter system that provides context-sensitive, role-aware sharing and verifiable information flows for care circles. JEEVHITAA ingests platform and device data (via Google Health Connect and BLE connectors), constructs multi-layer user profiles from sensor streams and tiered onboarding, and enforces fine-grained, time-bounded access control across permissioned care graphs. Data are end-to-end encrypted in local stores and during peer sync (Firebase), and provisions are made for document capture by camera or upload as PDF. An integrated retrieval-augmented LLM pipeline (i) produces structured, role-targeted summaries and action plans, (ii) enables users to gather advanced insights on health reports, and (iii) performs evidence-grounded user-relevant verification of arbitrary health content, returning provenance, confidence scores, and source citations. We describe the system architecture, connector abstractions, and security primitives, and evaluate robustness and compatibility using synthetic, ontology-driven simulations and vendor compatibility tests. Finally, we outline plans for longitudinal in-the-wild deployments to measure system performance, the correctness of access control, and the real-world effectiveness of relationship-aware credibility support.
comment: 14 pages, 4 figures, 6 tables, 2 algorithms, Preprint of ongoing work. Longer Pilot work underway
Computer Vision and Pattern Recognition
Astra: General Interactive World Model with Autoregressive Denoising
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames and to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically routes heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
comment: Code is available at: https://github.com/EternalEvan/Astra
Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
comment: Project Page: https://denghilbert.github.io/selfi/
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
comment: Project Page: https://d4rt-paper.github.io/
Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance in the TAIR task.
LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it to object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performance while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.
comment: Preprint. 12 pages, 9 figures. Project page: https://simondemoreau.github.io/LiDAS/
Self-Evolving 3D Scene Generation from a Single Image
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages--Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation--EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation
Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned by element types, sizes, or their relationships. Current methods either address only a subset of these tasks or necessitate separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff, a Unified Diffusion Transformer that, for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints by fine-tuning the model with LoRA after pretraining it on the other tasks. Such a schema not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance across tasks ranging from unconditional to various conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/
comment: Project webpage: https://glab-caltech.github.io/valor/
Accelerated Rotation-Invariant Convolution for UAV Image Segmentation
Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles. Across extensive benchmarks, the proposed convolution achieves 20-55% faster training and 15-45% lower energy consumption than CUDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on 256×256 inputs, and 32% speedup and 23% lower energy usage on 1024×1024 inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning
Image captioning is essential in many fields, including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRIs). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training in resource-constrained scenarios.
comment: 6 pages
Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference AAAI 2026
Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
comment: 6 pages, 3 figures. Code and data: https://github.com/Amiton7/Tri-Bench. Accepted to the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Generation is Required for Data-Efficient Perception
It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today's most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.
comment: Preprint
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
comment: 16 pages, 8 figures
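The window-size limitation mentioned in the InfiniteVL abstract can be illustrated with a generic causal sliding-window attention mask. This is a minimal sketch of plain SWA masking, not InfiniteVL's implementation; the function name `sliding_window_mask` and its parameters are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """Build a causal sliding-window attention mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. j is not in the future (j <= i) and lies within the last `window`
    positions (i - j < window). Names here are illustrative assumptions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# With a window of 2, the last token of a 5-token sequence can only see
# positions 3 and 4; anything earlier is outside the window.
mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `window` keys, so per-token cost stays constant, but information older than the window has to reach the current token through some other path, which is the role the abstract assigns to the linear-attention component.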
Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning
Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
comment: Accepted in IEEE Transactions on Multimedia (TMM)
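For readers unfamiliar with the Poincaré ball model named in the T-DHA abstract, the sketch below computes the standard geodesic distance between two points in the unit ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). It illustrates the geometry only and is not the T-DHA adapter; the function name `poincare_distance` is a hypothetical choice.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball.

    Uses the standard closed form
        d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    The function name and list-of-floats interface are illustrative assumptions.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean step of 0.09 covers far more hyperbolic distance near
# the boundary than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.09, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.99, 0.0])
print(near_origin, near_boundary)
```

Because distances stretch toward the boundary, there is exponentially more room at larger radii, which is why tree-like hierarchies, whose node counts grow exponentially with depth, are commonly embedded in this model.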
MatteViT: High-Frequency-Aware Document Shadow Removal with Shadow Matte Guidance
Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
comment: 10 pages, 7 figures, 5 tables
LoFA: Learning to Predict Personalized Priors for Fast Adaptation of Visual Generative Models
Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting relative distribution patterns that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing. Project page: https://jaeger416.github.io/lofa/.
comment: Project page: https://jaeger416.github.io/lofa/
Refining Visual Artifacts in Diffusion Models via Explainable AI-based Flaw Activation Maps
Diffusion models have achieved remarkable success in image synthesis. However, addressing artifacts and unrealistic regions remains a critical challenge. We propose self-refining diffusion, a novel framework that enhances image generation quality by detecting these flaws. The framework employs an explainable artificial intelligence (XAI)-based flaw highlighter to produce flaw activation maps (FAMs) that identify artifacts and unrealistic regions. These FAMs improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on these regions during the reverse process. The proposed approach achieves up to a 27.3% improvement in Fréchet inception distance across various diffusion-based models, demonstrating consistently strong performance on diverse datasets. It also shows robust effectiveness across different tasks, including image generation, text-to-image generation, and inpainting. These results demonstrate that explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The proposed framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing the field of image synthesis.
comment: 10 pages, 9 figures, 7 tables
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame's features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro's commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made publicly available.
comment: NeurIPS 2025. Code and data available at https://github.com/ali-vilab/Wan-Move
Skewness-Guided Pruning of Multimodal Swin Transformers for Federated Skin Lesion Classification on Edge Devices
In recent years, high-performance computer vision models have achieved remarkable success in medical imaging, with some skin lesion classification systems even surpassing dermatology specialists in diagnostic accuracy. However, such models are computationally intensive and large in size, making them unsuitable for deployment on edge devices. In addition, strict privacy constraints hinder centralized data management, motivating the adoption of Federated Learning (FL). To address these challenges, this study proposes a skewness-guided pruning method that selectively prunes the Multi-Head Self-Attention and Multi-Layer Perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions. The proposed method was validated in a horizontal FL environment and shown to maintain performance while substantially reducing model complexity. Experiments on the compact Swin Transformer demonstrate approximately 36% model size reduction with no loss in accuracy. These findings highlight the feasibility of achieving efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices.
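A minimal sketch of the statistic this abstract relies on: the Fisher-Pearson sample skewness g1 = m3 / m2^1.5 of a layer's outputs. The abstract does not state the pruning rule, so the rule used here (prune layers whose absolute skewness falls below a threshold) is purely an assumption, as are the layer names, the threshold, and the toy activations.

```python
import math

def skewness(values):
    """Fisher-Pearson sample skewness g1 = m3 / m2**1.5 (0.0 for constant input)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def select_layers_to_prune(layer_outputs, threshold):
    """Return names of layers whose |skewness| is below `threshold`.

    ASSUMPTION: the direction of this rule is illustrative only; the
    abstract does not say which skewness values mark a layer as prunable.
    """
    return sorted(name for name, out in layer_outputs.items()
                  if abs(skewness(out)) < threshold)

# Hypothetical flattened outputs for two layers.
outputs = {
    "block0.mlp": [-2.0, -1.0, 0.0, 1.0, 2.0],   # symmetric distribution
    "block1.attn": [0.0, 0.0, 0.0, 0.1, 5.0],    # heavy right tail
}
pruned = select_layers_to_prune(outputs, threshold=0.5)
```

In a federated setting, a statistic like this could be computed locally on each client, since it needs only layer outputs and no raw data exchange; how the paper aggregates it is not specified in the abstract.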
A Scalable Pipeline Combining Procedural 3D Graphics and Guided Diffusion for Photorealistic Synthetic Training Data Generation in White Button Mushroom Segmentation
Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting. However, developing accurate detection and segmentation models requires large, precisely annotated datasets that are costly to produce. Synthetic data provides a scalable alternative, yet often lacks sufficient realism to generalize to real-world scenarios. This paper presents a novel workflow that integrates 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus Bisporus mushrooms. This approach preserves full control over 3D scene configuration and annotations while achieving photorealism without the need for specialized computer graphics expertise. We release two synthetic datasets (each containing 6,000 images depicting over 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. When tested on two independent real-world datasets (including a newly collected benchmark), our method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K), despite using only synthetic training data. Although the approach is demonstrated on Agaricus Bisporus mushrooms, the proposed pipeline can be readily adapted to other mushroom species or to other agricultural domains, such as fruit and leaf detection.
comment: 20 pages, 8 figures
Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture AACL
Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
comment: To appear at AACL-IJCNLP 2025 Workshop WSLP
Mitigating Individual Skin Tone Bias in Skin Lesion Classification through Distribution-Aware Reweighting
Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at the individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.
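To make the idea of treating skin tone as a continuous attribute concrete, here is a small sketch that fits a 1-D Gaussian KDE and derives per-sample weights from it. It uses plain inverse-density weighting as a simpler stand-in; it is not the paper's DRW loss, which is built on statistical distances between distributions. The tone scale in [0, 1], the bandwidth, and the function names are all assumptions.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of 1-D `samples` with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def inverse_density_weights(samples, bandwidth):
    """Per-sample weights proportional to 1/density, normalized to mean 1.

    ASSUMPTION: inverse-density weighting is an illustrative substitute,
    not the distance-based reweighting proposed in the abstract.
    """
    density = gaussian_kde(samples, bandwidth)
    raw = [1.0 / density(s) for s in samples]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]

# Hypothetical skin tone values: five samples in a dense region, one isolated.
tones = [0.80, 0.82, 0.85, 0.83, 0.81, 0.20]
weights = inverse_density_weights(tones, bandwidth=0.05)
```

Because the weights depend on each sample's own position in the estimated density rather than on a group label, an outlier inside a nominal subgroup is upweighted individually, which categorical reweighting cannot do.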
SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
Multi-domain performance analysis with scores tailored to user preferences
The performance of algorithms, methods, and models tends to depend heavily on the distribution of cases to which they are applied, this distribution being specific to the applicative domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). It appears that the corresponding weighted mean is known to be the summarization, and that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and the weights to consider in the arithmetic mean depend on the user preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
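The averaging described above can be checked numerically on a toy two-class case. The sketch below mixes two normalized confusion matrices and compares a score of the mixture with a weighted arithmetic mean of the per-domain scores. Accuracy and precision are used only as familiar examples; the matrices, the domain weights, and the helper names are assumptions, and the paper's ranking-score family is not implemented here.

```python
def summarize(performances, weights):
    """Weighted mean of normalized confusion matrices (dicts over tn, fp, fn, tp)."""
    return {k: sum(w * p[k] for p, w in zip(performances, weights))
            for k in ("tn", "fp", "fn", "tp")}

def accuracy(p):
    return p["tn"] + p["tp"]

def precision(p):
    return p["tp"] / (p["tp"] + p["fp"])

# Hypothetical per-domain performances (each sums to 1) and domain weights.
domain_a = {"tn": 0.50, "fp": 0.10, "fn": 0.10, "tp": 0.30}
domain_b = {"tn": 0.20, "fp": 0.05, "fn": 0.25, "tp": 0.50}
domains, weights = [domain_a, domain_b], [0.25, 0.75]
overall = summarize(domains, weights)

# Accuracy: the averaging weights are the domain weights themselves.
acc_mean = sum(w * accuracy(p) for p, w in zip(domains, weights))

# Precision: still a weighted arithmetic mean, but each domain weight is
# rescaled by that domain's rate of positive predictions (tp + fp).
adj = [w * (p["tp"] + p["fp"]) for p, w in zip(domains, weights)]
prec_mean = sum(a * precision(p) for p, a in zip(domains, adj)) / sum(adj)
```

The two cases show that the weights entering the arithmetic mean depend on the chosen score: a domain that makes many positive predictions counts for more under precision than its nominal weight suggests.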
Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth
Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme--traditionally used in classification--with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
comment: Accepted at IEEE Robotics and Automation Letters (RA-L) 2026
What really matters for person re-identification? A Mixture-of-Experts Framework for Semantic Attribute Importance
State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at https://github.com/psaltaath/MoSAIC-ReID
Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds
Most existing self-supervised learning (SSL) approaches for 3D point clouds are dominated by generative methods based on Masked Autoencoders (MAE). However, these generative methods have been proven to struggle to capture high-level discriminative features effectively, leading to poor performance on linear probing and other downstream tasks. In contrast, contrastive methods excel in discriminative feature representation and generalization ability on image data. Despite this, contrastive learning (CL) in 3D data remains scarce. Besides, simply applying CL methods designed for 2D data to 3D fails to effectively learn 3D local details. To address these challenges, we propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework. Specifically, we apply masking to the center and surrounding parts separately, constructing dual-branch inputs with center-biased and surrounding-biased representations to better capture rich geometric information. Meanwhile, we introduce a patch-level contrastive loss to further enhance both high-level information and local sensitivity. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods; under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, our method attains state-of-the-art results, even surpassing cross-modal approaches. In particular, under the MLP-LINEAR protocol, our method outperforms the baseline (Point-MAE) by 7.9%, 6.7%, and 10.3% on the three variants of ScanObjectNN, respectively. The code will be made publicly available.
comment: 16 pages, 6 figures
Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank
The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersed-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose Repulsor, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. Repulsor offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, Repulsor achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.
comment: 19 pages, 19 figures
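The memory bank idea in this abstract, a queue of negatives that persists across iterations so that the number of negatives is decoupled from the batch size, can be sketched in a few lines. The abstract only says "a contrastive objective"; the InfoNCE loss, the temperature, the `MemoryBank` class, and the toy 2-D embeddings below are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import deque

class MemoryBank:
    """FIFO queue of past embeddings used as negatives."""
    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)  # oldest entries are evicted first

    def enqueue(self, embeddings):
        self.queue.extend(embeddings)

    def negatives(self):
        return list(self.queue)

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor against one positive and queued negatives."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

bank = MemoryBank(capacity=4)
for step in range(3):  # three "iterations" with a batch of two embeddings each
    batch = [[1.0, float(step)], [-1.0, float(step) + 0.5]]
    bank.enqueue(batch)
loss = info_nce([0.0, 1.0], [0.1, 1.0], bank.negatives())
```

After three batches of two, the bank holds only the four most recent embeddings, so the pool of negatives is set by `capacity` rather than by the batch size.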
C-DIRA: Computationally Efficient Dynamic ROI Routing and Domain-Invariant Adversarial Learning for Lightweight Driver Behavior Recognition
Driver distraction behavior recognition using in-vehicle cameras demands real-time inference on edge devices. However, lightweight models often fail to capture fine-grained behavioral cues, resulting in reduced performance on unseen drivers or under varying conditions. ROI-based methods also increase computational cost, making it difficult to balance efficiency and accuracy. This work addresses the need for a lightweight architecture that overcomes these constraints. We propose Computationally efficient Dynamic region of Interest Routing and domain-invariant Adversarial learning for lightweight driver behavior recognition (C-DIRA). The framework combines saliency-driven Top-K ROI pooling and fused classification for local feature extraction and integration. Dynamic ROI routing enables selective computation by applying ROI inference only to high difficulty data samples. Moreover, pseudo-domain labeling and adversarial learning are used to learn domain-invariant features robust to driver and background variation. Experiments on the State Farm Distracted Driver Detection Dataset show that C-DIRA maintains high accuracy with significantly fewer FLOPs and lower latency than prior lightweight models. It also demonstrates robustness under visual degradation such as blur and low-light, and stable performance across unseen domains. These results confirm C-DIRA's effectiveness in achieving compactness, efficiency, and generalization.
Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.
comment: 19 pages, 13 figures
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
comment: Under Review, 12 pages, 9 figures
See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.
Trajectory Densification and Depth from Perspective-based Blur
In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capture, which induces perspective-based blur, especially in long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream and recovers a dense trajectory via a joint optical design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate the depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.
OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
comment: 8 pages, 4 figures
Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning AAAI2026
The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
comment: 14 pages, 8 figures, Association for the Advancement of Artificial Intelligence (AAAI2026, poster)
Automated Pollen Recognition in Optical and Holographic Microscopy Images
This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L for the classification task, evaluating their performance across imaging modalities. The models achieved 91.3% mAP50 for detection and 97% overall accuracy for classification on optical images, whereas the initial performance on greyscale holographic images was substantially lower. We addressed the performance gap issue through dataset expansion using automated labeling and bounding box area enlargement. These techniques, applied to holographic images, improved detection performance from 2.49% to 13.3% mAP50 and classification performance from 42% to 54%. Our work demonstrates that, at least for image classification tasks, it is possible to pair deep learning techniques with cost-effective lensless digital holographic microscopy devices.
comment: 08 pages, 10 figures, 04 tables, 20 references. Date of Conference: 13-14 June 2025 Date Added to IEEE Xplore: 10 July 2025 Electronic ISBN: 979-8-3315-0969-9 Print on Demand(PoD) ISBN: 979-8-3315-0970-5 DOI: 10.1109/AICCONF64766.2025.11064260 Conference Location: Prague, Czech Republic Online Access: https://ieeexplore.ieee.org/document/11064260
Disturbance-Free Surgical Video Generation from Multi-Camera Shadowless Lamps for Open Surgery
Video recordings of open surgeries are in high demand for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons' preferences for each option.
From Cells to Survival: Hierarchical Analysis of Cell Inter-Relations in Multiplex Microscopy for Lung Cancer Prognosis
The tumor microenvironment (TME) has emerged as a promising source of prognostic biomarkers. To fully leverage its potential, analysis methods must capture complex interactions between different cell types. We propose HiGINE -- a hierarchical graph-based approach to predict patient survival (short vs. long) from TME characterization in multiplex immunofluorescence (mIF) images and enhance risk stratification in lung cancer. Our model encodes both local and global inter-relations in cell neighborhoods, incorporating information about cell types and morphology. Multimodal fusion, aggregating cancer stage with mIF-derived features, further boosts performance. We validate HiGINE on two public datasets, demonstrating improved risk stratification, robustness, and generalizability.
comment: 5 pages, 3 figures
Instance-Aware Test-Time Segmentation for Continual Domain Shifts
Continual Test-Time Adaptation (CTTA) enables pre-trained models to adapt to continuously evolving domains. Existing methods have improved robustness but typically rely on fixed or batch-level thresholds, which cannot account for varying difficulty across classes and instances. This limitation is especially problematic in semantic segmentation, where each image requires dense, multi-class predictions. We propose an approach that adaptively adjusts pseudo labels to reflect the confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts. This fine-grained, class- and instance-aware adaptation produces more reliable supervision and mitigates error accumulation throughout continual adaptation. Extensive experiments across eight CTTA and TTA scenarios, including synthetic-to-real and long-term shifts, show that our method consistently outperforms state-of-the-art techniques, setting a new standard for semantic segmentation under evolving conditions.
Modular Neural Image Signal Processing
This paper presents a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images. Unlike prior neural ISP designs, our method introduces a high degree of modularity, providing full control over multiple intermediate stages of the rendering process. This modular design not only achieves high rendering accuracy but also improves scalability, debuggability, generalization to unseen cameras, and flexibility to match different user-preference styles. To demonstrate the advantages of this design, we built a user-interactive photo-editing tool that leverages our neural ISP to support diverse editing operations and picture styles. The tool is carefully engineered to take advantage of the high-quality rendering of our neural ISP and to enable unlimited post-editable re-rendering. Our method is a fully learning-based framework with variants of different capacities, all of moderate size (ranging from ~0.5 M to ~3.9 M parameters for the entire pipeline), and consistently delivers competitive qualitative and quantitative results across multiple test sets. Watch the supplemental video at: https://youtu.be/ByhQjQSjxVM
BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.
SSCATeR: Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling for Real-Time 3D Object Detection in LiDAR Point Clouds
This work leverages the continuous sweeping motion of LiDAR scanning to concentrate object detection efforts on specific regions that receive a change in point data from one frame to another. We achieve this by using a sliding time window with short strides and consider the temporal dimension by storing convolution results between passes. This allows us to ignore unchanged regions, significantly reducing the number of convolution operations per forward pass without sacrificing accuracy. This data reuse scheme introduces extreme sparsity to detection data. To exploit this sparsity, we extend our previous work on scatter-based convolutions to allow for data reuse, and as such propose Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). This operation treats incoming LiDAR data as a continuous stream and acts only on the changing parts of the point cloud. By doing so, we achieve the same results with as much as a 6.61-fold reduction in processing time. Our test results show that the feature maps output by our method are identical to those produced by traditional sparse convolution techniques, whilst greatly increasing the computational efficiency of the network.
comment: 22 Pages, 26 Figures, This work has been submitted to the IEEE Sensors Journal for possible publication
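The data-reuse idea in the abstract above (store convolution results between passes and act only on the changed parts of the input) can be sketched in one dimension. This is an illustrative toy under stated assumptions, not the SSCATeR algorithm itself: the helper names, the 1D layout, and the delta-based update rule are all hypothetical.

```python
def _scatter(out, i, value, kernel):
    """Scatter one input's contribution (value * weight) to its neighbouring outputs."""
    half = len(kernel) // 2
    for k, w in enumerate(kernel):
        j = i + k - half
        if 0 <= j < len(out):
            out[j] += value * w


def scatter_conv(values, kernel):
    """Full pass: every non-zero input scatters into the output buffer."""
    out = [0.0] * len(values)
    for i, v in enumerate(values):
        if v != 0.0:
            _scatter(out, i, v, kernel)
    return out


def recycle_update(cached_out, old_values, new_values, kernel):
    """Incremental pass: reuse the cached output and scatter only the changes.

    Returns the number of inputs that had to be processed. With general
    floating-point data the delta update can differ from a full pass by
    rounding; the values used here are exactly representable.
    """
    touched = 0
    for i, (old, new) in enumerate(zip(old_values, new_values)):
        if old != new:
            _scatter(cached_out, i, new - old, kernel)
            touched += 1
    return touched


kernel = [1.0, 2.0, 1.0]
frame_a = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0]
frame_b = [0.0, 0.0, 3.0, 0.0, 0.0, 2.0, 1.0, 0.0]  # one cell changed

cached = scatter_conv(frame_a, kernel)
touched = recycle_update(cached, frame_a, frame_b, kernel)
```

In this toy case only one input is processed for the second frame, and the recycled buffer matches a full recomputation of `frame_b`.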
An Iteration-Free Fixed-Point Estimator for Diffusion Inversion
Diffusion inversion aims to recover the initial noise corresponding to a given image such that this noise can reconstruct the original image through the denoising diffusion process. The key component of diffusion inversion is to minimize errors at each inversion step, thereby mitigating cumulative inaccuracies. Recently, fixed-point iteration has emerged as a widely adopted approach to minimize reconstruction errors at each inversion step. However, it suffers from high computational costs due to its iterative nature and the complexity of hyperparameter selection. To address these issues, we propose an iteration-free fixed-point estimator for diffusion inversion. First, we derive an explicit expression of the fixed point from an ideal inversion step. Unfortunately, it inherently contains an unknown data prediction error. Building upon this, we introduce the error approximation, which uses the calculable error from the previous inversion step to approximate the unknown error at the current inversion step. This yields a calculable, approximate expression for the fixed point, which is an unbiased estimator characterized by low variance, as shown by our theoretical analysis. We evaluate reconstruction performance on two text-image datasets, NOCAPS and MS-COCO. Compared to DDIM inversion and other inversion methods based on the fixed-point iteration, our method achieves consistent and superior performance in reconstruction tasks without additional iterations or training.
Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks
Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
comment: 22 pages, 2 tables, 9 figures
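A Thompson Sampling curriculum manager of the kind described above can be sketched with a Beta posterior per training zone. The abstract does not specify the reward definition or the posterior family, so the class name `ThompsonCurriculum`, the Beta-Bernoulli model, and the `reward_from` helper (which maps a correct answer with negative log-likelihood `nll` to `exp(-nll)`) are assumptions for illustration only.

```python
import math
import random


def reward_from(correct, nll):
    """Hypothetical reward: the confidence exp(-nll) if the agent was correct, else 0."""
    return math.exp(-nll) if correct else 0.0


class ThompsonCurriculum:
    """Beta-Bernoulli Thompson Sampling over a fixed set of training zones."""

    def __init__(self, num_zones, seed=0):
        self.alpha = [1.0] * num_zones  # pseudo-counts of success
        self.beta = [1.0] * num_zones   # pseudo-counts of failure
        self.rng = random.Random(seed)

    def choose_zone(self):
        """Draw one sample per zone from its posterior and pick the largest."""
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, zone, reward):
        """Fold a reward in [0, 1] into the chosen zone's posterior."""
        self.alpha[zone] += reward
        self.beta[zone] += 1.0 - reward
```

A training loop would alternate `zone = manager.choose_zone()`, an episode in that zone, and `manager.update(zone, reward_from(correct, nll))`, so zones where agents are both accurate and confident are sampled more often.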
A Novel Wasserstein Quaternion Generative Adversarial Network for Color Image Generation
Color image generation has a wide range of applications, but existing generative models ignore the correlation among color channels, which may lead to chromatic aberration problems. In addition, the data distribution of color images has not been systematically elaborated and explained, so a theory for measuring different color image datasets is still lacking. In this paper, we define a new quaternion Wasserstein distance and develop its dual theory. To deal with the quaternion linear programming problem, we derive the strong duality form with the help of the quaternion convex set separation theorem and the quaternion Farkas lemma. Using the quaternion Wasserstein distance, we propose a novel Wasserstein quaternion generative adversarial network. Experiments demonstrate that this novel model surpasses both the (quaternion) generative adversarial networks and the Wasserstein generative adversarial network in terms of generation efficiency and image quality.
Fast-ARDiff: An Entropy-informed Acceleration Framework for Continuous Space Autoregressive Generation
Autoregressive (AR)-diffusion hybrid paradigms combine AR's structured modeling with diffusion's photorealistic synthesis, yet suffer from high latency due to sequential AR generation and iterative denoising. In this work, we tackle this bottleneck and propose a unified AR-diffusion framework, Fast-ARDiff, that jointly optimizes both components, accelerating AR speculative decoding while simultaneously facilitating faster diffusion decoding. Specifically: (1) The entropy-informed speculative strategy encourages the draft model to produce higher-entropy representations aligned with the target model's entropy characteristics, mitigating entropy mismatch and high rejection rates caused by draft overconfidence. (2) For diffusion decoding, rather than treating it as an independent module, we integrate it into the same end-to-end framework using a dynamic scheduler that prioritizes AR optimization to guide the diffusion part in further steps. The diffusion part is optimized through a joint distillation framework combining trajectory and distribution matching, ensuring stable training and high-quality synthesis with extremely few steps. During inference, shallow feature entropy from the AR module is used to pre-filter low-entropy drafts, avoiding redundant computation and improving latency. Fast-ARDiff achieves state-of-the-art acceleration across diverse models: on ImageNet 256$\times$256, TransDiff attains 4.3$\times$ lossless speedup, and NextStep-1 achieves 3$\times$ acceleration on text-conditioned generation. Code will be available at https://github.com/aSleepyTree/Fast-ARDiff.
Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving the structural consistency with the 3D-native geometry. Our scheme is general to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
PaintFlow: A Unified Framework for Interactive Oil Paintings Editing and Generation
Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with a spatial alignment and semantic enhancement conditioning strategy, which maps masks and sketches into spatial constraints and encodes contextual embeddings from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil painting generation and editing.
comment: 14 pages
MVP: Multiple View Prediction Improves GUI Grounding
GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high resolution and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
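The aggregation step described above, selecting the centroid of the densest spatial cluster among per-view predictions, can be sketched as follows. The abstract does not say which clustering method is used, so this sketch assumes a simple fixed-radius neighbourhood count; the function name `densest_cluster_centroid` and the `radius` parameter are hypothetical.

```python
def densest_cluster_centroid(points, radius):
    """Return the mean of the largest group of points lying within `radius` of one seed point.

    `points` is a list of (x, y) coordinate predictions, one per view,
    already mapped back to the full-screenshot coordinate frame.
    """
    if not points:
        raise ValueError("at least one prediction is required")
    best_members = []
    for px, py in points:
        # All predictions that fall inside the disc centred on this seed.
        members = [
            (qx, qy)
            for qx, qy in points
            if (qx - px) ** 2 + (qy - py) ** 2 <= radius ** 2
        ]
        if len(members) > len(best_members):
            best_members = members
    n = len(best_members)
    return (
        sum(x for x, _ in best_members) / n,
        sum(y for _, y in best_members) / n,
    )
```

With three views agreeing near (102, 102) and one outlier at (500, 40), the outlier is ignored and the centroid of the agreeing group is returned.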
Beyond Real Weights: Hypercomplex Representations for Stable Quantization
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
comment: Accepted in Winter Conference on Applications of Computer Vision (WACV) 2026
Thinking with Images via Self-Calling Agent
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task to atomic subtasks and invokes its virtual replicas, i.e. parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
comment: Code would be released at https://github.com/YWenxi/think-with-images-through-self-calling soon
OCCDiff: Occupancy Diffusion Model for High-Fidelity 3D Building Reconstruction from Noisy Point Clouds
A major challenge in reconstructing buildings from LiDAR point clouds lies in accurately capturing building surfaces under varying point densities and noise interference. To flexibly obtain high-quality 3D building profiles at diverse resolutions, we propose OCCDiff, which applies latent diffusion in the occupancy function space. OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. Moreover, a point encoder is proposed to provide condition features for diffusion learning, constrain the final occupancy prediction of the occupancy decoder, and supply multi-modal features for latent generation to the latent encoder. To further enhance model performance, a multi-task training strategy is employed, ensuring that the point encoder learns diverse and robust feature representations. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy data.
Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models
Conditional diffusion models rely on language-to-image alignment methods to steer the generation towards semantically accurate outputs. Despite the success of this architecture, misalignment and hallucinations remain common issues and require automatic misalignment detection tools to improve quality, for example by applying them in a Best-of-N (BoN) post-generation setting. Unfortunately, measuring the alignment after the generation is an expensive step since we need to wait for the overall generation to finish to determine prompt adherence. In contrast, this work hypothesizes that text/image misalignments can be detected early in the denoising process, enabling real-time alignment assessment without waiting for the complete generation. In particular, we propose NoisyCLIP, a method that measures semantic alignment in the noisy latent space. This work is the first to explore and benchmark prompt-to-latent misalignment detection during image generation using dual encoders in the reverse diffusion process. We evaluate NoisyCLIP qualitatively and quantitatively and find it reduces computational cost by 50% while achieving 98% of CLIP alignment performance in BoN settings. This approach enables real-time alignment assessment during generation, reducing costs without sacrificing semantic fidelity.
Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models
Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs' sophisticated multi-step reasoning processes that analyze environmental cues. We introduce ReasonBreak, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute GeoPrivacy-6K, a comprehensive dataset comprising 6,341 ultra-high-resolution images ($\geq$2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4% improvement in tract-level protection (33.8% vs 19.4%) and nearly doubling block-level protection (33.5% vs 16.8%). This work establishes a new paradigm for privacy protection against reasoning-based threats.
Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions
Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data. Project Website: https://jiann-li.github.io/mimic2dm/
On-the-fly Large-scale 3D Reconstruction from Multi-Camera Rigs
Recent advances in 3D Gaussian Splatting (3DGS) have enabled efficient free-viewpoint rendering and photorealistic scene reconstruction. While on-the-fly extensions of 3DGS have shown promise for real-time reconstruction from monocular RGB streams, they often fail to achieve complete 3D coverage due to the limited field of view (FOV). Employing a multi-camera rig fundamentally addresses this limitation. In this paper, we present the first on-the-fly 3D reconstruction framework for multi-camera rigs. Our method incrementally fuses dense RGB streams from multiple overlapping cameras into a unified Gaussian representation, achieving drift-free trajectory estimation and efficient online reconstruction. We propose a hierarchical camera initialization scheme that enables coarse inter-camera alignment without calibration, followed by a lightweight multi-camera bundle adjustment that stabilizes trajectories while maintaining real-time performance. Furthermore, we introduce a redundancy-free Gaussian sampling strategy and a frequency-aware optimization scheduler to reduce the number of Gaussian primitives and the required optimization iterations, thereby maintaining both efficiency and reconstruction fidelity. Our method reconstructs hundreds of meters of 3D scenes within just 2 minutes using only raw multi-camera video streams, demonstrating unprecedented speed, robustness, and fidelity for on-the-fly 3D scene reconstruction.
Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions
Diffusion models, which gradually denoise random noise into meaningful images, are usually evaluated by their final outputs. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that achieve a better balance of semantic accuracy and content preservation than strong baselines. Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
comment: Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time rendering of various Gaussian Splatting representations and meshes. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug-in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.
comment: Project page: https://visionary-laboratory.github.io/visionary
ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention
Drag-based image editing aims to modify visual content following user-specified drag operations. Although existing methods have made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, leading to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models, such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details, without the need for finetuning or inversion. First, ContextDrag introduces a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which positionally re-encodes the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
Team-Aware Football Player Tracking with SAM: An Appearance-Based Approach to Occlusion Recovery
Football player tracking is challenged by frequent occlusions, similar appearances, and rapid motion in crowded scenes. This paper presents a lightweight SAM-based tracking method combining the Segment Anything Model (SAM) with CSRT trackers and jersey color-based appearance models. We propose a team-aware tracking system that uses SAM for precise initialization and HSV histogram-based re-identification to improve occlusion recovery. Our evaluation measures three dimensions: processing speed (FPS and memory), tracking accuracy (success rate and box stability), and robustness (occlusion recovery and identity consistency). Experiments on football video sequences show that the approach achieves 7.6-7.7 FPS with stable memory usage (~1880 MB), maintaining 100 percent tracking success in light occlusions and 90 percent in crowded penalty-box scenarios with 5 or more players. Appearance-based re-identification recovers 50 percent of heavy occlusions, demonstrating the value of domain-specific cues. Analysis reveals key trade-offs: the SAM + CSRT combination provides consistent performance across crowd densities but struggles with long-term occlusions where players leave the frame, achieving only 8.66 percent re-acquisition success. These results offer practical guidelines for deploying football tracking systems under resource constraints, showing that classical tracker-based methods work well with continuous visibility but require stronger re-identification mechanisms for extended absences.
comment: 8 pages, 5 figures
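The HSV histogram-based re-identification mentioned in the abstract above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: it uses only the hue channel, pure-Python lists of RGB pixels instead of image crops, histogram intersection as the similarity measure, and hypothetical names (`hue_histogram`, `reidentify`, the gallery keys).

```python
import colorsys


def hue_histogram(pixels, bins=8):
    """Normalized hue histogram of RGB pixels given as (r, g, b) in 0..255."""
    hist = [0.0] * bins
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * bins), bins - 1)] += 1.0
    total = sum(hist) or 1.0
    return [count / total for count in hist]


def histogram_intersection(a, b):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


def reidentify(candidate_pixels, gallery):
    """Return the gallery identity whose stored histogram best matches."""
    cand = hue_histogram(candidate_pixels)
    return max(gallery, key=lambda pid: histogram_intersection(cand, gallery[pid]))


# Hypothetical appearance models built before an occlusion.
gallery = {
    "red_7": hue_histogram([(220, 20, 30)] * 50),
    "blue_10": hue_histogram([(20, 30, 220)] * 50),
}
# A detection that reappears after the occlusion, wearing a reddish jersey.
match = reidentify([(200, 25, 40)] * 50, gallery)
```

Here `match` resolves to the red-jersey identity, because the reappearing pixels fall in the same hue bin as that stored model.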
Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts
Subset selection-based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID-OOD sets, we find that the reliability of existing subset-based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.
Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. Datasets, code, and models will be made available upon acceptance.
LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training
Surgical segmentation is pivotal for scene understanding yet remains hindered by annotation scarcity and semantic inconsistency across diverse procedures. Existing approaches typically fine-tune natural foundation models (e.g., SAM) with limited supervision, functioning merely as domain adapters rather than surgical foundation models. Consequently, they struggle to generalize across the vast variability of surgical targets. To bridge this gap, we present LapFM, a foundation model designed to evolve robust segmentation capabilities from massive unlabeled surgical images. Distinct from medical foundation models relying on inefficient self-supervised proxy tasks, LapFM leverages a Hierarchical Concept Evolving Pre-training paradigm. First, we establish a Laparoscopic Concept Hierarchy (LCH) via a hierarchical mask decoder with parent-child query embeddings, unifying diverse entities (i.e., Anatomy, Tissue, and Instrument) into a scalable knowledge structure with cross-granularity semantic consistency. Second, we propose a Confidence-driven Evolving Labeling that iteratively generates and filters pseudo-labels based on hierarchical consistency, progressively incorporating reliable samples from unlabeled images into training. This process yields LapBench-114K, a large-scale benchmark comprising 114K image-mask pairs. Extensive experiments demonstrate that LapFM significantly outperforms state-of-the-art methods, establishing new standards for granularity-adaptive generalization in universal laparoscopic segmentation. The source code is available at https://github.com/xq141839/LapFM.
SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking WACV 2026
Accurately recovering 6D poses in densely packed industrial bin-picking environments remains a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud in its vanilla version, or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions, thus keeping memory requirements at high resolutions feasible. In addition, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches have proven effective for long-range perception, their potential in close-range robotic applications remains underexplored. Our framework operates fully sparse, enabling high-resolution volumetric representations to capture fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting the 6D pose via a novel per-voxel voting strategy, allowing simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin picking scenarios.
comment: Accepted to WACV 2026. Preprint version
Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos with a limited number of frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. It is also equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show clear performance gains from OneClip-RAG over the base MLLMs, e.g., boosting InternVL2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos
Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.
Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries
Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates conveyor-belt EM systems with six similar-looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: https://github.com/msamdk/Fish_Re_Identification.git
comment: The paper has been accepted for publication at Northern Lights Deep Learning (NLDL) Conference 2025
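To make the hard triplet mining mentioned in the abstract above concrete, here is a minimal pure-Python sketch of one common variant, batch-hard mining, where each anchor is paired with its farthest same-identity sample and its closest different-identity sample. The choice of this variant, the Euclidean distance, the margin value, and the function name are assumptions for illustration; the abstract does not specify them.

```python
import math


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Mean batch-hard triplet loss over all anchors that have both a
    positive (same label) and a negative (different label) in the batch."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        # Distances to other samples with the same identity.
        pos = [math.dist(anchor, e)
               for j, (e, l) in enumerate(zip(embeddings, labels))
               if l == label and j != i]
        # Distances to samples with a different identity.
        neg = [math.dist(anchor, e)
               for e, l in zip(embeddings, labels) if l != label]
        if not pos or not neg:
            continue  # anchor cannot form a triplet in this batch
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2D embeddings for two fish identities, "a" and "b".
embeddings = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
labels = ["a", "a", "b", "b"]
loss = batch_hard_triplet_loss(embeddings, labels, margin=0.5)
```

In this toy batch the identities overlap along the x-axis, so every anchor violates the margin and the loss is positive; well-separated identities would give a loss of zero.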
Detection of Digital Facial Retouching utilizing Face Beauty Information
Facial retouching to beautify images is widespread in social media and advertisements, and it is even applied in professional photo studios to make individuals appear younger and to remove wrinkles and skin impurities. Generally speaking, this is done to enhance beauty. This is not a problem in itself, but it becomes one when retouched images are used as biometric samples and enrolled in a biometric system. Since previous work has proven facial retouching to be a challenge for face recognition systems, the detection of facial retouching becomes increasingly necessary. This work studies and analyzes changes in the outputs of beauty assessment algorithms on retouched images, assesses different feature extraction methods based on artificial intelligence in order to improve retouching detection, and evaluates whether face beauty can be exploited to enhance the detection rate. In a scenario where the attacking retouching algorithm is unknown, this work achieved 1.1% D-EER on single image detection.
Simultaneous Enhancement and Noise Suppression under Complex Illumination Conditions
Under challenging light conditions, captured images often suffer from various degradations, leading to a decline in the performance of vision-based applications. Although numerous methods have been proposed to enhance image quality, they either significantly amplify inherent noise or are only effective under specific illumination conditions. To address these issues, we propose a novel framework for simultaneous enhancement and noise suppression under complex illumination conditions. Firstly, a gradient-domain weighted guided filter (GDWGIF) is employed to accurately estimate illumination and improve image quality. Next, the Retinex model is applied to decompose the captured image into separate illumination and reflection layers. These layers undergo parallel processing, with the illumination layer being corrected to optimize lighting conditions and the reflection layer enhanced to improve image quality. Finally, the dynamic range of the image is optimized through multi-exposure fusion and a linear stretching strategy. The proposed method is evaluated on real-world datasets obtained from practical applications. Experimental results demonstrate that our proposed method achieves better performance compared to state-of-the-art methods in both contrast enhancement and noise suppression.
comment: The paper has been accepted and officially published by IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT
The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much more slowly than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic -- the persistence of norm disparity and the resulting asymmetric update rates -- is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
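The norm-alignment idea in the abstract above can be illustrated with a small pure-Python sketch: a LayerNorm applied to a high-norm visual token, with its scale set so that the output norm matches a typical text-token norm. The initialization rule used here (scale = target norm divided by the square root of the dimension) is an assumption for illustration only; the abstract says the layer is "carefully initialized" without giving the rule, and the toy token values are made up.

```python
import math


def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard LayerNorm over one token vector with scalar gamma and beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


# Toy tokens: the visual token has a much larger norm than the text token.
visual_token = [40.0, -55.0, 62.0, -38.0]
text_token = [0.4, -0.6, 0.5, -0.2]

# With beta = 0, LayerNorm output has norm close to gamma * sqrt(d), so this
# (assumed) initialization maps visual tokens onto the text-token norm scale.
gamma = l2_norm(text_token) / math.sqrt(len(visual_token))
aligned_visual = layer_norm(visual_token, gamma=gamma)

norm_before = l2_norm(visual_token)
norm_after = l2_norm(aligned_visual)
norm_text = l2_norm(text_token)
```

Before the layer the visual norm is roughly two orders of magnitude above the text norm; after it, `norm_after` and `norm_text` agree to within the small `eps` correction.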
SCU-CGAN: Enhancing Fire Detection through Synthetic Fire Image Generation and Dataset Augmentation
Fire has long been linked to human life, causing severe disasters and losses. Early detection is crucial, and with the rise of home IoT technologies, household fire detection systems have emerged. However, the lack of sufficient fire datasets limits the performance of detection models. We propose the SCU-CGAN model, which integrates U-Net, CBAM, and an additional discriminator to generate realistic fire images from nonfire images. We evaluate the image quality and confirm that SCU-CGAN outperforms existing models. Specifically, SCU-CGAN achieved a 41.5% improvement in KID score compared to CycleGAN, demonstrating the superior quality of the generated fire images. Furthermore, experiments demonstrate that the augmented dataset significantly improves the accuracy of fire detection models without altering their structure. For the YOLOv5 nano model, the most notable improvement was observed in the mAP@0.5:0.95 metric, which increased by 56.5%, highlighting the effectiveness of the proposed approach.
comment: Accepted for main track at MobieSec 2024 (not published in the proceedings)
Conditional Morphogenesis: Emergent Generation of Structural Digits via Neural Cellular Automata
Biological systems exhibit remarkable morphogenetic plasticity, where a single genome can encode various specialized cellular structures triggered by local chemical signals. In the domain of Deep Learning, Differentiable Neural Cellular Automata (NCA) have emerged as a paradigm to mimic this self-organization. However, existing NCA research has predominantly focused on continuous texture synthesis or single-target object recovery, leaving the challenge of class-conditional structural generation largely unexplored. In this work, we propose a novel Conditional Neural Cellular Automata (c-NCA) architecture capable of growing distinct topological structures - specifically MNIST digits - from a single generic seed, guided solely by a spatially broadcast class vector. Unlike traditional generative models (e.g., GANs, VAEs) that rely on global receptive fields, our model enforces strict locality and translation equivariance. We demonstrate that by injecting a one-hot condition into the cellular perception field, a single set of local rules can learn to break symmetry and self-assemble into ten distinct geometric attractors. Experimental results show that our c-NCA achieves stable convergence, correctly forming digit topologies from a single pixel, and exhibits robustness characteristic of biological systems. This work bridges the gap between texture-based NCAs and structural pattern formation, offering a lightweight, biologically plausible alternative for conditional generation.
comment: 13 pages, 5 figures. Code available at: https://github.com/alisakour/Conditional-NCA-Digits
TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels NeurIPS 2025
Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
comment: Accepted by NeurIPS 2025. Project Page: https://igl-hkust.github.io/TrackingWorld.github.io/
DINO-BOLDNet: A DINOv3-Guided Multi-Slice Attention Network for T1-to-BOLD Generation
Generating BOLD images from T1w images offers a promising solution for recovering missing BOLD information and enabling downstream tasks when BOLD images are corrupted or unavailable. Motivated by this, we propose DINO-BOLDNet, a DINOv3-guided multi-slice attention framework that integrates a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder. The model uses DINOv3 to extract within-slice structural representations, and a separate slice-attention module to fuse contextual information across neighboring slices. A multi-scale generation decoder then restores fine-grained functional contrast, while a DINO-based perceptual loss encourages structural and textural consistency between predictions and ground-truth BOLD in the transformer feature space. Experiments on a clinical dataset of 248 subjects show that DINO-BOLDNet surpasses a conditional GAN baseline in both PSNR and MS-SSIM. To our knowledge, this is the first framework capable of generating mean BOLD images directly from T1w images, highlighting the potential of self-supervised transformer guidance for structural-to-functional mapping.
HybridSplat: Fast Reflection-baked Gaussian Tracing using Hybrid Splatting
Rendering complex reflections of real-world scenes using 3D Gaussian splatting is a promising solution for photorealistic novel view synthesis, but it still faces bottlenecks, especially in rendering speed and memory storage. This paper proposes a new Hybrid Splatting (HybridSplat) mechanism for Gaussian primitives. Our key idea is a new reflection-baked Gaussian tracing, which bakes the view-dependent reflection within each Gaussian primitive while rendering the reflection using tile-based Gaussian splatting. We then integrate the reflective Gaussian primitives with base Gaussian primitives using a unified hybrid splatting framework for high-fidelity scene reconstruction. Moreover, we introduce a pipeline-level acceleration for the hybrid splatting, and reflection-sensitive Gaussian pruning to reduce the model size, thus achieving much faster rendering speed and lower memory storage while preserving the reflection rendering quality. In extensive evaluation, HybridSplat accelerates rendering by about 7x across complex reflective scenes from Ref-NeRF and NeRF-Casting, with 4x fewer Gaussian primitives than similar ray-tracing based Gaussian splatting baselines, serving as a new state-of-the-art method especially for complex reflective scenes.
Bi^2MAC: Bimodal Bi-Adaptive Mask-Aware Convolution for Remote Sensing Pansharpening
Pansharpening aims to fuse a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to generate a high-resolution multispectral image (HRMS). Conventional deep learning-based methods are inherently limited in their ability to adapt to regional heterogeneity within feature representations. Although various adaptive convolution methods have been proposed to address this limitation, they often suffer from excessive computational costs and a limited ability to capture heterogeneous regions in remote sensing images effectively. To overcome these challenges, we propose Bimodal Bi-Adaptive Mask-Aware Convolution (Bi^2MAC), which effectively exploits information from different types of regions while intelligently allocating computational resources. Specifically, we design a lightweight module to generate both soft and hard masks, which are used to modulate the input features preliminarily and to guide different types of regions into separate processing branches, respectively. Redundant features are directed to a compact branch for low-cost global processing. In contrast, heterogeneous features are routed to a focused branch that invests more computational resources for fine-grained modeling. Extensive experiments on multiple benchmark datasets demonstrate that Bi^2MAC achieves state-of-the-art (SOTA) performance while requiring substantially lower training time and parameter counts, and the lowest computational cost among adaptive convolution models.
PointDico: Contrastive 3D Representation Learning Guided by Diffusion Models IJCNN 2025
Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because of its unordered structure and uneven density. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Masked Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrastive models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose PointDico, a novel model that seamlessly integrates these methods. PointDico learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. PointDico achieves a new state-of-the-art in 3D representation learning, e.g., 94.32% accuracy on ScanObjectNN and 86.5% Inst. mIoU on ShapeNetPart.
comment: Accepted by IJCNN 2025
Interpreting Structured Perturbations in Image Protection Methods for Diffusion Models
Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.
comment: 32 pages, 17 figures, 1 table, 5 algorithms, preprint
Low Rank Support Quaternion Matrix Machine
For color image classification, input features are conventionally represented as vectors, matrices, or third-order tensors in the real field. Inspired by the success of quaternion data modeling for color images in image recovery and denoising tasks, we propose a novel method for color image classification, named the Low-rank Support Quaternion Matrix Machine (LSQMM), in which the RGB channels are treated as pure quaternions to effectively preserve the intrinsic coupling relationships among channels via the quaternion algebra. To promote the low-rank structures that result from strongly correlated color channels, a quaternion nuclear norm regularization term, serving as a natural extension of the conventional matrix nuclear norm to the quaternion domain, is added to the hinge loss in our LSQMM model. An Alternating Direction Method of Multipliers (ADMM)-based iterative algorithm is designed to effectively solve the proposed quaternion optimization model. Experimental results on multiple color image classification datasets demonstrate that our proposed classification approach exhibits advantages in classification accuracy, robustness, and computational efficiency, compared to several state-of-the-art methods using support vector machines, support matrix machines, and support tensor machines.
GeoDiffMM: Geometry-Guided Conditional Diffusion for Motion Magnification
Video Motion Magnification (VMM) amplifies subtle macroscopic motions to a perceptible level. Mainstream Eulerian approaches address amplification-induced noise via decoupled representation learning, such as texture, shape, and frequency schemes, but they still struggle to separate photon noise from true micro-motion when motion displacements are very small. We propose GeoDiffMM, a novel diffusion-based Lagrangian VMM framework conditioned on optical flow as a geometric cue, enabling structurally consistent motion magnification. Specifically, we design a Noise-free Optical Flow Augmentation strategy that synthesizes diverse nonrigid motion fields without photon noise as supervision, helping the model learn more accurate geometry-aware optical flow and generalize better. Next, we develop a Diffusion Motion Magnifier that conditions the denoising process on (i) optical flow as a geometry prior and (ii) a learnable magnification factor controlling magnitude, thereby selectively amplifying motion components consistent with scene semantics and structure while suppressing content-irrelevant perturbations. Finally, we perform Flow-based Video Synthesis to map the amplified motion back to the image domain with high fidelity. Extensive experiments on real and synthetic datasets show that GeoDiffMM outperforms state-of-the-art methods and significantly improves motion magnification.
Detecting Dental Landmarks from Intraoral 3D Scans: the 3DTeethLand challenge MICCAI 2024
Teeth landmark detection is a critical task in modern clinical orthodontics. The precise identification of these landmarks enables advanced diagnostics, facilitates personalized treatment strategies, and supports more effective monitoring of treatment progress in clinical dentistry. However, several significant challenges may arise due to the intricate geometry of individual teeth and the substantial variations observed across different individuals. To address these complexities, the development of advanced techniques, especially through the application of deep learning, is essential for the precise and reliable detection of 3D tooth landmarks. In this context, the 3DTeethLand challenge was held in collaboration with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2024, calling for algorithms focused on teeth landmark detection from intraoral 3D scans. This challenge introduced the first publicly available dataset for 3D teeth landmark detection, offering a valuable resource to assess the state-of-the-art methods in this task and encourage the community to provide methodological contributions towards the resolution of this problem with significant clinical implications.
comment: MICCAI 2024, 3DTeethLand, Challenge report, under review
GeoDM: Geometry-aware Distribution Matching for Dataset Distillation
Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, making them only capture linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called GeoDM, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.
Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation
For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, an AI-era successor to Perlin noise that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation, enabling seamless, real-time synthesis of boundless landscapes. A hierarchical stack of diffusion models couples planetary context with local detail, while a compact Laplacian encoding stabilizes outputs across Earth-scale dynamic ranges. An open-source infinite-tensor framework supports constant-memory manipulation of unbounded tensors, and few-step consistency distillation enables efficient generation. Together, these components establish diffusion models as a practical foundation for procedural world generation, capable of synthesizing entire planets coherently, controllably, and without limits.
comment: Project website: https://xandergos.github.io/terrain-diffusion/ Code: https://github.com/xandergos/terrain-diffusion/
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
Self-Reinforced Deep Priors for Reparameterized Full Waveform Inversion
Full waveform inversion (FWI) has become a widely adopted technique for high-resolution subsurface imaging. However, its inherent strong nonlinearity often results in convergence toward local minima. Recently, deep image prior-based reparameterized FWI (DIP-FWI) has been proposed to alleviate the dependence on massive training data. By exploiting the spectral bias and implicit regularization in the neural network architecture, DIP-FWI can effectively avoid local minima and reconstruct more geologically plausible velocity models. Nevertheless, existing DIP-FWI methods typically use a fixed random input throughout the inversion process, which fails to utilize the mapping and correlation between the input and output of the network. Moreover, under complex geological conditions, the lack of an informative prior in the input can exacerbate the ill-posedness of the inverse problem, leading to artifacts and unstable reconstructions. To address these limitations, we propose a self-reinforced DIP-FWI (SRDIP-FWI) framework, in which a steering algorithm alternately updates both the network parameters and the input at each iteration using feedback from the current network output. This design allows adaptive structural enhancement and improved regularization, thereby effectively mitigating the ill-posedness in FWI. Additionally, we analyze the spectral bias of the network in SRDIP-FWI and quantify its role in multiscale velocity model building. Synthetic tests and a field land data application demonstrate that SRDIP-FWI achieves superior resolution, improved accuracy and greater depth penetration compared to multiscale FWI. More importantly, SRDIP-FWI eliminates the need for manual frequency-band selection and time-window picking, substantially simplifying the inversion workflow. Overall, the proposed method provides a novel, adaptive and robust framework for accurate subsurface velocity model reconstruction.
comment: Submitted to GEOPHYSICS
PAVAS: Physics-Aware Video-to-Audio Synthesis
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation ICRA 2025
We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.
comment: Published and Presented at 3rd Workshop on Human-Centric Multilateral Teleoperation in ICRA 2025
EgoX: Egocentric Video Generation from a Single Exocentric Video
Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width-wise and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
comment: 21 pages, project page: https://keh0t0.github.io/EgoX
RLCNet: An end-to-end deep learning framework for simultaneous online calibration of LiDAR, RADAR, and Camera
Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.
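The online smoothing step described above (a weighted moving average combined with outlier rejection) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not RLCNet's actual procedure: the class name, the exponentially decaying weights, the median-absolute-deviation test, and all default values are hypothetical choices made for this example, and treating rotation parameters as plain numbers is a simplification.

```python
from collections import deque
from statistics import median


class CalibrationSmoother:
    """Hypothetical smoother for a stream of calibration parameter vectors."""

    def __init__(self, window=8, decay=0.7, threshold=3.0, min_scale=1e-3):
        self.history = deque(maxlen=window)  # most recent accepted predictions
        self.decay = decay                   # weight ratio between successive samples
        self.threshold = threshold           # rejection threshold in robust-scale units
        self.min_scale = min_scale           # floor so a zero spread does not divide the test

    def _is_outlier(self, sample):
        # Accept everything until there is enough history to judge against.
        if len(self.history) < 3:
            return False
        for i, value in enumerate(sample):
            column = [h[i] for h in self.history]
            center = median(column)
            mad = median(abs(v - center) for v in column)
            scale = max(mad, self.min_scale)
            if abs(value - center) > self.threshold * scale:
                return True
        return False

    def update(self, sample):
        """Add a new prediction unless it is rejected, then return the estimate."""
        sample = tuple(float(v) for v in sample)
        if not self._is_outlier(sample):
            self.history.append(sample)
        return self.estimate()

    def estimate(self):
        """Weighted moving average; the newest accepted sample has weight 1."""
        if not self.history:
            return None
        n = len(self.history)
        weights = [self.decay ** (n - 1 - k) for k in range(n)]
        total = sum(weights)
        dims = len(self.history[0])
        return tuple(
            sum(w * h[i] for w, h in zip(weights, self.history)) / total
            for i in range(dims)
        )
```

In use, each per-frame network prediction would be passed to `update`, and the returned tuple would serve as the current calibration. A very tight rejection test such as this one can also reject genuine drift, so the threshold and scale floor would need tuning in practice.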
SFP: Real-World Scene Recovery Using Spatial and Frequency Priors
Scene recovery serves as a critical task for various computer vision applications. Existing methods typically rely on a single prior, which is inherently insufficient to handle multiple degradations, or employ complex network architectures trained on synthetic data, which suffer from poor generalization for diverse real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery. In the spatial domain, we observe that the inverse of the degraded image exhibits a projection along its spectral direction that resembles the scene transmission. Leveraging this spatial prior, the transmission map is estimated to recover the scene from scattering degradation. In the frequency domain, a mask is constructed for adaptive frequency enhancement, with two parameters estimated using our proposed novel priors. Specifically, one prior assumes that the mean intensity of the degraded image's direct current (DC) components across three channels in the frequency domain closely approximates that of each channel in the clear image. The second prior is based on the observation that, for clear images, the magnitude of low radial frequencies below 0.001 constitutes approximately 1% of the total spectrum. Finally, we design a weighted fusion strategy to integrate spatial-domain restoration, frequency-domain enhancement, and salient features from the input image, yielding the final recovered result. Extensive evaluations demonstrate the effectiveness and superiority of our proposed SFP for scene recovery under various degradation conditions.
comment: 10 pages, 13 figures
Query-aware Hub Prototype Learning for Few-Shot 3D Point Cloud Semantic Segmentation
Few-shot 3D point cloud semantic segmentation (FS-3DSeg) aims to segment novel classes with only a few labeled samples. However, existing metric-based prototype learning methods generate prototypes solely from the support set, without considering their relevance to query data. This often results in prototype bias, where prototypes overfit support-specific characteristics and fail to generalize to the query distribution, especially in the presence of distribution shifts, which leads to degraded segmentation performance. To address this issue, we propose a novel Query-aware Hub Prototype (QHP) learning method that explicitly models semantic correlations between support and query sets. Specifically, we propose a Hub Prototype Generation (HPG) module that constructs a bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query-relevant prototypes that better capture cross-set semantics. To further mitigate the influence of bad hubs and ambiguous prototypes near class boundaries, we introduce a Prototype Distribution Optimization (PDO) module, which employs a purity-reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes closer to their corresponding class centers. Extensive experiments on S3DIS and ScanNet demonstrate that QHP achieves substantial performance gains over state-of-the-art methods, effectively narrowing the semantic gap between prototypes and query sets in FS-3DSeg.
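The hub-identification idea above can be illustrated with a toy sketch: link each query point to its nearest support points and count how often each support point is linked. The function name, the use of Euclidean distance, and the k-nearest-neighbor linking rule are assumptions made for this example; the abstract does not state how the bipartite graph in the HPG module is actually built.

```python
from collections import Counter
import math


def find_support_hubs(query_feats, support_feats, k=2, num_hubs=1):
    """Return indices of the support points most often linked to by query points.

    Each query point is connected to its k nearest support points, which forms
    a bipartite query-support graph; hubs are the support nodes with the
    highest number of incoming links.
    """
    link_counts = Counter()
    for q in query_feats:
        ranked = sorted(
            range(len(support_feats)),
            key=lambda j: math.dist(q, support_feats[j]),
        )
        link_counts.update(ranked[:k])
    return [idx for idx, _ in link_counts.most_common(num_hubs)]


# Toy 2D features: support point 0 is the nearest neighbor of two of the three queries.
support = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.0)]
queries = [(0.1, 0.0), (0.2, 0.0), (0.4, 0.1)]
hubs = find_support_hubs(queries, support, k=1, num_hubs=1)
```

Prototypes built from such hubs depend on the query set as well as the support set, which is the sense in which they are query-aware.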
Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection AAAI-26
Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
comment: AAAI-26
Residual-SwinCA-Net: A Channel-Aware Integrated Residual CNN-Swin Transformer for Malignant Lesion Segmentation in BUSI
This study proposes a novel deep hybrid segmentation framework, Residual-SwinCA-Net, which incorporates residual CNN modules to extract locally correlated and robust features. Furthermore, for learning global dependencies, Swin Transformer blocks are customized using internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. First, a Laplacian-of-Gaussian regional operator is applied to enhance tissue continuity, suppress ultrasound noise, and accentuate fine structural transitions, and a boundary-oriented operator is incorporated to maintain the morphological integrity of malignant lesion contours. Subsequently, a stage-wise contraction strategy progressively reduces the feature maps to capture scale invariance and improve robustness to structural variability. In addition, prior to augmentation, each decoder level integrates a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. The MSCAS selectively emphasizes salient encoder maps and retains discriminative global context and complementary local structures at minimal computational cost while suppressing redundant activations. Finally, the Pixel-Attention module encodes class-relevant spatial cues by adaptively weighting malignant lesion pixels while suppressing background interference. The Residual-SwinCA-Net and existing CNN/ViT techniques have been implemented on the publicly available BUSI dataset. The proposed Residual-SwinCA-Net framework outperformed the existing techniques, achieving 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation. The proposed Residual-SwinCA-Net framework improves BUSI lesion diagnostic performance and strengthens timely clinical decision-making.
comment: 26 Pages, 10 Figures, 4 Tables
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
FastBEV++: Fast by Algorithm, Deployable by Design
The advancement of camera-only Bird's-Eye-View (BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the "Deployable by Design" principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator-native primitives (e.g., Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures fully TensorRT-native portability. Concurrently, our framework is "Fast by Algorithm", leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation. Empirical validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state-of-the-art 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g., Tesla T4). By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: https://github.com/ymlab/advanced-fastbev
Geometry-Aware Sparse Depth Sampling for High-Fidelity RGB-D Depth Completion in Robotic Systems
Accurate three-dimensional perception is essential for modern industrial robotic systems that perform manipulation, inspection, and navigation tasks. RGB-D and stereo vision sensors are widely used for this purpose, but the depth maps they produce are often noisy, incomplete, or biased due to sensor limitations and environmental conditions. Depth completion methods aim to generate dense, reliable depth maps from RGB images and sparse depth input. However, a key limitation in current depth completion pipelines is the unrealistic generation of sparse depth: sparse pixels are typically selected uniformly at random from dense ground-truth depth, ignoring the fact that real sensors exhibit geometry-dependent and spatially nonuniform reliability. In this work, we propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure. The sparse depth samples are then drawn according to this reliability distribution. We integrate this sampling method with the Marigold-DC diffusion-based depth completion model and evaluate it on NYU Depth v2 using the standard metrics. Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions that better reflect real sensor behavior.
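The sampling step above, drawing sparse depth pixels according to a per-pixel reliability map, can be sketched as weighted sampling without replacement. The sketch assumes the reliability map has already been computed (the PCA-based normal estimation is not shown) and uses the Efraimidis-Spirakis key trick as one possible sampler; the function name and parameters are hypothetical and not taken from the paper.

```python
import random


def sample_sparse_pixels(reliability, k, seed=0):
    """Pick up to k distinct pixels, favoring those with higher reliability.

    `reliability` is a 2D list of non-negative scores. Each pixel with a
    positive score w receives the random key u ** (1 / w) with u uniform in
    [0, 1); keeping the k largest keys gives weighted sampling without
    replacement (Efraimidis-Spirakis).
    """
    rng = random.Random(seed)
    keyed = []
    for row_idx, row in enumerate(reliability):
        for col_idx, weight in enumerate(row):
            if weight > 0:
                key = rng.random() ** (1.0 / weight)
                keyed.append((key, row_idx, col_idx))
    keyed.sort(reverse=True)
    return [(r, c) for _, r, c in keyed[:k]]


# Pixels with zero reliability are never selected.
reliability_map = [
    [0.0, 1.0],
    [1.0, 0.0],
]
chosen = sample_sparse_pixels(reliability_map, k=2)
```

The selected coordinates would then index the dense depth map to produce the sparse depth input, in place of uniform random selection.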
Rethinking Few-Shot Image Fusion: Granular Ball Priors Enable General-Purpose Deep Fusion
In image fusion tasks, the absence of real fused images as priors forces most deep learning approaches to rely on large-scale paired datasets to extract global weighting features or to generate pseudo-supervised images through algorithmic constructions. Unlike previous methods, this work re-examines prior-guided learning under few-shot conditions by introducing rough set theory. We regard the traditional algorithm as a prior generator, while the network re-infers and adaptively optimizes the prior through a dynamic loss function, reducing the inference burden of the network and enabling effective few-shot learning. To provide the prior, we propose the Granular Ball Pixel Computation (GBPC) algorithm. GBPC models pixel pairs in a luminance subspace using meta-granular balls and mines intra-ball information at multiple granular levels. At the fine-grained level, sliding granular balls assign adaptive weights to individual pixels to produce pixel-level prior fusion. At the coarse-grained level, the algorithm performs split computation within a single image to estimate positive and boundary domain distributions, enabling modality awareness and prior confidence estimation, which dynamically guide the loss weighting. The network and the algorithmic prior are coupled through the loss function to form an integrated framework. Thanks to the dynamic weighting mechanism, the network can adaptively adjust to different priors during training, enhancing its perception and fusion capability across modalities. We name this framework GBFF (Granular Ball Fusion Framework). Experiments on four fusion tasks demonstrate that even with only ten training image pairs per task, GBFF achieves superior performance in both visual quality and model compactness. Code is available at: https://github.com/DMinjie/GBFF
Learning Geodesics of Geometric Shape Deformations From Images
This paper presents a novel method, named geodesic deformable networks (GDN), that for the first time enables the learning of geodesic flows of deformation fields derived from images. In particular, the capability of our proposed GDN to predict geodesics is important for quantifying and comparing deformable shapes presented in images. The geodesic deformations, also known as optimal transformations that align pairwise images, are often parameterized by a time sequence of smooth vector fields governed by nonlinear differential equations. An extensive literature has focused on learning the initial conditions (e.g., initial velocity fields) based on registration networks. However, the definition of geodesics central to deformation-based shape analysis is blind to the networks. To address this problem, we carefully develop an efficient neural operator to treat the geodesics as unknown mapping functions learned from the latent deformation spaces. A composition of integral operators and smooth activation functions is then formulated to effectively approximate such mappings. In contrast to previous works, our GDN jointly optimizes a newly defined geodesic loss, which adds additional benefits to promote the network regularizability and generalizability. We demonstrate the effectiveness of GDN on both 2D synthetic data and 3D real brain magnetic resonance imaging (MRI).
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:019
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. The human evaluation shows MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.
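For readers unfamiliar with classifier-free guidance, the sketch below shows the standard way a conditional and an unconditional noise prediction are combined in diffusion models. It illustrates the general technique only: the function name and the plain-list representation are hypothetical, and the abstract does not describe how MiniGPT-5 conditions its predictions on vokens.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions element-wise.

    Returns eps_uncond + guidance_scale * (eps_cond - eps_uncond).
    A scale of 0 gives the unconditional prediction, 1 gives the conditional
    prediction, and values above 1 push further toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(eps_uncond, eps_cond)
    ]


# Toy example with two-element "noise predictions".
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

With a scale of 2.0 the result moves twice as far from the unconditional prediction as the conditional one does, which typically strengthens adherence to the conditioning signal at some cost in sample diversity.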
The Missing Point in Vision Transformers for Universal Image Segmentation
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.
End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards
While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. Our implementation code is publicly available at: https://github.com/AHHHZ975/Differentiable-Texture-Learning
VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning AAAI 2026
Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.
comment: 15 pages, 6 figures, 13 tables; Accepted by AAAI 2026
SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass 3DV 2026
3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for extra optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architecture yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
comment: Accepted by 3DV 2026; Project Page: https://mengmouxu.github.io/SceneGen
Towards Task-Oriented Flying: Framework, Infrastructure, and Principles
Deploying robot learning methods to aerial robots in unstructured environments remains both challenging and promising. While recent advances in deep reinforcement learning (DRL) have enabled end-to-end flight control, the field still lacks systematic design guidelines and a unified infrastructure to support reproducible training and real-world deployment. We present a task-oriented framework for end-to-end DRL in quadrotors that integrates design principles for complex task specification and reveals the interdependencies among simulated task definition, training design principles, and physical deployment. Our framework involves software infrastructure, hardware platforms, and open-source firmware to support a full-stack learning infrastructure and workflow. Extensive empirical results demonstrate robust flight and sim-to-real generalization under real-world disturbances. By reducing the entry barrier for deploying learning-based controllers on aerial robots, our work lays a practical foundation for advancing autonomous flight in dynamic and unstructured environments.
Co-Seg++: Mutual Prompt-Guided Collaborative Learning for Versatile Medical Segmentation
Medical image analysis is critical yet challenged by the need to jointly segment organs or tissues and numerous instances for anatomical structures and tumor microenvironment analysis. Existing studies typically formulate different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-sequential prompt encoder (SSP-Encoder) to capture long-range spatial and sequential relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-art methods in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg-Plus.
comment: Accepted by TMI
High-Throughput Unsupervised Profiling of the Morphology of 316L Powder Particles for Use in Additive Manufacturing
Selective Laser Melting (SLM) is a powder-bed additive manufacturing technique whose part quality depends critically on feedstock morphology. However, conventional powder characterization methods are low-throughput and qualitative, failing to capture the heterogeneity of industrial-scale batches. We present an automated, machine learning framework that couples high-throughput imaging with shape extraction and clustering to profile metallic powder morphology at scale. We develop and evaluate three clustering pipelines: an autoencoder pipeline, a shape-descriptor pipeline, and a functional-data pipeline. Across a dataset of approximately 126,000 powder images (0.5-102 micrometer diameter), internal validity metrics identify the Fourier-descriptor + k-means pipeline as the most effective, achieving the lowest Davies-Bouldin index and highest Calinski-Harabasz score while maintaining sub-millisecond runtime per particle on a standard desktop workstation. Although the present work focuses on establishing the morphological-clustering framework, the resulting shape groups form a basis for future studies examining their relationship to flowability, packing density, and SLM part quality. Overall, this unsupervised learning framework enables rapid, automated assessment of powder morphology and supports tracking of shape evolution across reuse cycles, offering a path toward real-time feedstock monitoring in SLM workflows.
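The Fourier-descriptor + k-means pipeline named above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the helper names (`make_ellipse`, `fourier_descriptor`, `kmeans`), the number of harmonics, the normalization, the farthest-point initialization, and the synthetic ellipse contours that stand in for extracted particle outlines. Contours are assumed to be traversed counterclockwise.

```python
import cmath
import math

def make_ellipse(a, b, angle=0.0, cx=0.0, cy=0.0, n=64):
    """Sample n boundary points of a rotated, translated ellipse (synthetic stand-in shape)."""
    pts = []
    for i in range(n):
        t = 2 * math.pi * i / n
        x, y = a * math.cos(t), b * math.sin(t)
        pts.append((cx + x * math.cos(angle) - y * math.sin(angle),
                    cy + x * math.sin(angle) + y * math.cos(angle)))
    return pts

def fourier_descriptor(contour, n_harmonics=4):
    """Magnitudes of low-order Fourier coefficients of the complex contour x + iy.

    Skipping frequency 0 removes translation, taking magnitudes removes rotation
    and start-point dependence, and dividing by the largest magnitude removes scale.
    """
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    mags = []
    for k in range(1, n_harmonics + 1):
        for freq in (k, -k):
            c = sum(z[t] * cmath.exp(-2j * math.pi * freq * t / n) for t in range(n)) / n
            mags.append(abs(c))
    scale = max(mags) or 1.0
    return [m / scale for m in mags]

def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(dist2(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two round and two elongated shapes at different sizes, rotations, and positions.
shapes = [make_ellipse(1, 1), make_ellipse(3, 3, angle=0.3, cx=5.0),
          make_ellipse(2, 1), make_ellipse(4, 2, angle=1.0, cy=-2.0)]
features = [fourier_descriptor(s) for s in shapes]
labels = kmeans(features, k=2)
```

Because the descriptor discards position, orientation, and size, the two round shapes and the two elongated shapes end up in separate clusters regardless of how each particle happens to lie in the image.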
ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
comment: Project Page: https://nenhang.github.io/ContextGen/
PET Image Reconstruction Using Deep Diffusion Image Prior
Diffusion models have shown great promise in medical image denoising and reconstruction, but their application to Positron Emission Tomography (PET) imaging remains limited by tracer-specific contrast variability and high computational demands. In this work, we proposed an anatomical prior-guided PET image reconstruction method based on diffusion models, inspired by the deep diffusion image prior (DDIP) framework. The proposed method alternated between diffusion sampling and model fine-tuning guided by the PET sinogram, enabling the reconstruction of high-quality images from various PET tracers using a score function pretrained on a dataset of another tracer. To improve computational efficiency, the half-quadratic splitting (HQS) algorithm was adopted to decouple network optimization from iterative PET reconstruction. The proposed method was evaluated using one simulation and two clinical datasets. For the simulation study, a model pretrained on [$^{18}$F]FDG data was tested on [$^{18}$F]FDG data and amyloid-negative PET data to assess out-of-distribution (OOD) performance. For the clinical-data validation, ten low-dose [$^{18}$F]FDG datasets and one [$^{18}$F]Florbetapir dataset were tested on a model pretrained on data from another tracer. Experimental results show that the proposed PET reconstruction method can generalize robustly across tracer distributions and scanner types, providing an efficient and versatile reconstruction framework for low-dose PET imaging.
comment: 11 pages, 12 figures
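For readers unfamiliar with half-quadratic splitting, the generic textbook form of the algorithm is sketched below. The symbols are assumptions chosen for illustration: $y$ is the measured sinogram, $x$ the PET image, $z$ an auxiliary image, $L$ a data-fidelity term, $R$ an image prior, and $\lambda, \mu > 0$ are weights. The abstract does not state the exact splitting used; presumably the reconstruction step is where the sinogram enters and the prior step is where the diffusion model acts.

```latex
% Generic half-quadratic splitting (HQS); all symbols are illustrative assumptions.
% Requires amsmath.
\begin{align}
  \min_{x}\; & L(x; y) + \lambda R(x)
    && \text{(original problem)} \\
  \min_{x,z}\; & L(x; y) + \lambda R(z) + \frac{\mu}{2}\lVert x - z\rVert_2^2
    && \text{(split with quadratic penalty)} \\
  x^{k+1} &= \arg\min_{x}\; L(x; y) + \frac{\mu}{2}\lVert x - z^{k}\rVert_2^2
    && \text{(reconstruction step)} \\
  z^{k+1} &= \arg\min_{z}\; \lambda R(z) + \frac{\mu}{2}\lVert x^{k+1} - z\rVert_2^2
    && \text{(prior step)}
\end{align}
```

The two sub-problems touch the data term and the prior separately, which is what allows network optimization to be decoupled from the iterative reconstruction.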
DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving
Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
comment: 18 pages, 8 figures
On the Temporality for Sketch Representation Learning
Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.
comment: Preprint submitted to Pattern Recognition Letters
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
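The consensus expert selection step can be pictured with a small sketch. It assumes, purely for illustration, that each draft expert returns a reasoning path plus a short final answer and that agreement is measured by exact match of normalized answers; the abstract does not specify the scoring rule, and the names `normalize` and `select_consensus_paths` are hypothetical.

```python
from collections import Counter

def normalize(answer):
    """Crude answer normalization (assumption): trim whitespace and lowercase."""
    return answer.strip().lower()

def select_consensus_paths(drafts, keep=2):
    """Keep the `keep` drafts whose final answers are shared by the most experts.

    drafts: list of (expert_name, reasoning_path, answer) tuples.
    Ties are broken by original order because sorted() is stable.
    """
    votes = Counter(normalize(answer) for _, _, answer in drafts)
    ranked = sorted(drafts, key=lambda d: votes[normalize(d[2])], reverse=True)
    return ranked[:keep]

drafts = [
    ("draft_a", "Reads the legend, then the 2021 bar.", "42"),
    ("draft_b", "Misreads the axis scale.", "17"),
    ("draft_c", "Locates the label next to the bar.", " 42 "),
]
forwarded = select_consensus_paths(drafts, keep=2)
```

Only the forwarded paths would then be placed in the verdict model's context, which is how the selection reduces the cost of the verdict stage.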
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents. Our code and data are available at https://github.com/OS-Copilot/OS-Sentinel.
comment: work in progress
Spike-EVPR: Deep Spiking Residual Networks with SNN-Tailored Representations for Event-Based Visual Place Recognition
Event cameras are ideal for visual place recognition (VPR) in challenging environments due to their high temporal resolution and high dynamic range. However, existing methods convert sparse events into dense frame-like representations for Artificial Neural Networks (ANNs), ignoring event sparsity and incurring high computational cost. Spiking Neural Networks (SNNs) complement event data through discrete spike signals to enable energy-efficient VPR, but their application is hindered by the lack of effective spike-compatible representations and deep architectures capable of learning discriminative global descriptors. To address these limitations, we propose Spike-EVPR, a directly trained, end-to-end SNN framework tailored for event-based VPR. First, we introduce two complementary event representations, MCS-Tensor and TSS-Tensor, designed to reduce temporal redundancy while preserving essential spatio-temporal cues. Furthermore, we propose a deep spiking residual architecture that effectively aggregates these features to generate robust place descriptors. Extensive experiments on the Brisbane-Event-VPR and DDD20 datasets demonstrate that Spike-EVPR achieves state-of-the-art performance, improving Recall@1 by 7.61% and 13.20%, respectively, while significantly reducing energy consumption.
comment: 8 pages, 6 figures
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
We address Embodied Reference Understanding, the task of predicting the object a person in the scene refers to through pointing gesture and language. This requires multimodal reasoning over text, visual pointing cues, and scene context, yet existing methods often fail to fully exploit visual disambiguation signals. We also observe that while the referent often aligns with the head-to-fingertip direction, in many cases it aligns more closely with the wrist-to-fingertip direction, making a single-line assumption overly limiting. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To fuse their complementary strengths, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble guided by CLIP features. We further incorporate an auxiliary object center prediction head to enhance referent localization. We validate our approach on YouRefIt, achieving 75.0 mAP at 0.25 IoU, alongside state-of-the-art CLIP and C_D scores, and demonstrate its generality on unseen CAESAR and ISL Pointing, showing robust performance across benchmarks.
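A Gaussian ray heatmap of a pointing line can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the ray starts at an origin keypoint (head or wrist), passes through the fingertip, and each pixel's value decays as a Gaussian of its distance to that ray. The function name, the `sigma` parameter, and the pixel coordinates are all hypothetical.

```python
import math

def gaussian_ray_heatmap(width, height, origin, fingertip, sigma=2.0):
    """Return a height x width grid of values in (0, 1], peaking at 1 on the ray.

    origin, fingertip: (x, y) pixel coordinates; the ray extends past the fingertip.
    """
    ox, oy = origin
    dx, dy = fingertip[0] - ox, fingertip[1] - oy
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("origin and fingertip must differ")
    ux, uy = dx / norm, dy / norm  # unit direction of the pointing ray
    heat = []
    for y in range(height):
        row = []
        for x in range(width):
            px, py = x - ox, y - oy
            # Project onto the ray; clamp so pixels behind the origin
            # are measured against the origin itself.
            t = max(0.0, px * ux + py * uy)
            d2 = (px - t * ux) ** 2 + (py - t * uy) ** 2
            row.append(math.exp(-d2 / (2 * sigma ** 2)))
        heat.append(row)
    return heat

# One heatmap per pointing hypothesis, e.g. head-to-fingertip and wrist-to-fingertip.
head_map = gaussian_ray_heatmap(16, 16, origin=(2, 8), fingertip=(6, 8))
wrist_map = gaussian_ray_heatmap(16, 16, origin=(5, 10), fingertip=(6, 8))
```

Two such maps built from different origins generally highlight different image regions, which is the disagreement a dual-model ensemble is meant to resolve.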
CourtMotion: Learning Event-Driven Motion Representations from Skeletal Data for Basketball
This paper presents CourtMotion, a spatiotemporal modeling framework for analyzing and predicting game events and plays as they develop in professional basketball. Anticipating basketball events requires understanding both physical motion patterns and their semantic significance in the context of the game. Traditional approaches that use only player positions fail to capture crucial indicators such as body orientation, defensive stance, or shooting preparation motions. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns, then employs a Transformer architecture with specialized attention mechanisms to model player interactions. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their tactical purposes. Experiments on NBA tracking data demonstrate significant improvements over position-only baselines: 35% reduction in trajectory prediction error compared to state-of-the-art position-based models and consistent performance gains across key basketball analytics tasks. The resulting pretrained model serves as a powerful foundation for multiple downstream tasks, with pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition demonstrating substantial improvements over existing methods.
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
comment: Project Page: https://emma-umm.github.io/emma/
Spec-Gloss Surfels and Normal-Diffuse Priors for Relightable Glossy Objects
Accurate reconstruction and relighting of glossy objects remains a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restrict faithful material recovery and limit relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. A coarse-to-fine environment map optimization accelerates convergence, and negative-only environment map clipping preserves high-dynamic-range specular reflections. Extensive experiments on complex, glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction, delivering substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods.
Guiding WaveMamba with Frequency Maps for Image Debanding
Compression at low bitrates in modern codecs often introduces banding artifacts, especially in smooth regions such as skies. These artifacts degrade visual quality and are common in user-generated content due to repeated transcoding. We propose a banding restoration method that employs the Wavelet State Space Model and a frequency masking map to preserve high-frequency details. Furthermore, we provide a benchmark of open-source banding restoration methods and evaluate their performance on two public banding image datasets. Experimentation on the available datasets suggests that the proposed post-processing approach effectively suppresses banding compared to the state-of-the-art method (a DBI value of 0.082 on BAND-2k) while preserving image textures. Visual inspections of the results confirm this. Code and supplementary material are available at: https://github.com/xinyiW915/Debanding-PCS2025.
comment: 5 pages, 2 figures
Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases
Heart diseases remain the leading cause of mortality worldwide, accounting for approximately 18 million deaths according to the WHO. In particular, heart failure (HF) presses the healthcare industry to develop systems for its early, rapid, and effective prediction. This work presents an automatic system based on a novel framework that combines Modal Decomposition and Masked Autoencoders (MAE) to extend the application from heart disease classification to the more challenging and specific task of heart failure time prediction, which, to the best of the authors' knowledge, has not been addressed previously. The system comprises two stages. The first transforms the data from a database of echocardiography video sequences into a large collection of annotated images compatible with the training phase of both machine learning-based and deep learning-based frameworks. This stage uses the Higher Order Dynamic Mode Decomposition (HODMD) algorithm for both data augmentation and feature extraction. The second stage builds and trains a Vision Transformer (ViT). MAEs based on a combined scheme of self-supervised learning (SSL) and supervised learning, so far barely explored in the literature on heart failure prediction, are adopted to effectively train the ViT from scratch, even with scarce databases. The resulting neural network analyses images from echocardiography sequences in real time to estimate the time at which heart failure will occur. The approach is shown to improve prediction accuracy on scarce databases and to outperform several established ViT and Convolutional Neural Network (CNN) architectures. The source code will be incorporated into the next version release of the ModelFLOWs-app software (https://github.com/modelflows/ModelFLOWs-app).
comment: 43 pages, 7 figures
RGS-DR: Deferred Reflections and Residual Shading in 2D Gaussian Splatting
In this work, we address specular appearance in inverse rendering using 2D Gaussian splatting with deferred shading and argue for a refinement stage to improve specular detail, thereby bridging the gap with reconstruction-only methods. Our pipeline estimates editable material properties and environment illumination while employing a directional residual pass that captures leftover view-dependent effects for further refining novel view synthesis. In contrast to per-Gaussian shading with shortest-axis normals and normal residuals, which tends to result in more noisy geometry and specular appearance, a pixel-deferred surfel formulation with specular residuals yields sharper highlights, cleaner materials, and improved editability. We evaluate our approach on rendering and reconstruction quality on three popular datasets featuring glossy objects, and also demonstrate high-quality relighting and material editing.
From Fibers to Cells: Fourier-Based Registration Enables Virtual Cresyl Violet Staining From 3D Polarized Light Imaging
Comprehensive assessment of the various aspects of the brain's microstructure requires the use of complementary imaging techniques. This includes measuring the spatial distribution of cell bodies (cytoarchitecture) and nerve fibers (myeloarchitecture). The gold standard for cytoarchitectonic analysis is light microscopic imaging of cell-body stained tissue sections. To reveal the 3D orientations of nerve fibers, 3D Polarized Light Imaging (3D-PLI) has been introduced, a method that is label-free and allows subsequent staining of sections after 3D-PLI measurement. By post-staining for cell bodies, a direct link between fiber- and cytoarchitecture can potentially be established in the same section. However, inevitable distortions introduced during the staining process make a costly nonlinear and cross-modal registration necessary in order to study the detailed relationships between cells and fibers in the images. In addition, the complexity of processing histological sections for post-staining only allows for a limited number of such samples. In this work, we take advantage of deep learning methods for image-to-image translation to generate a virtual staining of 3D-PLI that is spatially aligned at the cellular level. We use a supervised setting, building on a unique dataset of brain sections, to which Cresyl violet staining has been applied after 3D-PLI measurement. To ensure high correspondence between both modalities, we address the misalignment of training data using Fourier-based registration. In this way, registration can be efficiently calculated during training for local image patches of target and predicted staining. We demonstrate that the proposed method can predict a Cresyl violet staining from 3D-PLI, resulting in a virtual staining that exhibits plausible patterns of cell organization in gray matter, with larger cell bodies being localized at their expected positions.
comment: Revised version, accepted for publication
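One common form of Fourier-based registration is phase correlation, which recovers a translation between two patches from the phase of their cross-power spectrum. The sketch below is an assumption-laden illustration, not the authors' method: it uses a naive pure-Python 2D DFT (adequate only for tiny patches), handles integer circular shifts only, and the names `dft2` and `estimate_shift` are hypothetical.

```python
import cmath
import math
import random

def dft2(img, inverse=False):
    """Naive 2D discrete Fourier transform of a small list-of-lists image."""
    h, w = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2 * math.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out

def estimate_shift(reference, moved, eps=1e-12):
    """Return (dy, dx) with moved[y][x] == reference[(y - dy) % h][(x - dx) % w].

    Shifts are reported modulo the patch size, so a result above h/2 or w/2
    corresponds to a negative displacement.
    """
    h, w = len(reference), len(reference[0])
    f_ref, f_mov = dft2(reference), dft2(moved)
    # Normalized cross-power spectrum: keep the phase, discard the magnitude.
    spectrum = []
    for u in range(h):
        row = []
        for v in range(w):
            c = f_ref[u][v].conjugate() * f_mov[u][v]
            row.append(c / (abs(c) + eps))
        spectrum.append(row)
    corr = dft2(spectrum, inverse=True)
    # The correlation surface peaks at the displacement.
    return max(((y, x) for y in range(h) for x in range(w)),
               key=lambda p: corr[p[0]][p[1]].real)

rng = random.Random(0)
patch = [[rng.random() for _ in range(8)] for _ in range(8)]
shifted = [[patch[(y - 2) % 8][(x - 3) % 8] for x in range(8)] for y in range(8)]
shift = estimate_shift(patch, shifted)
```

Because the whole estimate is a handful of transforms and element-wise products, this kind of registration is cheap enough to run on local patches inside a training loop; a practical version would use an FFT library rather than the naive transform shown here.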
Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models
We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
AI-powered virtual tissues from spatial proteomics for clinical diagnostics and biomedical discovery
Spatial proteomics technologies have transformed our understanding of complex tissue architecture in cancer but present unique challenges for computational analysis. Each study uses a different marker panel and protocol, and most methods are tailored to single cohorts, which limits knowledge transfer and robust biomarker discovery. Here we present Virtual Tissues (VirTues), a general-purpose foundation model for spatial proteomics that learns marker-aware, multi-scale representations of proteins, cells, niches and tissues directly from multiplex imaging data. From a single pretrained backbone, VirTues supports marker reconstruction, cell typing and niche annotation, spatial biomarker discovery, and patient stratification, including zero-shot annotation across heterogeneous panels and datasets. In triple-negative breast cancer, VirTues-derived biomarkers predict anti-PD-L1 chemo-immunotherapy response and stratify disease-free survival in an independent cohort, outperforming state-of-the-art biomarkers derived from the same datasets and current clinical stratification schemes.
comment: 25 pages, 5 figures
Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery
Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, when standing on the darker side and applying a flash towards the brighter side, distinct reflections appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing reflection and glass surface features for accurate glass surface detection. To train our network, we also construct a dataset consisting of 3.3K no-flash and flash image pairs captured from various scenes with corresponding ground truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code, model, and dataset will be available upon acceptance of the manuscript.
comment: 18 pages, 17 figures
Mixture of Contexts for Long Video Generation
Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
comment: Project page: https://primecai.github.io/moc/
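The routing step described in the Mixture of Contexts abstract (each query picks a few informative chunks, always keeps mandatory anchors, and never looks ahead) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: chunks are summarized by the mean of their key vectors, relevance is a plain dot product, and the names `route_query`, `chunk_keys`, and `anchors` are hypothetical.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one chunk descriptor."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_query(query, chunk_keys, query_chunk, top_k, anchors=()):
    """Return the sorted chunk indices one query is allowed to attend to.

    chunk_keys:  list of chunks, each a list of key vectors.
    query_chunk: index of the chunk that contains the query (its local window).
    anchors:     chunk indices that are always attended (e.g. a caption chunk);
                 assumed to precede query_chunk.
    Causal routing: only chunks strictly before query_chunk are candidates.
    """
    candidates = [i for i in range(query_chunk) if i not in anchors]
    scores = {i: dot(query, mean_pool(chunk_keys[i])) for i in candidates}
    chosen = sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(set(chosen) | set(anchors) | {query_chunk})
```

With `top_k` fixed, the number of attended chunks per query stays constant as the sequence grows, which is one way to read the near-linear scaling mentioned in the abstract.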
Evaluating and Preserving High-level Fidelity in Super-Resolution
Recent image Super-Resolution (SR) models are achieving impressive effects in reconstructing details and delivering visually pleasant outputs. However, the overpowering generative ability can sometimes hallucinate and thus change the image content despite gaining high visual quality. This type of high-level change can be easily identified by humans, yet it is not well captured by existing low-level image quality metrics. In this paper, we establish the importance of measuring high-level fidelity for SR models as a complementary criterion to reveal the reliability of generative SR models. We construct the first annotated dataset with fidelity scores from different SR models, and evaluate how state-of-the-art (SOTA) SR models actually perform in preserving high-level fidelity. Based on the dataset, we then analyze how existing image quality metrics correlate with fidelity measurement, and further show that this high-level task can be better addressed by foundation models. Finally, by fine-tuning SR models based on our fidelity feedback, we show that both semantic fidelity and perceptual quality can be improved, demonstrating the potential value of our proposed criteria, both in model evaluation and optimization. We will release the dataset, code, and models upon acceptance.
You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction WACV26
Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
comment: Accepted to WACV26. 12 pages, 8 tables, 5 figures
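The second stage of nlg2choice scores candidate choices, and the abstract mentions an early stopping method for throughput. The sketch below shows one possible form of early stopping and is not taken from the paper: because token log-probabilities are never positive, a running total that drops below the best complete score can be abandoned safely. The function `pick_choice` and the callback `token_logprob` are hypothetical stand-ins for a real constrained-decoding call.

```python
def pick_choice(choices, token_logprob):
    """Pick the choice with the highest total log-probability.

    choices: dict mapping a label to its list of tokens.
    token_logprob(prefix, token): log-probability (<= 0) of `token` given the
        already scored tokens `prefix`; a stand-in for a language model call.
    Returns (best_label, number_of_token_evaluations).
    """
    best_label, best_score, calls = None, float("-inf"), 0
    for label, tokens in choices.items():
        total, pruned = 0.0, False
        for i, tok in enumerate(tokens):
            total += token_logprob(tokens[:i], tok)
            calls += 1
            # Log-probs are non-positive, so `total` can only decrease:
            # once it is below the best score, this choice cannot win.
            if total < best_score:
                pruned = True
                break
        if not pruned:
            best_label, best_score = label, total
    return best_label, calls
```

With hundreds or thousands of fine-grained labels, most candidates would be dropped after a token or two under this scheme, so the saving grows with the size of the choice set.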
Beyond accuracy: quantifying the reliability of Multiple Instance Learning for Whole Slide Image classification
Machine learning models have become integral to many fields, but their reliability, defined as producing dependable, trustworthy, and domain-consistent predictions, remains a critical concern. Multiple Instance Learning (MIL) models designed for Whole Slide Image (WSI) classification in computational pathology are rarely evaluated in terms of reliability, leaving a key gap in understanding their suitability for high-stakes applications like clinical decision-making. In this paper, we address this gap by introducing three quantitative metrics for reliability assessment and applying them to several widely used MIL architectures across three region-wise annotated pathology datasets. Our findings indicate that the mean pooling instance (MEAN-POOL-INS) model demonstrates superior reliability compared to other networks, despite its simple architectural design and computational efficiency. These findings underscore the need for reliability evaluation alongside predictive performance in MIL models and establish MEAN-POOL-INS as a strong, trustworthy baseline for future research.
ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation
Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.
comment: 13 pages, 14 figures, 3 tables
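The ConsDreamer abstract describes a similarity-based partial order loss that aligns cosine similarities with azimuth relationships. One plausible reading, sketched below purely as an assumption, is a hinge penalty: a view that is closer in azimuth to a reference view should be at least as similar to it as a view that is farther away. The exact formulation in the paper may differ; `partial_order_loss`, `azimuth_gap`, and `margin` are hypothetical names.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def azimuth_gap(a, b):
    """Smallest absolute angle between two azimuths, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def partial_order_loss(ref_feat, ref_az, feats, azimuths, margin=0.0):
    """Penalize pairs where a farther view is more similar to the reference."""
    sims = [cosine(ref_feat, f) for f in feats]
    gaps = [azimuth_gap(ref_az, a) for a in azimuths]
    loss = 0.0
    for i in range(len(feats)):
        for j in range(len(feats)):
            if gaps[i] < gaps[j]:  # view i is nearer, so it should rank higher
                loss += max(0.0, sims[j] - sims[i] + margin)
    return loss
```

The loss is zero whenever the similarity ranking already agrees with the azimuth ordering, so it only constrains the order and leaves the absolute similarity values free.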
MeshRipple: Structured Autoregressive Generation of Artist-Meshes
Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
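The idea of ordering faces so that generation expands outward from a frontier can be illustrated with an ordinary breadth-first traversal over face adjacency. This is a simplified sketch under stated assumptions and not MeshRipple's tokenizer: faces are triangles given as vertex-index triples, two faces are adjacent when they share an edge, and the helper names are hypothetical.

```python
from collections import deque


def face_edges(face):
    """Undirected edges of a triangle, each as a sorted vertex pair."""
    a, b, c = face
    return [tuple(sorted(e)) for e in ((a, b), (b, c), (c, a))]


def bfs_face_order(faces, start=0):
    """Order faces by breadth-first expansion across shared edges."""
    if not faces:
        return []
    edge_to_faces = {}
    for idx, face in enumerate(faces):
        for e in face_edges(face):
            edge_to_faces.setdefault(e, []).append(idx)

    order, seen = [], set()
    # After the start face, restart from the lowest unseen face so that
    # disconnected components are also covered.
    for seed in [start] + list(range(len(faces))):
        if seed in seen:
            continue
        seen.add(seed)
        frontier = deque([seed])
        while frontier:
            f = frontier.popleft()
            order.append(f)
            for e in face_edges(faces[f]):
                for nb in edge_to_faces[e]:
                    if nb not in seen:
                        seen.add(nb)
                        frontier.append(nb)
    return order
```

In this ordering every face except a component's seed appears after at least one edge-adjacent face, which is the property that keeps surface growth connected.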
Enabling Validation for Robust Few-Shot Recognition
Few-Shot Recognition (FSR) tackles classification tasks by training with minimal task-specific labeled data. Prevailing methods adapt or finetune a pretrained Vision-Language Model (VLM) and augment the scarce training data by retrieving task-relevant but noisy samples from open data sources. The finetuned VLM generalizes reasonably well to the task-specific in-distribution (ID) test data but struggles with out-of-distribution (OOD) test data. This motivates our study of robust FSR with VLM finetuning. The core challenge of FSR is data scarcity, extending beyond limited training data to a complete lack of validation data. We identify a key paradox as a potential solution: repurposing the retrieved open data for validation. As such retrieved data are inherently OOD compared with the task-specific ID training data, finetuned VLMs yield degraded performance on the retrieved data. This causes the validation logic to favor the pretrained model without any finetuning, hindering improvements with respect to generalization. To resolve this dilemma, we introduce a novel validation strategy that harmonizes performance gain and degradation on the few-shot ID data and the retrieved data, respectively. Our validation enables parameter selection for partial finetuning and checkpoint selection, mitigating overfitting and improving test-data generalization. We unify this strategy with robust learning into a cohesive framework: Validation-Enabled Stage-wise Tuning (VEST). Extensive experiments on the established ImageNet OOD benchmarks show that VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.
comment: Project website: https://hannawang09.github.io/projects/vest/
A Data-driven Typology of Vision Models from Integrated Representational Metrics
Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.
MT-Depth: Multi-task Instance feature analysis for the Depth Completion
Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower Root Mean Squared Error (RMSE) compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive Mean Absolute Error (MAE). Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.
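Since the MT-Depth abstract reports a lower RMSE alongside a competitive MAE, a small worked example of the two metrics may help. The sketch below assumes the common convention that pixels with ground-truth depth of 0 are invalid and skipped; the function name `depth_errors` is hypothetical and the evaluation protocol of the paper is not specified here.

```python
import math


def depth_errors(pred, target):
    """Return (rmse, mae) over pixels with a valid ground-truth depth (> 0)."""
    diffs = [p - t for p, t in zip(pred, target) if t > 0]
    if not diffs:
        raise ValueError("no valid ground-truth pixels")
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


# Two predictions with the same MAE but different RMSE:
target = [10.0, 10.0, 10.0, 10.0]
one_large_error = depth_errors([10.0, 10.0, 10.0, 14.0], target)    # (2.0, 1.0)
many_small_errors = depth_errors([11.0, 11.0, 11.0, 11.0], target)  # (1.0, 1.0)
```

Because RMSE squares each residual, it is more sensitive to a few large errors than MAE, so an RMSE gain at similar MAE is consistent with fewer large outliers, for example near object boundaries.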
HuPrior3R: Incorporating Human Priors for Better 3D Dynamic Reconstruction from Monocular Videos
Monocular dynamic video reconstruction faces significant challenges in dynamic human scenes due to geometric inconsistencies and resolution degradation issues. Existing methods lack 3D human structural understanding, producing geometrically inconsistent results with distorted limb proportions and unnatural human-object fusion, while memory-constrained downsampling causes human boundary drift toward background geometry. To address these limitations, we propose to incorporate hybrid geometric priors that combine SMPL human body models with monocular depth estimation. Our approach leverages structured human priors to maintain surface consistency while capturing fine-grained geometric details in human regions. We introduce HuPrior3R, featuring a hierarchical pipeline with refinement components that processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. The method integrates SMPL priors through a Feature Fusion Module to ensure geometrically plausible reconstruction while preserving fine-grained human boundaries. Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction.
CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics
Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners as well as natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
comment: Please visit our project page at https://cmlab-korea.github.io/CHIMERA/
TrajMoE: Scene-Adaptive Trajectory Planning with Mixture of Experts and Reinforcement Learning
Current autonomous driving systems often favor end-to-end frameworks, which take sensor inputs like images and learn to map them into trajectory space via neural networks. Previous work has demonstrated that models can achieve better planning performance when provided with a prior distribution of possible trajectories. However, these approaches often overlook two critical aspects: 1) The appropriate trajectory prior can vary significantly across different driving scenarios. 2) Their trajectory evaluation mechanism lacks policy-driven refinement, remaining constrained by the limitations of one-stage supervised training. To address these issues, we explore improvements in two key areas. For problem 1, we employ MoE to apply different trajectory priors tailored to different scenarios. For problem 2, we utilize Reinforcement Learning to fine-tune the trajectory scoring mechanism. Additionally, we integrate models with different perception backbones to enhance perceptual features. Our integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place.
Learning an Ensemble Token from Task-driven Priors in Facial Analysis
Facial analysis exhibits task-specific feature variations. While Convolutional Neural Networks (CNNs) have enabled the fine-grained representation of spatial information, Vision Transformers (ViTs) have facilitated the representation of semantic information at the patch level. Although backbone architectures have advanced over the past decade, combining high-fidelity models often incurs additional computational cost for feature representation. In this work, we introduce KT-Adapter, a novel methodology for learning a knowledge token that enables the integration of high-fidelity feature representations in a computationally efficient manner. Specifically, we propose a robust prior unification learning method that generates a knowledge token within a self-attention mechanism, sharing the mutual information across the pre-trained encoders. This knowledge token approach offers high efficiency with negligible computational cost. Our results show improved performance across facial analysis, with statistically significant enhancements observed in the feature representations.
comment: 10 pages, 7 figures, 7 tables
From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images
City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation. Project page can be found at https://pku-vcl-geometry.github.io/Orbit2Ground/.
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.
Towards Explainable Bilingual Multimodal Misinformation Detection and Localization
The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.
IRPO: Boosting Image Restoration via Post-training GRPO
Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains rarely explored. Existing image restoration (IR) methods rely on pixel-level hard-fitting to ground-truth images, struggling with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first explore a data formulation principle for the low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we model a reward-level criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
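GRPO-style post-training compares several sampled outputs for the same input and scores each one relative to its group. The sketch below shows that normalization step together with a weighted sum of the three reward components named in the abstract. The weights, the use of the population standard deviation, and both function names are assumptions for illustration; the abstract does not state how IRPO combines its rewards.

```python
import statistics


def combined_reward(general, expert, restoration, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward components (weights are hypothetical)."""
    wg, we, wr = weights
    return wg * general + we * expert + wr * restoration


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, rewards of 1.0, 2.0, and 3.0 within one group yield a negative, a zero, and a positive advantage, so only the ranking and spread inside the group drive the update, not the absolute reward scale.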
ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation
Generating high-fidelity upper-body 3D avatars from a one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa
comment: Project page: https://lhyfst.github.io/visa
50 Years of Automated Face Recognition
Over the past five decades, automated face recognition (FR) has progressed from handcrafted geometric and statistical approaches to advanced deep learning architectures that now approach, and in many cases exceed, human performance. This paper traces the historical and technological evolution of FR, encompassing early algorithmic paradigms through to contemporary neural systems trained on extensive real and synthetically generated datasets. We examine pivotal innovations that have driven this progression, including advances in dataset construction, loss function formulation, network architecture design, and feature fusion strategies. Furthermore, we analyze the relationship between data scale, diversity, and model generalization, highlighting how dataset expansion correlates with benchmark performance gains. Recent systems have achieved near-perfect large-scale identification accuracy, with the leading algorithm in the latest NIST FRTE 1:N benchmark reporting an FNIR of 0.15 percent at an FPIR of 0.001 on a gallery of over 10 million identities. We delineate key open problems and emerging directions, including scalable training, multi-modal fusion, synthetic data, and interpretable recognition frameworks.
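The FNIR and FPIR figures quoted above are threshold-dependent error rates for 1:N identification. A simplified sketch of how they can be computed follows; it assumes a single score threshold and ignores rank limits, and it is not the NIST evaluation protocol. The function name and both input lists are hypothetical.

```python
def fnir_fpir(mated_scores, nonmated_top_scores, threshold):
    """Simplified 1:N identification error rates at one score threshold.

    mated_scores:        similarity of each enrolled probe to its true mate.
    nonmated_top_scores: highest gallery similarity for each probe whose
                         identity is not enrolled in the gallery.
    FNIR: share of mated searches whose true mate falls below the threshold.
    FPIR: share of non-mated searches that return any candidate at or above it.
    """
    fnir = sum(s < threshold for s in mated_scores) / len(mated_scores)
    fpir = sum(s >= threshold for s in nonmated_top_scores) / len(nonmated_top_scores)
    return fnir, fpir
```

Raising the threshold lowers FPIR and raises FNIR, which is why results are reported as an FNIR at a fixed FPIR operating point.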
Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Code will be released upon acceptance.
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
Modern large vision-language models (LVLMs) convert each input image into a large set of tokens that far outnumber the text tokens. Although this improves visual perception, it also introduces severe image token redundancy. Because image tokens contain sparse information, many contribute little to reasoning but greatly increase inference cost. Recent image token pruning methods address this issue by identifying important tokens and removing the rest. These methods improve efficiency with only small performance drops. However, most of them focus on single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is higher and efficiency is more important. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and lead to unstable performance. When existing pruning methods are applied in this setting, they cause large accuracy drops, which exposes a clear gap and the need for new approaches. To address this, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method designed for multimodal ICL. CATP uses two stages of progressive pruning that fully reflect the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP achieves an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, clearly outperforming all baselines. At the same time, it improves efficiency by reducing inference latency by an average of 10.78%. CATP strengthens the practical value of multimodal ICL and lays the foundation for future progress in interleaved image-text settings.
comment: 14 pages, 12 figures, 6 tables
Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey and Benchmark
Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks, showcasing exceptional performance. However, as these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands. To address these challenges, parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters. This paper presents a comprehensive survey of the latest advancements in the visual PEFT field, systematically reviewing current methodologies and categorizing them into four primary categories: addition-based, partial-based, unified-based, and multi-task tuning. In addition, this paper offers an in-depth analysis of widely used visual datasets and real-world applications where PEFT methods have been successfully applied. Furthermore, this paper introduces the V-PEFT Bench, a unified benchmark designed to standardize the evaluation of PEFT methods across a diverse set of vision tasks, ensuring consistency and fairness in comparison. Finally, the paper outlines potential directions for future research to propel advances in the PEFT field. A comprehensive collection of resources is available at https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning.
comment: Submitted to IJCV
COREA: Coarse-to-Fine 3D Representation Alignment Between Relightable 3D Gaussians and SDF via Bidirectional 3D-to-3D Supervision
We present COREA, the first unified framework that jointly learns relightable 3D Gaussians and a Signed Distance Field (SDF) for accurate geometry reconstruction and faithful relighting. While recent 3D Gaussian Splatting (3DGS) methods have extended toward mesh reconstruction and physically-based rendering (PBR), their geometry is still learned from 2D renderings, leading to coarse surfaces and unreliable BRDF-lighting decomposition. To address these limitations, COREA introduces a coarse-to-fine bidirectional 3D-to-3D alignment strategy that allows geometric signals to be learned directly in 3D space. Within this strategy, depth provides coarse alignment between the two representations, while depth gradients and normals refine fine-scale structure, and the resulting geometry supports stable BRDF-lighting decomposition. A density-control mechanism further stabilizes Gaussian growth, balancing geometric fidelity with memory efficiency. Experiments on standard benchmarks demonstrate that COREA achieves superior performance in novel-view synthesis, mesh reconstruction, and PBR within a unified framework.
comment: Project page: https://cau-vilab.github.io/COREA/
Zo3T: Zero-Shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training
Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations. First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects and optimizes ephemeral LoRA adapters into the denoising network alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
Fine-grained Spatiotemporal Grounding on Egocentric Videos ICCV 2025
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask .
comment: Accepted by ICCV 2025
Image and Video Processing
Learned iterative networks: An operator learning perspective
Learned image reconstruction has become a pillar in computational imaging and inverse problems. Among the most successful approaches are learned iterative networks, which are formulated by unrolling classical iterative optimisation algorithms for solving variational problems. While the underlying algorithm is usually formulated in the functional analytic setting, learned approaches are often viewed as purely discrete. In this chapter we present a unified operator view for learned iterative networks. Specifically, we formulate a learned reconstruction operator, defining how to compute, and separately the learning problem, which defines what to compute. In this setting we present common approaches and show that many approaches are closely related in their core. We review linear as well as nonlinear inverse problems in this framework and present a short numerical study to conclude.
Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation ICRA 2025
We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.
comment: Published and Presented at 3rd Workshop on Human-Centric Multilateral Teleoperation in ICRA 2025
Geometric-Stochastic Multimodal Deep Learning for Predictive Modeling of SUDEP and Stroke Vulnerability
Sudden Unexpected Death in Epilepsy (SUDEP) and acute ischemic stroke are life-threatening conditions involving complex interactions across cortical, brainstem, and autonomic systems. We present a unified geometric-stochastic multimodal deep learning framework that integrates EEG, ECG, respiration, SpO2, EMG, and fMRI signals to model SUDEP and stroke vulnerability. The approach combines Riemannian manifold embeddings, Lie-group invariant feature representations, fractional stochastic dynamics, Hamiltonian energy-flow modeling, and cross-modal attention mechanisms. Stroke propagation is modeled using fractional epidemic diffusion over structural brain graphs. Experiments on the MULTI-CLARID dataset demonstrate improved predictive accuracy and interpretable biomarkers derived from manifold curvature, fractional memory indices, attention entropy, and diffusion centrality. The proposed framework provides a mathematically principled foundation for early detection, risk stratification, and interpretable multimodal modeling in neural-autonomic disorders.
comment: 7 pages, 3 figures
Tumor-anchored deep feature random forests for out-of-distribution detection in lung cancer segmentation
Accurate segmentation of cancerous lesions from 3D computed tomography (CT) scans is essential for automated treatment planning and response assessment. However, even state-of-the-art models combining self-supervised learning (SSL) pretrained transformers with convolutional decoders are susceptible to out-of-distribution (OOD) inputs, generating confidently incorrect tumor segmentations, posing risks for safe clinical deployment. Existing logit-based methods suffer from task-specific model biases, while architectural enhancements to explicitly detect OOD increase parameters and computational costs. Hence, we introduce a plug-and-play and lightweight post-hoc random forests-based OOD detection framework called RF-Deep that leverages deep features with limited outlier exposure. RF-Deep enhances generalization to imaging variations by repurposing the hierarchical features from the pretrained-then-finetuned backbone encoder, providing task-relevant OOD detection by extracting the features from multiple regions of interest anchored to the predicted tumor segmentations. Hence, it scales to images of varying fields-of-view. We compared RF-Deep against existing OOD detection methods using 1,916 CT scans across near-OOD (pulmonary embolism, negative COVID-19) and far-OOD (kidney cancer, healthy pancreas) datasets. RF-Deep achieved AUROC > 93.50 for the challenging near-OOD datasets and near-perfect detection (AUROC > 99.00) for the far-OOD datasets, substantially outperforming logit-based and radiomics approaches. RF-Deep maintained similar performance consistency across networks of different depths and pretraining strategies, demonstrating its effectiveness as a lightweight, architecture-agnostic approach to enhance the reliability of tumor segmentation from CT volumes.
FlowSteer: Conditioning Flow Field for Consistent Image Restoration
Flow-based text-to-image (T2I) models excel at prompt-driven image generation, but falter on Image Restoration (IR), often "drifting away" from being faithful to the measurement. Prior work mitigates this drift with data-specific flows or task-specific adapters that are computationally heavy and not scalable across tasks. This raises the question "Can't we efficiently manipulate the existing generative capabilities of a flow model?" To this end, we introduce FlowSteer (FS), an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow's implicit guidance with explicit measurement constraints. Across super-resolution, deblurring, denoising, and colorization, FS improves measurement consistency and identity preservation in a strictly zero-shot setting: no retrained models, no adapters. We show how the nature of flow models and their sensitivities to noise inform the design of such a scheduler. FlowSteer, although simple, achieves a higher fidelity of reconstructed images, while leveraging the rich generative priors of flow models.
Causal Attribution of Model Performance Gaps in Medical Imaging Under Distribution Shifts
Deep learning models for medical image segmentation suffer significant performance drops due to distribution shifts, but the causal mechanisms behind these drops remain poorly understood. We extend causal attribution frameworks to high-dimensional segmentation tasks, quantifying how acquisition protocols and annotation variability independently contribute to performance degradation. We model the data-generating process through a causal graph and employ Shapley values to fairly attribute performance changes to individual mechanisms. Our framework addresses unique challenges in medical imaging: high-dimensional outputs, limited samples, and complex mechanism interactions. Validation on multiple sclerosis (MS) lesion segmentation across 4 centers and 7 annotators reveals context-dependent failure modes: annotation protocol shifts dominate when crossing annotators (7.4% $\pm$ 8.9% DSC attribution), while acquisition shifts dominate when crossing imaging centers (6.5% $\pm$ 9.1%). This mechanism-specific quantification enables practitioners to prioritize targeted interventions based on deployment context.
comment: Medical Imaging meets EurIPS Workshop: MedEurIPS 2025
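As a rough illustration of the Shapley-based attribution described in the abstract above, the sketch below computes exact Shapley values for two mechanisms by averaging marginal contributions over all orderings. The mechanism names follow the abstract, but the `dsc_drop` table and the helper `shapley_values` are hypothetical; the numbers are illustrative only and are not results from the paper, and the authors' actual estimator may differ.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        active = frozenset()
        for p in order:
            with_p = active | {p}
            # Marginal contribution of p given the mechanisms already shifted.
            totals[p] += value(with_p) - value(active)
            active = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical DSC drop (percentage points) when a subset of mechanisms is
# switched from the source to the target distribution. Illustrative only.
dsc_drop = {
    frozenset(): 0.0,
    frozenset({"acquisition"}): 5.0,
    frozenset({"annotation"}): 6.0,
    frozenset({"acquisition", "annotation"}): 13.0,
}

phi = shapley_values(["acquisition", "annotation"], lambda s: dsc_drop[frozenset(s)])
print(phi)
```

The attributions sum to the total drop for the full shift (the efficiency property), which is what makes the split between mechanisms "fair" in the Shapley sense.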
DermETAS-SNA LLM: A Dermatology Focused Evolutionary Transformer Architecture Search with StackNet Augmented LLM Assistant
Our work introduces the DermETAS-SNA LLM Assistant that integrates Dermatology-focused Evolutionary Transformer Architecture Search with StackNet Augmented LLM. The assistant dynamically learns skin-disease classifiers and provides medically informed descriptions to facilitate clinician-patient interpretation. Contributions include: (1) Developed an ETAS framework on the SKINCON dataset to optimize a Vision Transformer (ViT) tailored for dermatological feature representation and then fine-tuned binary classifiers for each of the 23 skin disease categories in the DermNet dataset to enhance classification performance; (2) Designed a StackNet architecture that integrates multiple fine-tuned binary ViT classifiers to enhance predictive robustness and mitigate class imbalance issues; (3) Implemented a RAG pipeline, termed Diagnostic Explanation and Retrieval Model for Dermatology, which harnesses the capabilities of the Google Gemini 2.5 Pro LLM architecture to generate personalized, contextually informed diagnostic descriptions and explanations for patients, leveraging a repository of verified dermatological materials; (4) Performed extensive experimental evaluations on 23 skin disease categories to demonstrate performance increase, achieving an overall F1-score of 56.30% that surpasses SkinGPT-4 (48.51%) by a considerable margin, representing a relative performance increase of 16.06%; (5) Conducted a domain-expert evaluation, with eight licensed medical doctors, of the clinical responses generated by our AI assistant for seven dermatological conditions, with results showing a 92% agreement rate with the assessments provided by our AI assistant; (6) Created a proof-of-concept prototype that fully integrates our DermETAS-SNA LLM into our AI assistant to demonstrate its practical feasibility for real-world clinical and educational applications.
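The 16.06% figure in contribution (4) is consistent with a relative, not absolute, improvement over SkinGPT-4. A quick check using only the two F1-scores quoted in the abstract (the helper name `relative_increase` is ours):

```python
def relative_increase(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


ours_f1 = 56.30      # overall F1 quoted for DermETAS-SNA (percent)
skingpt4_f1 = 48.51  # overall F1 quoted for SkinGPT-4 (percent)

absolute_gain = ours_f1 - skingpt4_f1
relative_gain = relative_increase(ours_f1, skingpt4_f1)
print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain:.2f}%")
```

The absolute gap is 7.79 F1 points; dividing by the 48.51 baseline gives the quoted 16.06%.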
Rethinking Few-Shot Image Fusion: Granular Ball Priors Enable General-Purpose Deep Fusion
In image fusion tasks, the absence of real fused images as priors forces most deep learning approaches to rely on large-scale paired datasets to extract global weighting features or to generate pseudo-supervised images through algorithmic constructions. Unlike previous methods, this work re-examines prior-guided learning under few-shot conditions by introducing rough set theory. We regard the traditional algorithm as a prior generator, while the network re-infers and adaptively optimizes the prior through a dynamic loss function, reducing the inference burden of the network and enabling effective few-shot learning. To provide the prior, we propose the Granular Ball Pixel Computation (GBPC) algorithm. GBPC models pixel pairs in a luminance subspace using meta-granular balls and mines intra-ball information at multiple granular levels. At the fine-grained level, sliding granular balls assign adaptive weights to individual pixels to produce pixel-level prior fusion. At the coarse-grained level, the algorithm performs split computation within a single image to estimate positive and boundary domain distributions, enabling modality awareness and prior confidence estimation, which dynamically guide the loss weighting. The network and the algorithmic prior are coupled through the loss function to form an integrated framework. Thanks to the dynamic weighting mechanism, the network can adaptively adjust to different priors during training, enhancing its perception and fusion capability across modalities. We name this framework GBFF (Granular Ball Fusion Framework). Experiments on four fusion tasks demonstrate that even with only ten training image pairs per task, GBFF achieves superior performance in both visual quality and model compactness. Code is available at: https://github.com/DMinjie/GBFF
The Missing Point in Vision Transformers for Universal Image Segmentation
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.
PET Image Reconstruction Using Deep Diffusion Image Prior
Diffusion models have shown great promise in medical image denoising and reconstruction, but their application to Positron Emission Tomography (PET) imaging remains limited by tracer-specific contrast variability and high computational demands. In this work, we proposed an anatomical prior-guided PET image reconstruction method based on diffusion models, inspired by the deep diffusion image prior (DDIP) framework. The proposed method alternated between diffusion sampling and model fine-tuning guided by the PET sinogram, enabling the reconstruction of high-quality images from various PET tracers using a score function pretrained on a dataset of another tracer. To improve computational efficiency, the half-quadratic splitting (HQS) algorithm was adopted to decouple network optimization from iterative PET reconstruction. The proposed method was evaluated using one simulation and two clinical datasets. For the simulation study, a model pretrained on [$^{18}$F]FDG data was tested on [$^{18}$F]FDG data and amyloid-negative PET data to assess out-of-distribution (OOD) performance. For the clinical-data validation, ten low-dose [$^{18}$F]FDG datasets and one [$^{18}$F]Florbetapir dataset were tested on a model pretrained on data from another tracer. Experiment results show that the proposed PET reconstruction method can generalize robustly across tracer distributions and scanner types, providing an efficient and versatile reconstruction framework for low-dose PET imaging.
comment: 11 pages, 12 figures
Guiding WaveMamba with Frequency Maps for Image Debanding
Compression at low bitrates in modern codecs often introduces banding artifacts, especially in smooth regions such as skies. These artifacts degrade visual quality and are common in user-generated content due to repeated transcoding. We propose a banding restoration method that employs the Wavelet State Space Model and a frequency masking map to preserve high-frequency details. Furthermore, we provide a benchmark of open-source banding restoration methods and evaluate their performance on two public banding image datasets. Experimentation on the available datasets suggests that the proposed post-processing approach effectively suppresses banding compared to the state-of-the-art method (a DBI value of 0.082 on BAND-2k) while preserving image textures. Visual inspections of the results confirm this. Code and supplementary material are available at: https://github.com/xinyiW915/Debanding-PCS2025.
comment: 5 pages, 2 figures
Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases
Heart diseases remain the leading cause of mortality worldwide, accounting for approximately 18 million deaths according to the WHO. In particular, heart failure (HF) presses the healthcare industry to develop systems for its early, rapid, and effective prediction. This work presents an automatic system based on a novel framework which combines Modal Decomposition and Masked Autoencoders (MAE) to extend the application from heart disease classification to the more challenging and specific task of heart failure time prediction, not previously addressed to the best of the authors' knowledge. This system comprises two stages. The first one transforms the data from a database of echocardiography video sequences into a large collection of annotated images compatible with the training phase of both machine learning-based and deep learning-based frameworks. This stage includes the use of the Higher Order Dynamic Mode Decomposition (HODMD) algorithm for both data augmentation and feature extraction. The second stage builds and trains a Vision Transformer (ViT). MAEs based on a combined scheme of self-supervised (SSL) and supervised learning, so far barely explored in the literature on heart failure prediction, are adopted to effectively train the ViT from scratch, even with scarce databases. The designed neural network analyses images from echocardiography sequences in real time to estimate the time at which heart failure will occur. This approach is shown to improve prediction accuracy from scarce databases and to be superior to several established ViT and Convolutional Neural Network (CNN) architectures. The source code will be incorporated into the next version release of the ModelFLOWs-app software (https://github.com/modelflows/ModelFLOWs-app).
comment: 43 pages, 7 figures
From Fibers to Cells: Fourier-Based Registration Enables Virtual Cresyl Violet Staining From 3D Polarized Light Imaging
Comprehensive assessment of the various aspects of the brain's microstructure requires the use of complementary imaging techniques. This includes measuring the spatial distribution of cell bodies (cytoarchitecture) and nerve fibers (myeloarchitecture). The gold standard for cytoarchitectonic analysis is light microscopic imaging of cell-body stained tissue sections. To reveal the 3D orientations of nerve fibers, 3D Polarized Light Imaging (3D-PLI) has been introduced, a method that is label-free and allows subsequent staining of sections after 3D-PLI measurement. By post-staining for cell bodies, a direct link between fiber- and cytoarchitecture can potentially be established in the same section. However, inevitable distortions introduced during the staining process make a costly nonlinear and cross-modal registration necessary in order to study the detailed relationships between cells and fibers in the images. In addition, the complexity of processing histological sections for post-staining only allows for a limited number of such samples. In this work, we take advantage of deep learning methods for image-to-image translation to generate a virtual staining of 3D-PLI that is spatially aligned at the cellular level. We use a supervised setting, building on a unique dataset of brain sections, to which Cresyl violet staining has been applied after 3D-PLI measurement. To ensure high correspondence between both modalities, we address the misalignment of training data using Fourier-based registration. In this way, registration can be efficiently calculated during training for local image patches of target and predicted staining. We demonstrate that the proposed method can predict a Cresyl violet staining from 3D-PLI, resulting in a virtual staining that exhibits plausible patterns of cell organization in gray matter, with larger cell bodies being localized at their expected positions.
comment: Revised version, accepted for publication
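The abstract above does not say which Fourier-based registration variant is used. As one common example of the idea, the sketch below estimates a pure translation by phase correlation, shown in 1D with a naive DFT so that it needs only the standard library. The function names and the test signal are hypothetical, and the paper's patch-wise 2D procedure may differ.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def phase_correlation_shift(reference, moving):
    """Estimate the circular shift s with moving[t] == reference[(t - s) % n]."""
    f_ref, f_mov = dft(reference), dft(moving)
    cross = []
    for a, b in zip(f_ref, f_mov):
        c = b * a.conjugate()
        mag = abs(c)
        # Keep only the phase; skip bins with (near-)zero energy.
        cross.append(c / mag if mag > 1e-12 else 0j)
    corr = [v.real for v in dft(cross, inverse=True)]
    # The correlation surface peaks at the displacement.
    return max(range(len(corr)), key=corr.__getitem__)


reference = [0, 1, 4, 2, 7, 3, 0, 5, 1, 6, 2, 0, 3, 8, 1, 4]
moving = reference[-3:] + reference[:-3]  # circular shift by 3 samples
estimated_shift = phase_correlation_shift(reference, moving)
print(estimated_shift)
```

Because the estimate comes from a handful of transforms rather than an iterative search, this kind of registration is cheap enough to run on local patches inside a training loop, which matches the efficiency argument in the abstract.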
Time-causal and time-recursive wavelets
When applying wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissible class of kernels, with their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 28 pages, 11 figures
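To make the cascade idea in the abstract above concrete, here is a minimal sketch that assumes a first-order recursive filter as the discrete stand-in for a truncated exponential kernel and a backward difference as the temporal derivative. The time constants in `time_constants` are hypothetical and do not follow the paper's specific choice for the time-causal limit kernel.

```python
def recursive_smooth(signal, mu):
    """First-order recursive filter with time constant mu.

    Each output depends only on the current and past inputs, so the
    operation is time-causal and time-recursive (one state value).
    """
    out, state = [], 0.0
    for x in signal:
        state += (x - state) / (1.0 + mu)
        out.append(state)
    return out


def cascade(signal, time_constants):
    """Apply several first-order recursive filters one after another."""
    for mu in time_constants:
        signal = recursive_smooth(signal, mu)
    return signal


# Hypothetical time constants, chosen only for illustration.
time_constants = [0.5, 1.0, 2.0]

impulse = [0.0] * 5 + [1.0] + [0.0] * 200
kernel = cascade(impulse, time_constants)

# Backward difference as a discrete temporal derivative of the smoothing kernel.
wavelet = [kernel[0]] + [kernel[t] - kernel[t - 1] for t in range(1, len(kernel))]

print(sum(kernel), sum(wavelet))
```

The smoothing kernel is zero before the impulse and sums to approximately one, while its derivative sums to approximately zero. The zero mean is the property that lets derivative kernels of this kind act as wavelets.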
Random-phase Wave Splatting of Translucent Primitives for Computer-generated Holography
Holographic near-eye displays offer ultra-compact form factors for VR/AR systems but rely on advanced computer-generated holography (CGH) algorithms to convert 3D scenes into interference patterns on spatial light modulators (SLMs). Conventional CGH typically generates smooth-phase holograms, limiting view-dependent effects and realistic defocus blur, while severely under-utilizing the SLM space-bandwidth product. We propose Random-phase Wave Splatting (RPWS), a unified wave optics rendering framework that converts arbitrary 3D representations based on 2D translucent primitives into random-phase holograms. RPWS is fully compatible with modern 3D representations such as Gaussians and triangles, improves bandwidth utilization which effectively enlarges eyebox size, reconstructs accurate defocus blur and parallax, and leverages time-multiplexed rendering not as a heuristic for speckle suppression, but as a mathematically exact alpha-blending mechanism derived from first principles in statistics. At the core of RPWS are (1) a new wavefront compositing procedure and (2) an alpha-blending scheme for random-phase geometric primitives, ensuring correct color reconstruction and robust occlusion when compositing millions of primitives. RPWS departs substantially from the recent primitive-based CGH algorithm, Gaussian Wave Splatting (GWS). Because GWS uses smooth-phase primitives, it struggles to capture view-dependent effects and realistic defocus blur and under-utilizes the SLM space-bandwidth product; moreover, naively extending GWS to random-phase primitives fails to reconstruct accurate colors. In contrast, RPWS is designed from the ground up for arbitrary random-phase translucent primitives, and through simulations and experimental validations we demonstrate state-of-the-art image quality and perceptually faithful 3D holograms for next-generation near-eye displays.
Deep Learning, Machine Learning -- Digital Signal and Image Processing: From Theory to Application
Digital Signal Processing (DSP) and Digital Image Processing (DIP) with Machine Learning (ML) and Deep Learning (DL) are popular research areas in Computer Vision and related fields. We highlight transformative applications in image enhancement, filtering techniques, and pattern recognition. By integrating frameworks like the Discrete Fourier Transform (DFT), Z-Transform, and Fourier Transform methods, we enable robust data manipulation and feature extraction essential for AI-driven tasks. Using Python, we implement algorithms that optimize real-time data processing, forming a foundation for scalable, high-performance solutions in computer vision. This work illustrates the potential of ML and DL to advance DSP and DIP methodologies, contributing to artificial intelligence, automated feature extraction, and applications across diverse domains.
comment: 293 pages
Enhancing Brain Age Estimation with a Multimodal 3D CNN Approach Combining Structural MRI and AI-Synthesized Cerebral Blood Volume Measures
Brain age gap estimation (BrainAGE) is a promising imaging-derived biomarker of neurobiological aging and disease risk, yet current approaches rely predominantly on T1-weighted structural MRI (T1w), overlooking functional vascular changes that may precede tissue damage and cognitive decline. Artificial intelligence-generated cerebral blood volume (AICBV) maps, synthesized from non-contrast MRI, offer an alternative to contrast-enhanced perfusion imaging by capturing vascular information relevant to early neurodegeneration. We developed a multimodal BrainAGE framework that integrates brain age predictions using linear regression from two separate 3D VGG-based networks, one model trained on only structural T1w scans and one trained on only AICBV maps generated from a pre-trained 3D patch-based deep learning model. Each model was trained and validated on 2,851 scans from 13 open-source datasets and was evaluated for concordance with mild cognitive impairment (MCI) and Alzheimer's disease (AD) using ADNI subjects (n=1,233). The combined model achieved the most accurate brain age gap for cognitively normal (CN) controls, with a mean absolute error (MAE) of 3.95 years ($R^2$=0.943), outperforming models trained on T1w (MAE=4.10) or AICBV alone (MAE=4.49). Saliency maps revealed complementary modality contributions: T1w emphasized white matter and cortical atrophy, while AICBV highlighted vascular-rich and periventricular regions implicated in hypoperfusion and early cerebrovascular dysfunction, consistent with normal aging. Next, we observed that BrainAGE increased stepwise across diagnostic strata (CN < MCI < AD) and correlated with cognitive impairment (CDRSB r=0.403; MMSE r=-0.310). AICBV-based BrainAGE showed particularly strong separation between stable vs. progressive MCI (p=$1.47 \times 10^{-8}$), suggesting sensitivity to prodromal vascular changes that precede overt atrophy.
comment: 26 pages, 8 figures
Random forest-based out-of-distribution detection for robust lung cancer segmentation SP
Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from a pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, with a convolution decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 3D CT public datasets that included one ID dataset and four OOD datasets comprising chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with an FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, respectively, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance reliability of cancer segmentation in ID and OOD scenarios.
comment: Accepted at SPIE Medical Imaging 2026
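The FPR95 metric quoted in the RF-Deep abstract can be illustrated with a minimal sketch. It assumes that each scan receives a scalar OOD score where higher means "more likely OOD", and that OOD is treated as the positive class; both conventions, the function name `fpr_at_95_tpr`, and the toy scores are assumptions for illustration, not details taken from the paper.

```python
import math


def fpr_at_95_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on ID scans at the threshold that detects
    at least `tpr` of the OOD scans (assumed convention: OOD is positive,
    higher score means more OOD-like)."""
    ranked = sorted(ood_scores, reverse=True)
    # Number of OOD scans that must score at or above the threshold.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    # ID scans at or above the threshold are wrongly flagged as OOD.
    false_positives = sum(score >= threshold for score in id_scores)
    return false_positives / len(id_scores)


# Hypothetical toy scores, e.g. a classifier's predicted OOD probability.
toy_ood = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.65, 0.99, 0.4]
toy_id = [0.1, 0.2, 0.3, 0.5, 0.05]
print(fpr_at_95_tpr(toy_id, toy_ood))
```

A lower value is better: it means fewer in-distribution scans are rejected when the detector is tuned to catch 95% of OOD scans.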
INRetouch: Context Aware Implicit Neural Representation for Photography Retouching WACV 2026
Professional photo editing remains challenging, requiring extensive knowledge of imaging pipelines and significant expertise. While recent deep learning approaches, particularly style transfer methods, have attempted to automate this process, they often struggle with output fidelity, editing control, and complex retouching capabilities. We propose a novel retouch transfer approach that learns from professional edits through before-after image pairs, enabling precise replication of complex editing operations. We develop a context-aware Implicit Neural Representation that learns to apply edits adaptively based on image content and context, and is capable of learning from a single example. Our method extracts implicit transformations from reference edits and adaptively applies them to new images. To facilitate this research direction, we introduce a comprehensive Photo Retouching Dataset comprising 100,000 high-quality images edited using over 170 professional Adobe Lightroom presets. Through extensive evaluation, we demonstrate that our approach not only surpasses existing methods in photo retouching but also enhances performance in related image reconstruction tasks like Gamut Mapping and Raw Reconstruction. By bridging the gap between professional editing capabilities and automated solutions, our work presents a significant step toward making sophisticated photo editing more accessible while maintaining high-fidelity results. The source code and the dataset are publicly available at https://omaralezaby.github.io/inretouch .
comment: Accepted by WACV 2026
Multimedia
Gamification with Purpose: What Learners Prefer to Motivate Their Learning
This study investigates learners' preferences for game design elements (GDEs) in educational contexts to inform the development of purpose-driven gamification strategies. It emphasizes a learner-centered approach that aligns gamification design with pedagogical goals, while mitigating risks such as the erosion of intrinsic motivation. A systematic literature review was conducted to identify ten widely discussed GDEs. Visual prototypes representing each element were developed, and a best-worst scaling (BWS) survey with 125 participants was administered to elicit preference rankings. Qualitative feedback was also collected to uncover motivational drivers. Learners consistently preferred GDEs that support learning processes directly, most notably progress bars, concept maps, immediate feedback, and achievements. Qualitative analysis revealed six recurring motivational themes, including visible progress, content relevance, and constructive feedback. The findings suggest that learners value gamification elements that are meaningfully integrated with educational content and support intrinsic motivation. Purpose-aligned gamification should prioritize tools that visualize learning progress and provide actionable feedback, rather than relying solely on extrinsic incentives.
comment: 31 pages, 10 figures, Springer EAIT in review
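As a rough illustration of how a best-worst scaling survey like the one in the gamification study can be turned into a preference ranking, the sketch below uses the common count-based score (times chosen best minus times chosen worst, divided by times shown). The abstract does not say which BWS analysis the authors used, so this scoring rule, the function name `bws_scores`, and the toy responses are assumptions.

```python
from collections import Counter


def bws_scores(responses):
    """Count-based best-worst score per item.

    `responses` is a list of (shown_items, best_item, worst_item) tuples,
    one per choice task. Scores lie in [-1, 1].
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, best_item, worst_item in responses:
        shown.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical toy responses, not data from the study.
toy_responses = [
    (("progress bar", "achievements", "concept map"), "progress bar", "achievements"),
    (("progress bar", "concept map", "immediate feedback"), "progress bar", "immediate feedback"),
    (("achievements", "concept map", "immediate feedback"), "concept map", "achievements"),
]
ranking = sorted(bws_scores(toy_responses).items(), key=lambda kv: kv[1], reverse=True)
for item, score in ranking:
    print(f"{item}: {score:+.2f}")
```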
PAVAS: Physics-Aware Video-to-Audio Synthesis
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning AAAI 2026
Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.
comment: 15 pages, 6 figures, 13 tables; Accepted by AAAI 2026
Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation AAAI 2026
Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers' emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.
comment: Accepted to AAAI 2026
Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation
Multimodal Emotion Recognition in Conversation (MERC) significantly enhances emotion recognition performance by integrating complementary emotional cues from text, audio, and visual modalities. While existing methods commonly utilize techniques such as contrastive learning and cross-attention mechanisms to align cross-modal emotional semantics, they typically overlook modality-specific emotional nuances like micro-expressions, tone variations, and sarcastic language. To overcome these limitations, we propose Orthogonal Disentanglement with Projected Feature Alignment (OD-PFA), a novel framework designed explicitly to capture both shared semantics and modality-specific emotional cues. Our approach first decouples unimodal features into shared and modality-specific components. An orthogonal disentanglement strategy (OD) enforces effective separation between these components, aided by a reconstruction loss to maintain critical emotional information from each modality. Additionally, a projected feature alignment strategy (PFA) maps shared features across modalities into a common latent space and applies a cross-modal consistency alignment loss to enhance semantic coherence. Extensive evaluations on the widely used benchmark datasets IEMOCAP and MELD demonstrate the effectiveness of our proposed OD-PFA on multimodal emotion recognition tasks compared with state-of-the-art approaches.
comment: 10 pages, 1 figure
Computation and Language
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
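The direct power-law fit described in the scaling abstract above can be sketched as follows. The abstract does not give the exact functional form, so this sketch assumes -log(accuracy) = A * C^(-alpha), where C is the training budget, and fits A and alpha by ordinary least squares in log-log space. The function names and the synthetic data points are hypothetical.

```python
import math


def fit_power_law(budgets, accuracies):
    """Fit -log(acc) = A * C**(-alpha) by linear regression on
    log(-log(acc)) versus log(C). Requires 0 < acc < 1."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(-math.log(a)) for a in accuracies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (A, alpha)


def predict_accuracy(A, alpha, budget):
    """Extrapolate accuracy to a new training budget under the assumed form."""
    return math.exp(-A * budget ** (-alpha))


# Synthetic points generated from the assumed form, not results from the paper.
budgets = [1e18, 1e19, 1e20]
accuracies = [predict_accuracy(2000.0, 0.2, c) for c in budgets]
A_hat, alpha_hat = fit_power_law(budgets, accuracies)
print(A_hat, alpha_hat, predict_accuracy(A_hat, alpha_hat, 1e21))
```

Fitting in log-log space keeps the sketch dependency-free; a nonlinear least-squares fit on the raw accuracies would weight the points differently.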
Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts
Understanding human personality is crucial for web applications such as personalized recommendation and mental health assessment. Existing studies on personality detection predominantly adopt a "posts -> user vector -> labels" modeling paradigm, which encodes social media posts into user representations for predicting personality labels (e.g., MBTI labels). While recent advances in large language models (LLMs) have improved text encoding capacities, these approaches remain constrained by limited supervision signals due to label scarcity, and under-specified semantic mappings between user language and abstract psychological constructs. We address these challenges by proposing ROME, a novel framework that explicitly injects psychological knowledge into personality detection. Inspired by standardized self-assessment tests, ROME leverages LLMs' role-play capability to simulate user responses to validated psychometric questionnaires. These generated question-level answers transform free-form user posts into interpretable, questionnaire-grounded evidence linking linguistic cues to personality labels, thereby providing rich intermediate supervision to mitigate label scarcity while offering a semantic reasoning chain that guides and simplifies the text-to-personality mapping learning. A question-conditioned Mixture-of-Experts module then jointly routes over post and question representations, learning to answer questionnaire items under explicit supervision. The predicted answers are summarized into an interpretable answer vector and fused with the user representation for final prediction within a multi-task learning framework, where question answering serves as a powerful auxiliary task for personality detection. Extensive experiments on two real-world datasets demonstrate that ROME consistently outperforms state-of-the-art baselines, achieving improvements (e.g., 15.41% on the Kaggle dataset).
A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
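The reward aggregation strategies named in the federated RLHF abstract above (min, max, and average over group-level rewards) can be sketched in a few lines. The abstract does not specify the adaptive scheme, so the `adaptive` branch below is only one plausible reading: it gives more weight to groups whose historical alignment score has been lower, using a softmax over negated scores. The function name, the `temperature` parameter, and the toy numbers are assumptions.

```python
import math


def aggregate_rewards(rewards, mode="average", history=None, temperature=1.0):
    """Combine per-group rewards for one rollout into a single scalar.

    `history` holds each group's past alignment score and is only used
    by the hypothetical `adaptive` mode.
    """
    if mode == "min":
        return min(rewards)
    if mode == "max":
        return max(rewards)
    if mode == "average":
        return sum(rewards) / len(rewards)
    if mode == "adaptive":
        # Assumed rule: softmax over negated history, so groups that were
        # served worse in the past receive larger weights.
        logits = [-h / temperature for h in history]
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        return sum(e / total * r for e, r in zip(exps, rewards))
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical group rewards and historical alignment scores.
group_rewards = [0.2, 0.6, 1.0]
group_history = [0.1, 0.9, 0.9]
for mode in ("min", "max", "average", "adaptive"):
    print(mode, aggregate_rewards(group_rewards, mode, history=group_history))
```

Only the scalar rewards cross the group boundary in this sketch, which mirrors the setting where the server aggregates group-level rewards without seeing raw data.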
Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture AACL
Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
comment: To appear at AACL-IJCNLP 2025 Workshop WSLP
Automatic Essay Scoring and Feedback Generation in Basque Language Learning LREC 2026
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
comment: Submitted to LREC 2026
An Agentic AI System for Multi-Framework Communication Coding
Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversation remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across various communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that apply codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold-standard annotated transcripts for training and 50 transcripts for testing, spanning rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the Rheumatology subset (F1 = 0.962) and strongest for Patient Behavior (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablation studies showed that MOSAIC outperforms the baseline benchmarks.
QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation (more than 40 million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers, and can be obtained for a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
comment: The Python package is available at https://github.com/dess-mannheim/QSTN/
HealthcareNLP: where are we and what is next? LREC 2026
This proposed tutorial focuses on Healthcare Domain Applications of NLP, what we have achieved around HealthcareNLP, and the challenges that lie ahead for the future. Existing reviews in this domain either overlook some important tasks, such as synthetic data generation for addressing privacy concerns, or explainable clinical NLP for improved integration and implementation, or fail to mention important methodologies, including retrieval augmented generation and the neural symbolic integration of LLMs and KGs. In light of this, the goal of this tutorial is to provide an introductory overview of the most important sub-areas of a patient- and resource-oriented HealthcareNLP, with three layers of hierarchy: data/resource layer: annotation guidelines, ethical approvals, governance, synthetic data; NLP-Eval layer: NLP tasks such as NER, RE, sentiment analysis, and linking/coding with categorised methods, leading to explainable HealthAI; patients layer: Patient Public Involvement and Engagement (PPIE), health literacy, translation, simplification, and summarisation (also NLP tasks), and shared decision-making support. A hands-on session will be included in the tutorial for the audience to use HealthcareNLP applications. The target audience includes NLP practitioners in the healthcare application domain, NLP researchers who are interested in domain applications, healthcare researchers, and students from NLP fields. The type of tutorial is "Introductory to CL/NLP topics (HealthcareNLP)" and the audience does not need prior knowledge to attend this. Tutorial materials: https://github.com/4dpicture/HealthNLP
comment: Accepted Tutorial by LREC 2026 https://lrec2026.info/
Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks
Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64×64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
comment: 22 pages, 2 tables, 9 figures
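The Thompson Sampling curriculum manager mentioned in the multi-agent abstract above can be illustrated with a standard Beta-Bernoulli bandit over training zones. This is a generic sketch of Thompson sampling, not the paper's implementation: the class name `ThompsonZoneSelector`, the uniform Beta(1, 1) priors, and the reward in [0, 1] (standing in for the competence and NLL-driven signal) are all assumptions.

```python
import random


class ThompsonZoneSelector:
    """Beta-Bernoulli Thompson sampling over a fixed set of zones."""

    def __init__(self, n_zones, seed=0):
        # Beta(1, 1) prior per zone, i.e. uniform over the success rate.
        self.alpha = [1.0] * n_zones
        self.beta = [1.0] * n_zones
        self.rng = random.Random(seed)

    def choose(self):
        # Draw one plausible success rate per zone and pick the best draw.
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, zone, reward):
        # `reward` in [0, 1]; a hypothetical stand-in for a combined
        # competence and confidence signal.
        self.alpha[zone] += reward
        self.beta[zone] += 1.0 - reward


# Toy environment with hypothetical per-zone success probabilities.
environment = random.Random(1)
success_prob = [0.2, 0.5, 0.9]
selector = ThompsonZoneSelector(len(success_prob))
visits = [0] * len(success_prob)
for _ in range(500):
    zone = selector.choose()
    visits[zone] += 1
    selector.update(zone, 1.0 if environment.random() < success_prob[zone] else 0.0)
print(visits)
```

Because each choice is a random draw from the posterior, zones with little data are still explored occasionally while high-reward zones are visited most often.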
Beyond Real Weights: Hypercomplex Representations for Stable Quantization WACV
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
comment: Accepted in Winter Conference on Applications of Computer Vision (WACV) 2026
Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models
Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.
comment: in Chinese language, Published in the Proceedings of the 37th Annual Conference on Human and Language Technology, 2025, pp. 714-719. (English translation assisted by GPT)
What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models
Interpretability can be implemented as a means to understand decisions taken by (black box) models, such as machine translation (MT) or large language models (LLMs). Yet, research in this area has been limited in relation to a manifested problem in these models: gender bias. With this research, we aim to move away from simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge of a lacking scoring threshold and specifically examine different attribution levels of source words on the model gender decisions in the translation. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, how this compares to human decisions and that this information should be leveraged to mitigate gender bias.
Are generative AI text annotations systematically biased?
This paper investigates bias in GLLM annotations by conceptually replicating the manual annotations of Boukes (2024), using various GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) in combination with five different prompts for five concepts (political content, interactivity, rationality, incivility, and ideology). We find that GLLMs perform adequately in terms of F1 scores, but differ from manual annotations in terms of prevalence, yield substantively different downstream results, and display systematic bias in that they overlap more with each other than with manual annotations. Differences in F1 scores fail to account for the degree of bias.
comment: 9 pages, 6 figures, 1 table; version submitted to the International Communication Association Annual Conference in Cape Town 2026
Ontology-Based Knowledge Graph Framework for Industrial Standard Documents via Hierarchical and Propositional Structuring
Ontology-based knowledge graph (KG) construction is a core technology that enables multidimensional understanding and advanced reasoning over domain knowledge. Industrial standards, in particular, contain extensive technical information and complex rules presented in highly structured formats that combine tables, scopes of application, constraints, exceptions, and numerical calculations, making KG construction especially challenging. In this study, we propose a method that organizes such documents into a hierarchical semantic structure, decomposes sentences and tables into atomic propositions derived from conditional and numerical rules, and integrates them into an ontology-knowledge graph through LLM-based triple extraction. Our approach captures both the hierarchical and logical structures of documents, effectively representing domain-specific semantics that conventional methods fail to reflect. To verify its effectiveness, we constructed rule, table, and multi-hop QA datasets, as well as a toxic clause detection dataset, from industrial standards, and implemented an ontology-aware KG-RAG framework for comparative evaluation. Experimental results show that our method achieves significant performance improvements across all QA types compared to existing KG-RAG approaches. This study demonstrates that reliable and scalable knowledge representation is feasible even for industrial documents with intertwined conditions, constraints, and scopes, contributing to future domain-specific RAG development and intelligent document management.
The High Cost of Incivility: Quantifying Interaction Inefficiency via Multi-Agent Monte Carlo Simulations
Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled "sociological sandbox". We employ a Monte Carlo method to simulate hundreds of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with "toxic" system prompts. Our results demonstrate a statistically significant increase of approximately 25% in the duration of conversations involving toxic participants. We propose that this "latency of toxicity" serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.
comment: 8 figures, 3 tables
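The Monte Carlo measurement of convergence time described in the abstract above can be sketched with a toy stochastic model. This is not the authors' LLM-based setup: each "debate" here is a sequence of coin flips, and the per-argument agreement probabilities (0.25 and 0.20) are arbitrary assumptions chosen only so that the analytic expectations (4 and 5 arguments) differ by 25%. All function names are hypothetical.

```python
import random


def simulate_debate(p_agree, rng, max_turns=1000):
    """Return the number of arguments exchanged until a conclusion is reached."""
    for turn in range(1, max_turns + 1):
        if rng.random() < p_agree:
            return turn
    return max_turns


def mean_convergence_time(p_agree, n_runs, seed):
    """Average convergence time over n_runs independent simulated debates."""
    rng = random.Random(seed)
    return sum(simulate_debate(p_agree, rng) for _ in range(n_runs)) / n_runs


# Assumed probabilities, not estimates from the paper.
baseline = mean_convergence_time(0.25, n_runs=20000, seed=0)
toxic = mean_convergence_time(0.20, n_runs=20000, seed=1)
relative_increase = toxic / baseline - 1.0
print(f"baseline={baseline:.2f} toxic={toxic:.2f} increase={relative_increase:.1%}")
```

A real replication would replace `simulate_debate` with actual agent dialogues and add a significance test on the two samples of convergence times.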
Reasoning Models Ace the CFA Exams
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used where the suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen model as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of similar genres.
comment: 5 pages
Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
comment: 10 pages
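The Gumbel-Softmax relaxation mentioned in the abstract above can be sketched for a single suffix position as follows. This is a generic, standard-library illustration of the relaxation and the later discretization step, not the authors' training code; the logits, temperature, and function names are assumptions.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed ("soft") one-hot sample over the vocabulary."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


def discretize(soft_token):
    """Pick the index of the largest weight, as done at inference time."""
    return max(range(len(soft_token)), key=soft_token.__getitem__)


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical scores over a 4-token vocabulary
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
hard = discretize(soft)
print(soft, hard)
```

Lower values of `tau` push the soft sample closer to a one-hot vector, at the cost of noisier gradients in a differentiable implementation.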
Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models
Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.
comment: in Korean language, Published in the Proceedings of the 37th Annual Conference on Human and Language Technology, 2025, pp. 714-719. (English translation assisted by GPT)
MindShift: Analyzing Language Models' Reactions to Psychological Prompts
Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in psychological literature, namely the Minnesota Multiphasic Personality Inventory (MMPI), and examined LLMs' behavior to identify traits. To assess the sensitivity of LLMs' prompts and psychological biases, we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs' psychological adaptability. The results highlight a consistent improvement in LLMs' role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.
Detecting Hallucinations in Graph Retrieval-Augmented Generation via Attention Patterns and Semantic Alignment
Graph-based Retrieval-Augmented Generation (GraphRAG) enhances Large Language Models (LLMs) by incorporating external knowledge from linearized subgraphs retrieved from knowledge graphs. However, LLMs struggle to interpret the relational and topological information in these inputs, resulting in hallucinations that are inconsistent with the retrieved knowledge. To analyze how LLMs attend to and retain structured knowledge during generation, we propose two lightweight interpretability metrics: Path Reliance Degree (PRD), which measures over-reliance on shortest-path triples, and Semantic Alignment Score (SAS), which assesses how well the model's internal representations align with the retrieved knowledge. Through empirical analysis on a knowledge-based QA task, we identify failure patterns associated with over-reliance on salient paths and weak semantic grounding, as indicated by high PRD and low SAS scores. We further develop a lightweight post-hoc hallucination detector, Graph Grounding and Alignment (GGA), which outperforms strong semantic and confidence-based baselines across AUC and F1. By grounding hallucination analysis in mechanistic interpretability, our work offers insights into how structural limitations in LLMs contribute to hallucinations, informing the design of more reliable GraphRAG systems in the future.
Knowledge-Guided Large Language Model for Automatic Pediatric Dental Record Understanding and Safe Antibiotic Recommendation
Accurate interpretation of pediatric dental clinical records and safe antibiotic prescribing remain persistent challenges in dental informatics. Traditional rule-based clinical decision support systems struggle with unstructured dental narratives, incomplete radiographic descriptions, and complex safety constraints. To address these limitations, this study proposes a Knowledge-Guided Large Language Model (KG-LLM) that integrates a pediatric dental knowledge graph, retrieval-augmented generation (RAG), and a multi-stage safety validation pipeline for evidence-grounded antibiotic recommendation. The framework first employs a clinical NER/RE module to extract structured entities and relations from dental notes and radiology reports. Relevant guidelines, drug-safety rules, and analogous historical cases are subsequently retrieved from the knowledge graph and supplied to the LLM for diagnostic summarization and dose-drug-duration prediction. Safety assurance is achieved through a dual-layer validation mechanism combining deterministic rule checking with a learned classifier for detecting allergies, contraindications, and dosing errors. Experiments on 32,000 de-identified pediatric dental visit records demonstrate the effectiveness of the proposed approach. Compared with a domain-adapted Llama-2 clinical baseline, KG-LLM improves record-understanding performance (F1: 0.914 vs. 0.867), drug-dose-duration accuracy (Top-1: 0.782 vs. 0.716), and reduces unsafe antibiotic suggestions by 50%. Additional evaluation across summary quality, recommendation accuracy, and global safety scores further confirms the robustness of the system. Ablation analyses indicate that the knowledge graph, RAG, and safety modules each contribute substantially to clinical reliability and interpretability.
Luxical: High-Speed Lexical-Dense Text Embeddings
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF--IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
comment: 9 pages, 6 figures
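The "lexical-dense" idea in the Luxical abstract above (sparse TF-IDF features fed through a small ReLU network) can be sketched in plain Python. This is not the Luxical library API: the function names, layer sizes, and random untrained weights are assumptions, and the knowledge distillation step that would fit the weights to a transformer teacher is omitted.

```python
import math
import random
from collections import Counter


def build_idf(docs):
    """Compute a sorted vocabulary and smoothed IDF weights from whitespace tokens."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_vector(doc, vocab, idf):
    """Sparse lexical features, stored densely here for simplicity."""
    counts = Counter(doc.split())
    return [counts[t] * idf[t] for t in vocab]


def init_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows); a trained model would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def embed(x, w1, w2):
    """One hidden ReLU layer followed by a linear projection and L2 normalization."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    norm = math.sqrt(sum(o * o for o in out)) or 1.0
    return [o / norm for o in out]


docs = ["fast lexical text classifier", "dense transformer text embedding", "web scale text corpus"]
vocab, idf = build_idf(docs)
rng = random.Random(0)
w1 = init_layer(len(vocab), 8, rng)
w2 = init_layer(8, 4, rng)
embeddings = [embed(tfidf_vector(d, vocab, idf), w1, w2) for d in docs]
print(embeddings[0])
```

The cost profile follows from the structure: featurization is token counting, and the network is two small matrix products, with no attention over the token sequence.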
Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
comment: 40 pages, 6 figures. Updated and added content
Arbitrage: Efficient Reasoning via Advantage-Aware Speculation
Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
comment: 22 pages
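The step-level routing described in the Arbitrage abstract above can be sketched as a simple control loop. Everything here is a stand-in: `draft_step`, `target_step`, and `predicted_advantage` are hypothetical callables replacing real models and the trained router, and the `min_advantage` cutoff on the router output is an assumption about how its prediction might be turned into a decision.

```python
from typing import Callable, List, Tuple


def route_steps(
    n_steps: int,
    draft_step: Callable[[List[str]], str],
    target_step: Callable[[List[str]], str],
    predicted_advantage: Callable[[List[str], str], float],
    min_advantage: float = 0.0,
) -> Tuple[List[str], int]:
    """Generate n_steps reasoning steps, calling the target model only when
    the router predicts it would produce a meaningfully better step."""
    steps: List[str] = []
    target_calls = 0
    for _ in range(n_steps):
        candidate = draft_step(steps)  # cheap proposal from the draft model
        if predicted_advantage(steps, candidate) > min_advantage:
            candidate = target_step(steps)  # spend target compute on this step
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls


# Toy stand-ins (assumptions): the "router" favors the target on odd-numbered steps.
steps, target_calls = route_steps(
    4,
    draft_step=lambda prefix: f"draft-{len(prefix)}",
    target_step=lambda prefix: f"target-{len(prefix)}",
    predicted_advantage=lambda prefix, cand: 1.0 if len(prefix) % 2 else -1.0,
)
print(steps, target_calls)
```

The point of the design is visible in `target_calls`: target compute is spent only on the steps where an improvement is predicted, rather than on every rejected draft step.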
Survey and Experiments on Mental Disorder Detection via Social Media: From Large Language Models and RAG to Agents ICDE
Mental disorders represent a critical global health challenge, and social media is increasingly viewed as a vital resource for real-time digital phenotyping and intervention. Large Language Models (LLMs) offer stronger semantic understanding and reasoning than traditional deep learning, but their use in high-stakes clinical settings is limited by hallucinations and the lack of persistent memory. However, existing literature has not sufficiently investigated how advanced enhancement techniques, specifically Retrieval-Augmented Generation (RAG) and Agentic systems, can address these reliability and reasoning limitations. Here, we systematically survey the evolving landscape of LLM-based methods for social media mental disorder analysis, spanning standard pretrained language models, RAG to mitigate hallucinations and contextual gaps, and agentic systems for autonomous reasoning and multi-step intervention. We organize existing work by technical paradigm and clinical target, extending beyond common internalizing disorders to include psychotic disorders and externalizing behaviors. Additionally, the paper comprehensively evaluates the performance of LLMs, including the impact of RAG, across various tasks. This work establishes a unified benchmark for the field, paving the way for the development of trustworthy, autonomous AI systems that can deliver precise and explainable mental health support.
comment: 20 pages, 10 figures. This is an extension of ICDEW 2025
ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls
Large Language Models (LLMs) have demonstrated impressive fluency and reasoning capabilities, but their potential for misuse has raised growing concern. In this paper, we present ScamAgent, an autonomous multi-turn agent built on top of LLMs, capable of generating highly realistic scam call scripts that simulate real-world fraud scenarios. Unlike prior work focused on single-shot prompt misuse, ScamAgent maintains dialogue memory, adapts dynamically to simulated user responses, and employs deceptive persuasion strategies across conversational turns. We show that current LLM safety guardrails, including refusal mechanisms and content filters, are ineffective against such agent-based threats. Even models with strong prompt-level safeguards can be bypassed when prompts are decomposed, disguised, or delivered incrementally within an agent framework. We further demonstrate the transformation of scam scripts into lifelike voice calls using modern text-to-speech systems, completing a fully automated scam pipeline. Our findings highlight an urgent need for multi-turn safety auditing, agent-level control frameworks, and new methods to detect and disrupt conversational deception powered by generative AI.
comment: Accepted at CAMLIS 25: Conference on Applied Machine Learning for Information Security. 19 pages, 3 figures
The AI Consumer Index (ACE)
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
DiscoVerse: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation
Pharmaceutical research and development has accumulated vast and heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi-agent co-scientist designed to support pharmaceutical research and development at Roche. Designed as a human-in-the-loop assistant, DiscoVerse enables domain-specific queries by delivering evidence-based answers: it retrieves relevant data, links across documents, summarises key findings and preserves institutional memory. We assess DiscoVerse through expert evaluation of source-linked outputs. Our evaluation spans a selected subset of 180 molecules from Roche's research and development repositories, encompassing over 0.87 billion BPE tokens and more than four decades of research. To our knowledge, this represents the first agentic framework to be systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential archives covering the full lifecycle of drug development. Our contributions include: role-specialized agent designs aligned with scientist workflows; human-in-the-loop support for reverse translation; expert evaluation; and a large-scale demonstration showing promising decision-making insights. In brief, across seven benchmark queries, DiscoVerse achieved near-perfect recall ($\geq 0.99$) with moderate precision ($0.71-0.91$). Qualitative assessments and three real-world pharmaceutical use cases further showed faithful, source-linked synthesis across preclinical and clinical evidence.
comment: 24 pages, 5 figures, 3 tables. Updated version: added three pharmaceutical industry use cases and revised text for clarity
StreamingThinker: Large Language Models Can Think While Reading
Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
ProgRAG: Hallucination-Resistant Progressive Retrieval and Reasoning over Knowledge Graphs
Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.
SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records
Clinical information extraction, which involves structuring clinical concepts from unstructured medical text, remains a challenging problem that could benefit from the inclusion of tabular background information available in electronic health records. Existing open-source datasets lack explicit links between structured features and clinical concepts in the text, motivating the need for a new research dataset. We introduce SimSUM, a benchmark dataset of 10,000 simulated patient records that link unstructured clinical notes with structured background variables. Each record simulates a patient encounter in the domain of respiratory diseases and includes tabular data (e.g., symptoms, diagnoses, underlying conditions) generated from a Bayesian network whose structure and parameters are defined by domain experts. A large language model (GPT-4o) is prompted to generate a clinical note describing the encounter, including symptoms and relevant context. These notes are annotated with span-level symptom mentions. We conduct an expert evaluation to assess note quality and run baseline predictive models on both the tabular and textual data. The SimSUM dataset is primarily designed to support research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text -- namely, symptoms in the case of SimSUM. Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. SimSUM is not intended for training clinical decision support systems or production-grade models, but rather to facilitate reproducible research in a simplified and controlled setting.
comment: An earlier version of this dataset was published under the name SynSUM. It has since been renamed to SimSUM to avoid confusion with synthetic data generated from real data, and to emphasize the simulated nature of the dataset. The dataset is available at https://github.com/prabaey/SimSUM
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
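The consensus expert selection step in the Speculative Verdict abstract above can be sketched as agreement-based filtering of draft outputs. The scoring rule used here (count how many drafts share the same final answer and keep the top `k`) is an assumption for illustration; the paper's actual agreement measure may differ, and `select_consensus_paths` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def select_consensus_paths(drafts: List[Tuple[str, str]], k: int) -> List[Tuple[str, str]]:
    """Keep the k drafts whose final answers are shared by the most experts.

    drafts: (reasoning_path, answer) pairs, one per draft expert.
    """
    votes = Counter(answer for _, answer in drafts)
    # sorted() is stable, so drafts with equal agreement keep their input order.
    ranked = sorted(drafts, key=lambda d: votes[d[1]], reverse=True)
    return ranked[:k]


# Invented draft outputs from five small experts.
drafts = [
    ("path-1", "42"),
    ("path-2", "17"),
    ("path-3", "42"),
    ("path-4", "42"),
    ("path-5", "17"),
]
selected = select_consensus_paths(drafts, k=3)
print(selected)
```

Only the selected paths would then be passed to the large verdict model, which keeps its input short while still exposing it to the most widely supported reasoning.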
B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability
Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos Language Models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks. Our code is available at https://github.com/Ewanwong/bcos_lm.
comment: TMLR 12/2025
SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection
We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents. Our code and data are available at https://github.com/OS-Copilot/OS-Sentinel.
comment: work in progress
Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.
comment: 31 pages
Improving LLM Reliability with RAG in Religious Question-Answering: MufassirQAS
Religious teachings can sometimes be complex and challenging to grasp, but chatbots can serve as effective assistants in this domain. Large Language Model (LLM) based chatbots, powered by Natural Language Processing (NLP), can connect related topics and provide well-supported responses to intricate questions, making them valuable tools for religious education. However, LLMs are prone to hallucinations as they can generate inaccurate or irrelevant information, and these can include sensitive content that could be offensive, inappropriate, or controversial. Addressing such topics without inadvertently promoting hate speech or disrespecting certain beliefs remains a significant challenge. As a solution to these issues, we introduce MufassirQAS, a system that enhances LLM accuracy and transparency using a vector database-driven Retrieval-Augmented Generation (RAG) approach. We built a dataset comprising fundamental books containing Turkish translations and interpretations of Islamic texts. This database is leveraged to answer religious inquiries while ensuring that responses remain reliable and contextually grounded. Our system also presents the relevant dataset sections alongside the LLM-generated answers, reinforcing transparency. We carefully designed system prompts to prevent harmful, offensive, or disrespectful outputs, ensuring that responses align with ethical and respectful discourse. Moreover, MufassirQAS provides supplementary details, such as source page numbers and referenced articles, to enhance credibility. To evaluate its effectiveness, we tested MufassirQAS against ChatGPT with sensitive questions, and our system demonstrated superior performance in maintaining accuracy and reliability. Future work will focus on improving accuracy and refining prompt engineering techniques to further minimize biases and ensure even more reliable responses.
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
To address the trade-off between robustness and performance in robust vision-language models (VLMs), we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation ICSE 2026
Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.
comment: Accepted to ICSE 2026 (RECODE workshop)
Detecting value-expressive text posts in Russian social media
Basic values are concepts or beliefs which pertain to desirable end-states and transcend specific situations. Studying personal values in social media can illuminate how and why societal values evolve, especially when stimuli-based methods, such as surveys, are inefficient, for instance, in hard-to-reach populations. On the other hand, user-generated content is driven by the massive use of stereotyped, culturally defined speech constructions rather than authentic expressions of personal values. We aimed to find a model that can accurately detect value-expressive posts on the Russian social media platform VKontakte. A training dataset of 5,035 posts was annotated by three experts, 304 crowd-workers and ChatGPT. Crowd-workers and experts showed only moderate agreement in categorizing posts. ChatGPT was more consistent but struggled with spam detection. We applied an ensemble of human- and AI-assisted annotation involving an active learning approach, and subsequently trained several classification models using embeddings from various pre-trained transformer-based language models. The best performance was achieved with embeddings from a fine-tuned rubert-tiny2 model, yielding high value detection quality (F1 = 0.77, F1-macro = 0.83). This model provides a crucial step toward studying values within and between Russian social media users.
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs
Large language models (LLMs) can express different values in two distinct ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment and persona steering, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on substantially different mechanisms, but this remains largely understudied. We analyze this at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value expressions. We demonstrate that intrinsic and prompted value mechanisms partly share common components that are crucial for inducing value expression, but also possess unique elements that manifest in different ways. As a result, these mechanisms lead to different degrees of value steerability (prompted > intrinsic) and response diversity (intrinsic > prompted). In particular, components unique to the intrinsic mechanism seem to promote lexical diversity in responses, whereas those specific to the prompted mechanism primarily strengthen instruction following, taking effect even in distant tasks like jailbreaking.
Bench4KE: Benchmarking Automated Competency Question Generation
The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation. This trend is already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs), natural language questions used by ontology engineers to define the functional requirements of an ontology. However, the evaluation of these tools lacks standardization. This undermines the methodological rigor and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. The presented release focuses on evaluating tools that generate CQs automatically. Bench4KE provides a curated gold standard consisting of CQ datasets from 17 real-world ontology engineering projects and uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of 6 recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.
You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction WACV26
Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
comment: Accepted to WACV26. 12 pages, 8 tables, 5 figures
LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL ICDM
Converting natural language questions into SQL queries enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early text-to-SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the large language model era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models, including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek-R1, and others. Notably, DeepSeek-R1 achieves 88.40% accuracy in a zero-shot setting, and models under 10B parameters surpass 90% accuracy after fine-tuning. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark. Unlike the original WikiSQL, which was tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural-language-to-SQL models.
comment: To appear in the Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW)
CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings
Automatic generation of radiology reports has the potential to alleviate radiologists' significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) employs controlled decoding that completes "Findings" before synthesizing the "Impression", and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive "Findings" section before synthesizing the "Impression" and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method outperforms existing baselines on clinical efficacy scores.
Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support
LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. These directions can advance MAS research toward agents that think with their human partners rather than for them.
Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches to boost UQ performance. However, they sometimes fail to outperform much simpler baseline methods. Our work discusses the fundamental approach to constructing uncertainty measures that directly links uncertainty with the minimum Bayes risks achieved by LLM decoding. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency, leading to a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.
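As a rough illustration of the link between uncertainty and minimum Bayes risk described in this abstract, the following sketch estimates the risk from repeated samples. It is a minimal sketch under stated assumptions, not the authors' method: the function names, the token-overlap (Jaccard) similarity, and the uniform weighting of samples are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Hypothetical similarity: token-set overlap between two outputs."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mbr_uncertainty(
    samples: List[str],
    similarity: Callable[[str, str], float] = jaccard,
) -> Tuple[str, float]:
    """Return the minimum-risk sample and its estimated Bayes risk.

    Assumption: the samples were drawn from the model, so a plain average
    over them approximates an expectation under the model's distribution.
    The risk of a candidate is 1 minus its mean similarity to all samples.
    """
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    risks = [
        1.0 - sum(similarity(cand, other) for other in samples) / n
        for cand in samples
    ]
    best = min(range(n), key=risks.__getitem__)
    return samples[best], risks[best]


answer, risk = mbr_uncertainty(["paris", "paris", "london"])
```

Here the returned risk acts as the uncertainty score: it is zero when all samples agree and grows as the samples diverge. A real system would likely swap in a semantic similarity measure and weight samples by model confidence.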
LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions
Large language models (LLMs) are increasingly integrated into multi-agent systems (MAS), where peer interactions shape individual decisions. While prior work has mainly examined conformity bias, we broaden the view to include how LLMs build rapport from prior interactions, discern and integrate high-quality peer information, and resist misleading inputs, abilities essential for achieving collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark that simulates quiz-style collaboration with peer agents whose rapport levels and behaviours can be precisely controlled in both historical interactions and the current round. This unified setup enables systematic analysis of how rapport, peer actions, and the model's self-confidence jointly influence decision-making. Using KAIROS, we evaluate prompting, supervised fine-tuning, and reinforcement learning via Group Relative Policy Optimisation (GRPO). Results show that model scale is a primary factor moderating susceptibility to social influence: larger models are more resilient and benefit from prompting-based mitigation, whereas smaller models remain vulnerable. Only carefully configured GRPO training yields consistent robustness and performance gains for small models.
AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models
As text-to-image (T2I) models advance and gain widespread adoption, their associated safety concerns are becoming increasingly critical. Malicious users exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, underscoring the need for effective safeguards to ensure the integrity and compliance of model outputs. However, existing detection methods often exhibit low accuracy and inefficiency. In this paper, we propose AEIOU, a defense framework that is adaptable, efficient, interpretable, optimizable, and unified against NSFW prompts in T2I models. AEIOU extracts NSFW features from the hidden states of the model's text encoder, utilizing the separable nature of these features to detect NSFW prompts. The detection process is efficient, requiring minimal inference time. AEIOU also offers real-time interpretation of results and supports optimization through data augmentation techniques. The framework is versatile, accommodating various T2I architectures. Our extensive experiments show that AEIOU significantly outperforms both commercial and open-source moderation tools, achieving over 95% accuracy across all datasets and improving efficiency by at least tenfold. It effectively counters adaptive attacks and excels in few-shot and multi-label scenarios.
Mortgage Language Model: Domain-Adaptive Pretraining with Residual Instruction, Alignment Tuning, and Task-Specific Routing
Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.
Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions
Large Language Models (LLMs) often fail on multiple-choice questions (MCQs) despite demonstrating correct knowledge in other contexts, such as free-form generation. To investigate the mechanism underlying this knowledge-prediction gap on MCQs and alleviate it, we conduct a probing analysis and find that residual streams in certain layers contain a subspace spanned by two important bases: a knowledge basis that encodes the probability of the ground-truth answer for a given MCQ and a prediction basis that encodes the probability of the answer choice predicted by the model. We observe that incorrect predictions arise from a misalignment of the model's hidden states along these two bases. Hence, we introduce KAPPA (Knowledge-Aligned Prediction through Projection-based Adjustment), a parameter-free intervention that transforms the hidden states to align the prediction coordinate with the knowledge coordinate within this subspace. Experiments on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge show that KAPPA substantially improves accuracy and consistently outperforms baselines. While optimal subspaces differ across tasks, subspaces generalize to some extent, as supported by cross-dataset experiments. Moreover, KAPPA extends its effectiveness to free-form questions beyond MCQs. Our work provides a new geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.
Fine-grained Spatiotemporal Grounding on Egocentric Videos ICCV 2025
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask .
comment: Accepted by ICCV 2025
TS-PEFT: Unveiling Token-Level Redundancy in Parameter-Efficient Fine-Tuning
Current Parameter-Efficient Fine-Tuning (PEFT) methods typically operate under an implicit assumption: once a target module is selected, every token passing through it contributes equally to the downstream task and requires a parameter update. In this paper, we challenge this convention and unveil a pervasive token-level redundancy in the fine-tuning of large models. We propose TS-PEFT, a theoretically grounded framework utilizing proximal optimization to dynamically identify and skip redundant token updates during training. Our extensive experiments across Natural Language Understanding, Commonsense Reasoning, and Visual Instruction Tuning demonstrate that indiscriminately updating all tokens is not only computationally superfluous but often introduces optimization noise. Strikingly, by discarding 40%-60% of token updates, TS-PEFT consistently matches or surpasses the performance of dense baselines (e.g., LoRA, DoRA). Furthermore, we provide an in-depth analysis revealing that the learned token-level sparsity serves as a superior indicator of module importance compared to traditional weight norms, offering a novel data-driven perspective on the intrinsic adaptation mechanism of large models.
comment: 11 pages, 3 figures
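One way to picture how proximal optimization can produce token-level sparsity is the soft-thresholding operator, which is the proximal operator of an L1 penalty. The sketch below is an illustration under stated assumptions, not TS-PEFT's actual algorithm: the per-token scores, the threshold `lam`, and the function names are hypothetical, and the abstract does not say which proximal operator the authors use.

```python
from typing import List


def soft_threshold(x: float, lam: float) -> float:
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def token_update_mask(scores: List[float], lam: float) -> List[bool]:
    """True where a token keeps its update, False where it is skipped.

    Assumption: each token has a scalar score (hypothetical) and a token
    is skipped when the proximal step drives that score exactly to zero.
    """
    return [soft_threshold(s, lam) != 0.0 for s in scores]


scores = [0.9, 0.05, -0.4, 0.1, 0.0]   # hypothetical per-token scores
mask = token_update_mask(scores, lam=0.2)
skipped_fraction = mask.count(False) / len(mask)
```

With these made-up scores, three of the five tokens fall inside the threshold and are skipped. The appeal of a proximal step in this setting is that it yields exact zeros, so the skip decision needs no separate pruning heuristic.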
Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism
Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.
comment: 33 pages, 27 tables, 10 figures
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
comment: Code: https://github.com/ontocord/mixturevitae
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose Omniguard, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. Omniguard improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, Omniguard is also very efficient (about 120× faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework
Accurately detecting sentiment polarity and intensity in product reviews and social media posts remains challenging due to informal and domain-specific language. To address this, we propose a novel hybrid lexicon-fuzzy-transformer framework that combines rule-based heuristics, contextual deep learning, and fuzzy logic to generate continuous sentiment scores reflecting both polarity and strength. The pipeline begins with VADER-based initial sentiment estimations, which are refined through a two-stage adjustment process. This involves leveraging confidence scores from DistilBERT, a lightweight transformer, and applying fuzzy logic principles to mitigate excessive neutrality bias and enhance granularity. A custom fuzzy inference system then maps the refined scores onto a 0 to 1 continuum, producing expert-like judgments. The framework is rigorously evaluated on four domain-specific datasets: food delivery, e-commerce, tourism, and fashion. Results show improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications. Both quantitative metrics (distributional alignment, confusion matrices) and qualitative insights (case studies, runtime analysis) affirm the model's robustness and efficiency. This work demonstrates the value of integrating symbolic reasoning with neural models for interpretable, fine-grained sentiment analysis in linguistically dynamic domains.
comment: The manuscript was uploaded in error and is scientifically invalid. It is an incomplete draft with major flaws. Co-authors were not aware of or consenting to this submission and do not endorse it
Ineffectiveness for Search and Undecidability of PCSP Meta-Problems
It is an open question whether the search and decision versions of promise CSPs are equivalent. Most known algorithms for PCSPs solve only their decision variant, and it is unknown whether they can be adapted to solve search as well. The main approaches, called BLP, AIP and BLP+AIP, handle a PCSP by finding a solution to a relaxation of some integer program. We prove that rounding those solutions to a proper search certificate can be as hard as any problem in the class TFNP. In other words, these algorithms are ineffective for search. Building on the algebraic approach to PCSPs, we find sufficient conditions that imply ineffectiveness for search. Our tools are tailored to algorithms that are characterized by minions in a suitable way, and can also be used to prove undecidability results for meta-problems. This way, we show that the families of templates solvable via BLP, AIP, and BLP+AIP are undecidable. Using the same techniques we also analyze several algebraic conditions that are known to guarantee the tractability of finite-template CSPs. We prove that several meta-problems related to cyclic polymorphisms and WNUs are undecidable for PCSPs. In particular, there is no algorithm deciding whether a finite PCSP template (1) admits a cyclic polymorphism, (2) admits a WNU.
The Vector Grounding Problem
Large language models (LLMs) produce seemingly meaningful outputs, yet they are trained on text alone without direct interaction with the world. This leads to a modern variant of the classical symbol grounding problem in AI: can LLMs' internal states and outputs be about extra-linguistic reality, independently of the meaning human interpreters project onto them? We argue that they can. We first distinguish referential grounding -- the connection between a representation and its worldly referent -- from other forms of grounding and argue it is the only kind essential to solving the problem. We contend that referential grounding is achieved when a system's internal states satisfy two conditions derived from teleosemantic theories of representation: (1) they stand in appropriate causal-informational relations to the world, and (2) they have a history of selection that has endowed them with the function of carrying this information. We argue that LLMs can meet both conditions, even without multimodality or embodiment.
comment: Accepted for publication in Philosophy and the Mind Sciences
Transparent and Coherent Procedural Mistake Detection
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
comment: EMNLP 2025
Human-Computer Interaction
Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning
Image captioning is essential in many fields including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRI). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training in resource-constrained scenarios.
comment: 6 pages
"Nothing about us without us": Perspectives of Global Deaf and Hard-of-hearing Community Members on Sign Language Technologies
There is accelerating interest in sign language technologies (SLTs), with increasing attention from both industry and academia. However, the perspectives of Deaf and Hard-of-hearing (DHH) individuals remain marginalized in their development, particularly those outside of the West and in the global South. This paper presents findings from a global, multilingual survey capturing community views on SLTs across a wide range of countries, sign languages, and cultural contexts. While participants recognized the potential of SLTs to support access and independence, many expressed concerns about cultural erasure, inaccurate translation, and hearing-dominated research pipelines. Perceptions of SLTs were shaped by factors including sign language proficiency, policy exposure, and deaf identity. Across regions, participants emphasized the importance of DHH-led design, citing the risk of harm when DHH communities are excluded from technological decision-making. This study offers a novel cross-continental, community-informed analysis of SLTs and concludes with actionable recommendations for researchers, technologists, and policymakers.
Exploring the Grassroots Understanding and Practices of Collective Memory Co-Contribution in a University Community
Collective memory -- community members' interconnected memories and impressions of the group -- is essential to the community's culture and identity. Its development requires members' continuous participatory contribution and sensemaking. However, existing works mainly adopt a holistic sociological perspective to analyze well-developed collective memory, focusing less on member-level conceptualization of this possession or on what the co-contribution practices can be. Therefore, this work alternatively adopts the latter perspective and probes such interpretative and interactional patterns with two mobile systems. One is a locative narrative and exploration system condensed from existing literature's design frameworks, and the other is a conventional online forum representing current practices; together they served as the anchors of observation for our two-week, mixed-methods field study (n=38) on a university campus. A core debate we identified was whether to contemplate the past retrospectively or to document the present as history for the future. This subsequently impacted the narrative focuses, expectations of collective memory constituents, and the ways participants seek inspiration from the group. We further extracted design considerations that could better embrace the diverse conceptualizations of collective memory and bond different community members together. Lastly, revisiting and reflecting on our design, we provided extra insights on designing devoted locative narrative experiences for community-driven UGC platforms.
comment: 30 pages, 6 figures, to be published in the journal Proceedings of the ACM on Human-Computer Interaction (PACMHCI 2026, Volume 10, Issue 1)
See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.
The SMART+ Framework for AI Systems
Artificial Intelligence (AI) systems are now an integral part of multiple industries. In clinical research, AI supports automated adverse event detection in clinical trials, patient eligibility screening for protocol enrollment, and data quality validation. Beyond healthcare, AI is transforming finance through real-time fraud detection, automated loan risk assessment, and algorithmic decision-making. Similarly, in manufacturing, AI enables predictive maintenance to reduce equipment downtime, enhances quality control through computer-vision inspection, and optimizes production workflows using real-time operational data. While these technologies enhance operational efficiency, they introduce new challenges regarding safety, accountability, and regulatory compliance. To address these concerns, we introduce the SMART+ Framework - a structured model built on the pillars of Safety, Monitoring, Accountability, Reliability, and Transparency, and further enhanced with Privacy & Security, Data Governance, Fairness & Bias, and Guardrails. SMART+ offers a practical, comprehensive approach to evaluating and governing AI systems across industries. This framework aligns with evolving mechanisms and regulatory guidance to integrate operational safeguards, oversight procedures, and strengthened privacy and governance controls. SMART+ demonstrates risk mitigation, trust-building, and compliance readiness. By enabling responsible AI adoption and ensuring auditability, SMART+ provides a robust foundation for effective AI governance in clinical research.
Gamification with Purpose: What Learners Prefer to Motivate Their Learning
This study investigates learners' preferences for game design elements (GDEs) in educational contexts to inform the development of purpose-driven gamification strategies. It emphasizes a learner-centered approach that aligns gamification design with pedagogical goals, while mitigating risks such as the erosion of intrinsic motivation. A systematic literature review was conducted to identify ten widely discussed GDEs. Visual prototypes representing each element were developed, and a best-worst scaling (BWS) survey with 125 participants was administered to elicit preference rankings. Qualitative feedback was also collected to uncover motivational drivers. Learners consistently preferred GDEs that support learning processes directly, most notably progress bars, concept maps, immediate feedback, and achievements. Qualitative analysis revealed six recurring motivational themes, including visible progress, content relevance, and constructive feedback. The findings suggest that learners value gamification elements that are meaningfully integrated with educational content and support intrinsic motivation. Purpose-aligned gamification should prioritize tools that visualize learning progress and provide actionable feedback, rather than relying solely on extrinsic incentives.
comment: 31 pages, 10 figures, Springer EAIT in review
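As a rough illustration of how a best-worst scaling survey can yield a preference ranking, the sketch below uses the common counting approach: for each item, the number of times it was chosen as best minus the number of times it was chosen as worst, divided by the number of times it was shown. The abstract does not say which BWS analysis the authors used, so the scoring rule, the function name `bws_scores`, and the response format are assumptions.

```python
from collections import Counter


def bws_scores(responses):
    """Best-minus-worst counting scores in the range [-1, 1].

    responses: iterable of (shown_items, best_item, worst_item) tuples,
    one per choice set answered by a participant (assumed format).
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, best_item, worst_item in responses:
        shown.update(items)       # every displayed item counts as one exposure
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


def ranking(scores):
    """Items ordered from most to least preferred."""
    return sorted(scores, key=scores.get, reverse=True)
```

A score near 1 means an element was almost always picked as best when shown, and a score near -1 means it was almost always picked as worst. More elaborate analyses (for example, choice models) exist, but this counting score is a simple baseline.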
SensHRPS: Sensing Comfortable Human-Robot Proxemics and Personal Space With Eye-Tracking
Social robots must adjust to human proxemic norms to ensure user comfort and engagement. While prior research demonstrates that eye-tracking features reliably estimate comfort in human-human interactions, their applicability to interactions with humanoid robots remains unexplored. In this study, we investigate user comfort with the robot "Ameca" across four experimentally controlled distances (0.5 m to 2.0 m) using mobile eye-tracking and subjective reporting (N=19). We evaluate multiple machine learning and deep learning models to estimate comfort based on gaze features. Contrary to previous human-human studies where Transformer models excelled, a Decision Tree classifier achieved the highest performance (F1-score = 0.73), with minimum pupil diameter identified as the most critical predictor. These findings suggest that physiological comfort thresholds in human-robot interaction differ from human-human dynamics and can be effectively modeled using interpretable logic.
Time and Money Matters for Sustainability: Insights on User Preferences on Renewable Energy for Electric Vehicle Charging Stations
Charging electric vehicles (EVs) with renewable energy can lessen their environmental impact. However, the fluctuating availability of renewable energy affects the sustainability of public EV charging stations. Nearby public charging stations may utilize differing energy sources due to their microgrid connections - ranging from exclusively renewable to non-renewable or a combination of both - highlighting the substantial variability in energy supply types within short distances. This study investigates the near-future scenario of integrating dynamic renewable energy availability in charging station navigation to impact the choices of EV users towards renewable sources. We conducted a within-subjects design survey with 50 car users and semi-structured interviews with 10 EV users from rural, suburban, and urban areas. The results show that when choosing EV charging stations, drivers often prioritize either time savings or money savings based on the driving scenarios that influence drivers' consumer value. Notably, EV users tend to select renewable-powered stations when they align with their main priority, be it saving money or time. This study offers end-user insights into the front-end graphical user interface and the development of the back-end ranking algorithm for navigation recommender systems that integrate dynamic renewable energy availability for the sustainable use of electric vehicles.
Mediating Personal Relationships with Robotic Pets for Fostering Human-Human Interaction of Older Adults
Good human relationships are important for us to have a happy life and maintain our well-being. Otherwise, we will be at risk of experiencing loneliness or depression. In human-computer interaction (HCI) and computer-supported cooperative work (CSCW), robotic systems offer nuanced approaches to foster human connection, providing interaction beyond the traditional mediums that smartphones and computers offer. However, many existing studies primarily focus on the human-robot relationships that older adults form directly with robotic pets rather than exploring how these robotic pets can enhance human-human relationships. Our ethnographic study investigates how robotic pets can be designed to facilitate human relationships. Through semi-structured interviews with six older adults and thematic analysis, our empirical findings provide insights into how robotic pets can be designed as telerobots to connect with others remotely, thus contributing to the future development of robotic systems for mental health.
ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
Understanding Mental States in Active and Autonomous Driving with EEG
Understanding how driver mental states differ between active and autonomous driving is critical for designing safe human-vehicle interfaces. This paper presents the first EEG-based comparison of cognitive load, fatigue, valence, and arousal across the two driving modes. Using data from 31 participants performing identical tasks in both scenarios of three different complexity levels, we analyze temporal patterns, task-complexity effects, and channel-wise activation differences. Our findings show that although both modes evoke similar trends across complexity levels, the intensity of mental states and the underlying neural activation differ substantially, indicating a clear distribution shift between active and autonomous driving. Transfer-learning experiments confirm that models trained on active driving data generalize poorly to autonomous driving and vice versa. We attribute this distribution shift primarily to differences in motor engagement and attentional demands between the two driving modes, which lead to distinct spatial and temporal EEG activation patterns. Although autonomous driving results in lower overall cortical activation, participants continue to exhibit measurable fluctuations in cognitive load, fatigue, valence, and arousal associated with readiness to intervene, task-evoked emotional responses, and monotony-related passive fatigue. These results emphasize the need for scenario-specific data and models when developing next-generation driver monitoring systems for autonomous vehicles.
comment: 15 Pages, 13 Figures and 3 Tables. This work has been submitted to IEEE Transaction for possible publication
Cognitive Trust in HRI: "Pay Attention to Me and I'll Trust You Even if You are Wrong"
Cognitive trust and the belief that a robot is capable of accurately performing tasks are recognized as central factors in fostering high-quality human-robot interactions. It is well established that performance factors such as the robot's competence and its reliability shape cognitive trust. Recent studies suggest that affective factors, such as robotic attentiveness, also play a role in building cognitive trust. This work explores the interplay between these two factors that shape cognitive trust. Specifically, we evaluated whether different combinations of robotic competence and attentiveness introduce a compensatory mechanism, where one factor compensates for the lack of the other. In the experiment, participants performed a search task with a robotic dog in a 2x2 experimental design that included two factors: competence (high or low) and attentiveness (high or low). The results revealed that high attentiveness can compensate for low competence. Participants who collaborated with a highly attentive robot that performed poorly reported trust levels comparable to those working with a highly competent robot. When the robot did not demonstrate attentiveness, low competence resulted in a substantial decrease in cognitive trust. The findings indicate that building cognitive trust in human-robot interaction may be more complex than previously believed, involving emotional processes that are typically overlooked. We highlight an affective compensatory mechanism that adds a layer to consider alongside traditional competence-based models of cognitive trust.
comment: Conference paper
Inferring Operator Emotions from a Motion-Controlled Robotic Arm
A remote robot operator's affective state can significantly impact the resulting robot's motions leading to unexpected consequences, even when the user follows protocol and performs permitted tasks. The recognition of a user operator's affective states in remote robot control scenarios is, however, underexplored. Current emotion recognition methods rely on reading the user's vital signs or body language, but the devices and user participation these measures require would add limitations to remote robot control. We demonstrate that the functional movements of a remote-controlled robotic avatar, which was not designed for emotional expression, can be used to infer the emotional state of the human operator via a machine-learning system. Specifically, our system achieved 83.3% accuracy in recognizing the user's emotional state expressed by robot movements, as a result of their hand motions. We discuss the implications of this system on prominent current and future remote robot operation and affective robotic contexts.
Mental Models of Autonomy and Sentience Shape Reactions to AI
Narratives about artificial intelligence (AI) entangle autonomy, the capacity to self-govern, with sentience, the capacity to sense and feel. AI agents that perform tasks autonomously and companions that recognize and express emotions may activate mental models of autonomy and sentience, respectively, provoking distinct reactions. To examine this possibility, we conducted three pilot studies (N = 374) and four preregistered vignette experiments describing an AI as autonomous, sentient, both, or neither (N = 2,702). Activating a mental model of sentience increased general mind perception (cognition and emotion) and moral consideration more than autonomy, but autonomy increased perceived threat more than sentience. Sentience also increased perceived autonomy more than vice versa. Based on a within-paper meta-analysis, sentience changed reactions more than autonomy on average. By disentangling different mental models of AI, we can study human-AI interaction with more precision to better navigate the detailed design of anthropomorphized AI and prompting interfaces.
comment: 37 pages, 6 figures, 2 tables
Prototyping and Evaluating a Real-time Neuro-Adaptive Virtual Reality Flight Training System
Real-time adjustments to task difficulty during flight training are crucial for optimizing performance and managing pilot workload. This study evaluated the functionality of a pre-trained brain-computer interface (BCI) that adapts training difficulty based on real-time estimations of workload from brain signals. Specifically, an EEG-based neuro-adaptive training system was developed and tested in Virtual Reality (VR) flight simulations with military student pilots. The neuro-adaptive system was compared to a fixed sequence that progressively increased in difficulty, in terms of self-reported user engagement, workload, and simulator sickness (subjective measures), as well as flight performance (objective metric). Additionally, we explored the relationships between subjective workload and flight performance in the VR simulator for each condition. The experiments concluded with semi-structured interviews to elicit the pilots' experience with the neuro-adaptive prototype. Results revealed no significant differences between the adaptive and fixed sequence conditions in subjective measures or flight performance. In both conditions, flight performance decreased as subjective workload increased. The semi-structured interviews indicated that, upon briefing, the pilots preferred the neuro-adaptive VR training system over the system with a fixed sequence, although individual differences were observed in the perception of difficulty and the order of changes in difficulty. Even though this study shows performance does not change, BCI-based flight training systems hold the potential to provide a more personalized and varied training experience.
A Survey of Body and Face Motion: Datasets, Performance Evaluation Metrics and Generative Techniques
Body and face motion play an integral role in communication. They convey crucial information on the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal / non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representation techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on https://lownish23csz0010.github.io/mogen/.
The AI Consumer Index (ACE)
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
Funding AI for Good: A Call for Meaningful Engagement
Artificial Intelligence for Social Good (AI4SG) is a growing area that explores AI's potential to address social issues, such as public health. Yet prior work has shown limited evidence of its tangible benefits for intended communities, and projects frequently face inadequate community engagement and sustainability challenges. While existing HCI literature on AI4SG initiatives primarily focuses on the mechanisms of funded projects and their outcomes, much less attention has been given to the funding agenda and rhetoric that influences downstream approaches. Through a thematic analysis of 35 funding documents (representing about $410 million USD in total investments), we reveal dissonances between AI4SG's stated intentions for positive social impact and the techno-centric approaches that some funding agendas promoted, while also identifying funding documents that scaffolded community-collaborative approaches for applicants. Drawing on our findings, we offer recommendations for funders to embed approaches that balance both contextual understanding and technical capacities in future funding call designs. We further discuss how the HCI community can positively shape AI4SG funding design processes.
comment: Currently under review
Mixed Reality Scenic Live Streaming for Cultural Heritage: Visual Interactions in a Historic Landscape
Scenic Live Streams (SLS), capturing real-world scenic sites from fixed cameras without streamers, have gained increasing popularity recently. They afford unique real-time lenses into remote sites for viewers' synchronous and collective engagement. Foregrounding their lack of dynamism and interactivity, we aim to maximize the potential of SLS by making them interactive. In our approach, named MRSLS, we overlaid plain SLS with interactive Mixed Reality content that matches the site's geographical structures and local cultural backgrounds. We further highlight the substantial benefit of MRSLS to cultural heritage site interactions, and we demonstrate this design proposal with an MRSLS prototype at a UNESCO-listed heritage site in China. The design process includes an interview (N=6) to pinpoint local scenery and culture, as well as two iterative design studies (N=15, 14). A mixed-methods, between-subjects study (N=43, 37) shows that MRSLS affords immersive scenery appreciation, effective cultural imprints, and vivid shared experience. With its balance between cultural, participatory, and authentic attributes, we appeal for more HCI attention to (MR)SLS as an under-explored design space.
comment: 14 pages, 6 figures, to be published in the Proceedings of the International Conference on Human-Engaged Computing (ICHEC '25)
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents. Our code and data are available at https://github.com/OS-Copilot/OS-Sentinel.
comment: work in progress
Generative AI in Game Development: A Qualitative Research Synthesis
Generative Artificial Intelligence (GenAI) is currently reshaping game development practices, production pipelines, and value networks in an unprecedentedly pervasive manner, with cascading consequences that remain unclear. In the last five years since GenAI's inception, a growing body of qualitative research has explored these early transformations from different settings and demographic angles. However, these studies often contextualise and consolidate their findings weakly with related work; for research to keep up with and support stakeholders in this development, the current moment calls for a synthesis of the findings that have emerged thus far. Here, we address this need through a qualitative research synthesis via meta-ethnography. We followed PRISMA-S to systematically search the relevant literature from 2020-2025, including major HCI and games research databases. We then synthesised the ten eligible studies, conducting reciprocal translation and line-of-argument synthesis guided by eMERGe, informed by CASP quality appraisal. We identified nine overarching themes, provide recommendations, and contextualise our insights in wider game production trajectories. With this work, we seek to provide practitioners, researchers and policy-makers with grounded insights to guide practice, research and governance.
comment: 35 pages, 3 figures, 6 tables
NERVIS: An Interactive System for Graph-Based Exploration and Editing of Named Entities
We present an interactive visualization system for exploring named entities and their relationships across document collections. The system is designed around a graph-based representation that integrates three types of nodes: documents, entity mentions, and entities. Connections capture two key relationship types: (i) identical entities across contexts, and (ii) co-locations of mentions within documents. Multiple coordinated views enable users to examine entity occurrences, discover clusters of related mentions, and explore higher-level entity group relationships. To support flexible and iterative exploration, the interface offers fuzzy views with approximate connections, as well as tools for interactively editing the graph by adding or removing links, entities, and mentions, as well as editing entity terms. Additional interaction features include filtering, mini-map navigation, and export options to JSON or image formats for downstream analysis and reporting. This approach contributes to human-centered exploration of entity-rich text data by combining graph visualization, interactive refinement, and adaptable perspectives on relationships.
comment: 10 pages, 6 figures, submitted to HCI SI 2025
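The three node types and two connection types described in the NERVIS abstract can be illustrated with a small data-structure sketch. This is not the authors' implementation: the toy mentions, the field names (`id`, `doc`, `entity`, `text`), the edge labels, and the helper `build_graph` are all hypothetical, and linking mentions pairwise is only one assumed way to encode "identical entities across contexts" and "co-locations of mentions within documents".

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each mention records its document and resolved entity.
mentions = [
    {"id": "m1", "doc": "d1", "entity": "e_paris", "text": "Paris"},
    {"id": "m2", "doc": "d1", "entity": "e_france", "text": "France"},
    {"id": "m3", "doc": "d2", "entity": "e_paris", "text": "the French capital"},
]


def build_graph(mentions):
    """Return node sets by type plus the two edge types named in the abstract."""
    nodes = {"document": set(), "mention": set(), "entity": set()}
    by_doc = defaultdict(list)
    by_entity = defaultdict(list)
    for m in mentions:
        nodes["document"].add(m["doc"])
        nodes["mention"].add(m["id"])
        nodes["entity"].add(m["entity"])
        by_doc[m["doc"]].append(m["id"])
        by_entity[m["entity"]].append(m["id"])
    edges = {
        # (i) mentions that resolve to the same entity, possibly across documents
        "same_entity": {
            pair
            for ids in by_entity.values()
            for pair in combinations(sorted(ids), 2)
        },
        # (ii) mentions that co-occur within a single document
        "co_location": {
            pair
            for ids in by_doc.values()
            for pair in combinations(sorted(ids), 2)
        },
    }
    return nodes, edges


nodes, edges = build_graph(mentions)
```

Keeping the edge types in separate sets means that an interactive edit, such as removing one link, only touches one relationship type at a time.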
Fitts' List Revisited: An Empirical Study on Function Allocation in a Two-Agent Physical Human-Robot Collaborative Position/Force Task
In this letter, we investigate whether classical function allocation, the principle of assigning tasks to either a human or a machine, holds for physical Human-Robot Collaboration, which is important for providing insights for Industry 5.0 to guide how to best augment rather than replace workers. This study empirically tests the applicability of Fitts' List within physical Human-Robot Collaboration, by conducting a user study (N=26, within-subject design) to evaluate four distinct allocations of position/force control between human and robot in an abstract blending task. We hypothesize that the function in which humans control the position achieves better performance and receives higher user ratings. When allocating position control to the human and force control to the robot, compared to the opposite case, we observed a significant improvement in preventing overblending. This was also perceived better in terms of physical demand and overall system acceptance, while participants experienced greater autonomy, more engagement and less frustration. An interesting insight was that the supervisory role (when the robot controls both position and force) was rated second best in terms of subjective acceptance. Another surprising insight was that if position control was delegated to the robot, the participants perceived much lower autonomy than when the force control was delegated to the robot. These findings empirically support applying Fitts' principles to static function allocation for physical collaboration, while also revealing important nuanced user experience trade-offs, particularly regarding perceived autonomy when delegating position control.
comment: 8 pages, 6 figures, published in IEEE Robotics and Automation Letters, vol. 11, no. 1, January 2026
Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support
LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. These directions can advance MAS research toward agents that think with their human partners rather than for them.
FLoRA: An Advanced AI-Powered Engine to Facilitate Hybrid Human-AI Regulated Learning
Self-Regulated Learning (SRL), defined as learners' ability to systematically plan, monitor, and regulate their learning activities, is crucial for sustained academic achievement and lifelong learning competencies. Emerging AI developments profoundly influence SRL interactions by potentially either diminishing or strengthening learners' opportunities to exercise their own regulatory skills. Recent literature emphasizes a balanced approach termed Hybrid Human-AI Regulated Learning (HHAIRL), in which AI provides targeted, timely scaffolding while preserving the learners' role as active decision-makers and reflective monitors of their learning process. Central to HHAIRL is the integration of adaptive and personalized learning systems; by modelling each learner's knowledge and self-regulation patterns, AI can deliver contextually relevant scaffolds that support learners during all phases of the SRL process. Nevertheless, existing digital tools frequently fall short, lacking adaptability and personalisation, focusing narrowly on isolated SRL phases, and insufficiently supporting meaningful human-AI interactions. In response, this paper introduces the enhanced FLoRA Engine, which incorporates advanced generative AI features and state-of-the-art learning analytics, and is grounded in solid educational theories. The FLoRA Engine offers tools such as collaborative writing, multi-agent chatbots, and detailed learning trace logging to support dynamic, adaptive scaffolding of self-regulation tailored to individual needs in real time. We further present a summary of several research studies that validate these tools and illustrate how they can be utilized in real-world educational and experimental contexts. These studies demonstrate the effectiveness of the FLoRA Engine in fostering SRL, providing both theoretical insights and practical solutions for the future of AI-enhanced learning contexts.
CommentScope: A Comment-Embedded Assisted Reading System for a Long Text
Long texts are ubiquitous on social platforms, yet readers often face information overload and struggle to locate key content. Comments provide valuable external perspectives for understanding, questioning, and complementing the text, but their potential is hindered by disorganized and unstructured presentation. Few studies have explored embedding comments directly into reading. As an exploratory step, we propose CommentScope, a system with two core modules: a pipeline that classifies comments into five types and aligns them with relevant sentences, and a presentation module that integrates comments inline or as side notes, supported by visual cues such as colors, charts, and highlights. Technical evaluation shows that the hybrid "Rule+LLM" pipeline achieved solid performance in semantic classification (accuracy=0.90) and position alignment (accuracy=0.88). A user study (N=12) further demonstrated that the sentence-end embedding significantly improved comment discovery accuracy and reading fluency while reducing mental demand and perceived effort.
comment: 32 pages, 9 figures
Beyond Community Notes: A Framework for Understanding and Building Crowdsourced Context Systems for Social Media
Social media platforms are increasingly developing features that display crowdsourced context alongside posts, modeled after X's Community Notes. These systems, which we term Crowdsourced Context Systems (CCS), have the potential to reshape our information ecosystem as major platforms embrace them as alternatives to top-down fact-checking. To deeply understand the features and implications of such systems, we perform a systematic literature review of existing CCS research and analyze several real-world CCS implementations. Based on our analysis, we develop a framework with two distinct components. First, we present a theoretical model to help conceptualize and define CCS. Second, we identify a design space encompassing six key aspects of CCS: participation, inputs, curation, presentation, platform treatment, and transparency. We discuss key normative implications of different CCS design and implementation choices. Our paper integrates these theoretical, design, and ethical perspectives to establish a foundation for future human-centered research on Crowdsourced Context Systems.
Enhancing the NAO: Extending Capabilities of Legacy Robots for Long-Term Research
Legacy (unsupported) robotic platforms often lose research utility when manufacturer support ends, preventing integration of modern sensing, speech, and interaction capabilities. We present the Enhanced NAO, a revitalized version of Aldebaran's NAO robot featuring upgraded beamforming microphones, RGB-D and thermal cameras, and additional compute resources in a fully self-contained package. This system combines cloud-based and local models for perception and dialogue, while preserving the NAO's expressive body and behaviors. In a pilot user study validating conversational performance, the Enhanced NAO delivered significantly higher conversational quality and elicited stronger user preference compared to the NAO AI Edition, without increasing response latency. The added visual and thermal sensing modalities established a foundation for future perception-driven interaction. Beyond this implementation, our framework provides a platform-agnostic strategy for extending the lifespan and research utility of legacy robots, ensuring they remain valuable tools for human-robot interaction.
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose Omniguard, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. Omniguard improves harmful prompt classification accuracy by 11.57\% over the strongest baseline in a multilingual setting, by 20.44\% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, Omniguard is also very efficient ($\approx\!120 \times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
DropleX: Liquid sensing on tablet touchscreens
We present DropleX, the first system that enables liquid sensing using the capacitive touchscreen of commodity tablets. DropleX detects microliter-scale liquid samples, and performs non-invasive, through-container measurements to detect whether a drink has been spiked or if a sealed liquid has been contaminated. These capabilities are made possible by a physics-informed mechanism that disables the touchscreen's built-in adaptive filters, originally designed to reject the effects of liquid drops such as rain, without any hardware modifications. We model the touchscreen's sensing capabilities, limits, and non-idealities to inform the design of a signal processing and learning-based pipeline for liquid sensing. Our system achieves 96-99% accuracy in detecting microliter-scale adulteration in soda, wine, and milk, 93-96% accuracy in threshold detection of trace chemical concentrations, and 86-96% accuracy in through-container adulterant detection. Given the predominance of touchscreens, these exploratory results can open new opportunities for liquid sensing on everyday devices.
New Synthetic Goldmine: Hand Joint Angle-Driven EMG Data Generation Framework for Micro-Gesture Recognition AAAI 2026
Electromyography (EMG)-based gesture recognition has emerged as a promising approach for human-computer interaction. However, its performance is often limited by the scarcity of labeled EMG data, significant cross-user variability, and poor generalization to unseen gestures. To address these challenges, we propose SeqEMG-GAN, a conditional, sequence-driven generative framework that synthesizes high-fidelity EMG signals from hand joint angle sequences. Our method introduces a context-aware architecture composed of an angle encoder, a dual-layer context encoder featuring the novel Ang2Gist unit, a deep convolutional EMG generator, and a discriminator, all jointly optimized via adversarial learning. By conditioning on joint kinematic trajectories, SeqEMG-GAN is capable of generating semantically consistent EMG sequences, even for previously unseen gestures, thereby enhancing data diversity and physiological plausibility. Experimental results show that classifiers trained solely on synthetic data experience only a slight accuracy drop (from 57.77\% to 55.71\%). In contrast, training with a combination of real and synthetic data significantly improves accuracy to 60.53\%, outperforming real-only training by 2.76\%. These findings demonstrate the effectiveness of our framework, which also achieves state-of-the-art performance in augmenting EMG datasets and enhancing gesture recognition performance for applications such as neural robotic hand control, AI/AR glasses, and gesture-based virtual gaming systems.
comment: Accepted at AAAI 2026
GaitGuard: Protecting Video-Based Gait Privacy in Mixed Reality
Mixed Reality (MR) systems capture continuous video streams that expose bystanders' and collaborators' gait patterns -- a biometric revealing sensitive attributes including age, gender, and health conditions. We show that video-based gait profiling achieves 78\% accuracy (15.6$\times$ random chance) on unprotected MR feeds, motivating \textbf{GaitGuard}, a real-time defense operating on a companion mobile device. GaitGuard introduces \textbf{GaitExtract}, an automated gait feature extraction pipeline adapted from clinical analysis for egocentric MR perspectives. Through systematic evaluation of 233 mitigation configurations, we characterize privacy-utility-performance trade-offs. A key insight is that gait features derive primarily from transient events (heel strikes, toe-offs). We exploit this temporal sparsity through adaptive mitigation that selectively processes only gait-critical frames, achieving a 68\% reduction in profiling accuracy while preserving visual quality (SSIM: 0.97) at 29~FPS. \textbf{GaitGuard} scales to 10 simultaneous users with under 10ms latency. A qualitative study with 20 participants confirms that users preferred a solution such as \textbf{GaitGuard}, which provides privacy guarantees.
comment: 20 pages, 18 figures
Computer Vision and Pattern Recognition
Voxify3D: Pixel Art Meets Volumetric Rendering
Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90\% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
comment: Project page: https://yichuanh.github.io/Voxify-3D/
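The palette-constrained Gumbel-Softmax quantization named in the Voxify3D abstract can be illustrated with a minimal standard-library sketch. It shows only the generic Gumbel-Softmax relaxation applied to a color palette; it is not the authors' implementation, and the palette, logits, temperature `tau`, and function names below are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via the inverse CDF: -log(-log(U)), U ~ Uniform(0, 1).
        # U is clamped away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize_color(logits, palette, tau, rng):
    """Blend palette colors with the relaxed weights; a low tau approaches a hard pick."""
    weights = gumbel_softmax(logits, tau, rng)
    color = tuple(
        sum(w * rgb[channel] for w, rgb in zip(weights, palette))
        for channel in range(3)
    )
    return weights, color


# Hypothetical 3-color palette (RGB in [0, 1]) and per-voxel palette logits.
palette = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.8, 0.1, 0.1)]
logits = [0.2, 1.5, -0.3]
rng = random.Random(0)
weights, color = quantize_color(logits, palette, tau=0.5, rng=rng)
```

In a differentiable framework the same weights would carry gradients back to the logits; this sketch only shows the forward computation, in which the output color is always a convex combination of palette entries.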
Relational Visual Similarity
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
comment: Project page, data, and code: https://thaoshibe.github.io/relsim
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
comment: Project Website https://jackailab.github.io/Projects/UnityVideo
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at https://github.com/lewandofskee/OpenVE.
comment: 38 pages
WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling
Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data providing precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.
Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes SIGGRAPH
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
comment: Accepted to SIGGRAPH Asia 2025. Project webpage: https://tau-vailab.github.io/Lang3D-XL
Multi-view Pyramid Transformer: Look Coarser to See Broader
We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
comment: Project page: see https://gynjn.github.io/MVP/
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
comment: Project Page: https://zhaochongan.github.io/projects/OneStory
Distribution Matching Variational AutoEncoder
Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce \textbf{Distribution-Matching VAE} (\textbf{DMVAE}), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at https://github.com/sen-ye/dmvae.
GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring WACV 2026
Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species.
comment: Accepted at WACV 2026
Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification AAAI 2026
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap between the visible and infrared modalities, estimating reliable cross-modality associations becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating local cluster errors and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper addresses cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose a modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a "split-and-contrast" strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.
comment: Accepted to AAAI 2026
UltrasODM: A Dual Stream Optical Flow Mamba Network for 3D Freehand Ultrasound Reconstruction
Clinical ultrasound acquisition is highly operator-dependent, where rapid probe motion and brightness fluctuations often lead to reconstruction errors that reduce trust and clinical utility. We present UltrasODM, a dual-stream framework that assists sonographers during acquisition through calibrated per-frame uncertainty, saliency-based diagnostics, and actionable prompts. UltrasODM integrates (i) a contrastive ranking module that groups frames by motion similarity, (ii) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (iii) a Human-in-the-Loop (HITL) layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps highlighting regions of low confidence. When uncertainty exceeds the threshold, the system issues unobtrusive alerts suggesting corrective actions such as re-scanning highlighted regions or slowing the sweep. Evaluated on a clinical freehand ultrasound dataset, UltrasODM reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs. By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Our code is publicly available at https://github.com/AnandMayank/UltrasODM.
Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation
Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.
DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at https://github.com/hustvl/DiffusionDriveV2
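The intra-anchor advantage estimation described above can be illustrated with a minimal sketch. It assumes the standard GRPO group normalization (reward minus group mean, divided by group standard deviation), applied separately to the samples that share an anchor. The function name, the `eps` constant, and the toy rewards are illustrative assumptions rather than the authors' implementation, and the inter-anchor truncated variant is not shown because the abstract does not specify it.

```python
from collections import defaultdict
from statistics import mean, pstdev


def intra_anchor_advantages(anchor_ids, rewards, eps=1e-8):
    """Normalize each reward against the other samples of the same anchor.

    Assumption: standard GRPO-style normalization, (r - mean) / (std + eps),
    computed per anchor so that different driving intentions are never
    compared with each other.
    """
    groups = defaultdict(list)
    for index, anchor in enumerate(anchor_ids):
        groups[anchor].append(index)

    advantages = [0.0] * len(rewards)
    for indices in groups.values():
        group_rewards = [rewards[i] for i in indices]
        mu = mean(group_rewards)
        sigma = pstdev(group_rewards)  # population std; 0 for a single sample
        for i in indices:
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


# Hypothetical toy data: two anchors, two sampled trajectories each.
anchor_ids = ["turn_left", "turn_left", "go_straight", "go_straight"]
rewards = [0.2, 0.6, 0.9, 0.7]
advantages = intra_anchor_advantages(anchor_ids, rewards)
```

Because the normalization is per anchor, a mediocre "turn_left" sample is judged only against other "turn_left" samples, which is the property the abstract attributes to the intra-anchor step.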
HLTCOE Evaluation Team at TREC 2025: VQA Track
The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
comment: 7 pages, 1 figure
SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.
SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination WACV 2026
Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
comment: WACV 2026
Improving action classification with brain-inspired deep networks
Action recognition is key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) make use of information about the body and information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained using the HAA500 dataset perform almost as accurately on versions of the stimuli that show both body and background and on versions of the stimuli from which the body was removed, but are at chance level for versions of the stimuli from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture patterned after domain specificity in the brain with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that matches more closely the pattern of accuracy observed in human participants.
ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation
Generating high-fidelity upper-body 3D avatars from a one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guide a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa
UnCageNet: Tracking and Pose Estimation of Caged Animal
Animal tracking and pose estimation systems, such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose, experience substantial performance drops when processing images and videos with cage structures and systematic occlusions. We present a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) cage inpainting using CRFill for content-aware reconstruction of occluded regions, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model leverages orientation-aware features with 72 directional kernels to accurately identify and segment cage structures that severely impair the performance of existing methods. Experimental validation demonstrates that removing cage occlusions through our pipeline enables pose estimation and tracking performance comparable to that in environments without occlusions. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.
comment: 9 pages, 2 figures, 2 tables. Accepted to the Indian Conference on Computer Vision, Graphics, and Image Processing (ICVGIP 2025), Mandi, India
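As a rough illustration of the orientation-aware filtering mentioned in the UnCageNet abstract, the sketch below builds a bank of 72 real-valued Gabor kernels using the textbook Gabor formula. The even spacing of orientations over [0, pi), the kernel size, and the `sigma`, `wavelength`, `gamma`, and `psi` defaults are assumptions made for the example; the paper's kernels are described as tunable and are integrated into a ResNet-UNet, neither of which is modeled here.

```python
import math


def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter on a size x size grid (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(num_orientations=72, size=7, **params):
    """One kernel per orientation, evenly spaced over [0, pi) (an assumption)."""
    return [
        gabor_kernel(size, math.pi * k / num_orientations, **params)
        for k in range(num_orientations)
    ]


bank = gabor_bank()  # 72 kernels, 2.5 degrees apart under the spacing assumption
```

Thin, repetitive bars such as cage wires respond strongly to the kernel whose orientation is perpendicular to the bars, which is the intuition behind using a dense set of directions.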
PVeRA: Probabilistic Vector-Based Random Matrix Adaptation
Large foundation models have emerged in recent years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned on small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones with only a fraction of the trainable parameters and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. This modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters. Our code for training models with PVeRA and benchmarking all adapters is available at https://github.com/leofillioux/pvera.
Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment WACV 2026
Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives (those directly tied to the prompt's alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.
comment: WACV 2026
sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only
Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/
HalluShift++: Bridging Language and Vision through Internal Representation Shifts for Hierarchical Hallucinations in MLLMs
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding tasks. While these models usually produce linguistically coherent output, they often suffer from hallucinations, generating descriptions that are factually inconsistent with the visual content, potentially leading to adverse consequences. Therefore, the assessment of hallucinations in MLLMs has become increasingly crucial in the model development process. Contemporary methodologies predominantly depend on external LLM evaluators, which are themselves susceptible to hallucinations and may present challenges in terms of domain adaptation. In this study, we propose the hypothesis that hallucination manifests as measurable irregularities within the internal layer dynamics of MLLMs, not merely due to distributional shifts but also in the context of layer-wise analysis of specific assumptions. By incorporating such modifications, HalluShift++ broadens the efficacy of hallucination detection from text-based large language models (LLMs) to encompass multimodal scenarios. Our codebase is available at https://github.com/C0mRD/HalluShift_Plus.
DIST-CLIP: Arbitrary Metadata and Image Guided MRI Harmonization via Disentangled Anatomy-Contrast Representations
Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce this instrumental and acquisition variability, but existing approaches remain insufficient. When applied to imaging data, image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details or are typically restricted to datasets with limited variability, failing to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance when compared against state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and standardizing MRI data. Our code and weights will be made publicly available upon publication.
EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset
We address the challenge of predicting human visual attention during real-world navigation by measuring and modeling egocentric pedestrian eye gaze in an outdoor campus setting. We introduce the EgoCampus dataset, which spans 25 unique outdoor paths over 6 km across a university campus with recordings from more than 80 distinct human pedestrians, resulting in a diverse set of gaze-annotated videos. The system used for collection, Meta's Project Aria glasses, integrates eye tracking, front-facing RGB cameras, inertial sensors, and GPS to provide rich data from the human perspective. Unlike many prior egocentric datasets that focus on indoor tasks or exclude eye gaze information, our work emphasizes visual attention while subjects walk along outdoor campus paths. Using this data, we develop EgoCampusNet, a novel method to predict the eye gaze of navigating pedestrians as they move through outdoor environments. Our contributions provide both a new resource for studying real-world attention and a resource for future work in gaze prediction models for navigation. Dataset and code are available upon request, and will be made publicly available at a later date at https://github.com/ComputerVisionRutgers/EgoCampus.
Optimization-Guided Diffusion for Interactive Scene Generation
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events, which are essential for this task, are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate 5× more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
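The near-collision criterion above relies on time-to-collision (TTC). A minimal sketch of one common way to compute it is given below, assuming both agents are discs moving at constant velocity; the abstract does not state which TTC definition OMEGA uses, so the disc model, the function name, and the example numbers are assumptions. Only the three-second threshold comes from the text.

```python
import math


def time_to_collision(pos_a, vel_a, pos_b, vel_b, combined_radius):
    """Earliest t >= 0 at which two constant-velocity discs touch, else inf."""
    px, py = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]  # relative position
    vx, vy = vel_b[0] - vel_a[0], vel_b[1] - vel_a[1]  # relative velocity

    c = px * px + py * py - combined_radius ** 2
    if c <= 0:
        return 0.0  # already overlapping
    a = vx * vx + vy * vy
    if a == 0:
        return math.inf  # no relative motion
    b = 2 * (px * vx + py * vy)
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return math.inf  # paths never come within combined_radius
    t = (-b - math.sqrt(discriminant)) / (2 * a)
    return t if t >= 0 else math.inf  # negative root: agents are separating


NEAR_COLLISION_SECONDS = 3.0  # threshold quoted in the abstract

# Hypothetical frame: ego at 10 m/s approaching a stopped vehicle 30 m ahead.
ttc = time_to_collision((0.0, 0.0), (10.0, 0.0), (30.0, 0.0), (0.0, 0.0), 2.0)
is_near_collision = ttc < NEAR_COLLISION_SECONDS
```

Counting the frames for which `is_near_collision` holds gives the kind of statistic the abstract reports, under the stated disc and constant-velocity assumptions.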
An AI-Powered Autonomous Underwater System for Sea Exploration and Scientific Research
Traditional sea exploration faces significant challenges due to extreme conditions, limited visibility, and high costs, resulting in vast unexplored ocean regions. This paper presents an innovative AI-powered Autonomous Underwater Vehicle (AUV) system designed to overcome these limitations by automating underwater object detection, analysis, and reporting. The system integrates YOLOv12 Nano for real-time object detection, a Convolutional Neural Network (CNN) (ResNet50) for feature extraction, Principal Component Analysis (PCA) for dimensionality reduction, and K-Means++ clustering for grouping marine objects based on visual characteristics. Furthermore, a Large Language Model (LLM) (GPT-4o Mini) is employed to generate structured reports and summaries of underwater findings, enhancing data interpretation. The system was trained and evaluated on a combined dataset of over 55,000 images from the DeepFish and OzFish datasets, capturing diverse Australian marine environments. Experimental results demonstrate the system's capability to detect marine objects with a mAP@0.5 of 0.512, a precision of 0.535, and a recall of 0.438. The integration of PCA effectively reduced feature dimensionality while preserving 98% of the variance, facilitating K-Means++ clustering, which successfully grouped detected objects based on visual similarities. The LLM integration proved effective in generating insightful summaries of detections and clusters, supported by location data. This integrated approach significantly reduces the risks associated with human diving, increases mission efficiency, and enhances the speed and depth of underwater data analysis, paving the way for more effective scientific research and discovery in challenging marine environments.
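The PCA step above retains 98% of the variance. One way such a cut-off is commonly chosen is sketched below: keep the smallest number of principal components whose cumulative explained-variance ratio reaches the target. The function name and the eigenvalues are made up for illustration; the abstract does not report how many components were actually kept.

```python
def components_for_variance(eigenvalues, target=0.98):
    """Smallest k such that the top-k eigenvalues explain >= target of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


# Hypothetical covariance eigenvalues (per-component variances).
toy_eigenvalues = [5.0, 3.0, 1.5, 0.4, 0.1]
kept = components_for_variance(toy_eigenvalues, target=0.98)
```

With these toy values the first three components explain 95% of the variance and the first four explain 99%, so four components are kept for a 98% target.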
Liver Fibrosis Quantification and Analysis: The LiQA Dataset and Baseline Method
Liver fibrosis represents a significant global health burden, necessitating accurate staging for effective clinical management. This report introduces the LiQA (Liver Fibrosis Quantification and Analysis) dataset, established as part of the CARE 2024 challenge. Comprising 440 patients with multi-phase, multi-center MRI scans, the dataset is curated to benchmark algorithms for Liver Segmentation (LiSeg) and Liver Fibrosis Staging (LiFS) under complex real-world conditions, including domain shifts, missing modalities, and spatial misalignment. We further describe the challenge's top-performing methodology, which integrates a semi-supervised learning framework with external data for robust segmentation, and utilizes a multi-view consensus approach with Class Activation Map (CAM)-based regularization for staging. Evaluation of this baseline demonstrates that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings.
MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation
Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods suffer from poor scalability due to quadratic global attention costs when increasing the number of components. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing that selects the top-k relevant components for sparse global attention, and (2) unimportant-component compression that preserves contextual priors of unselected components while reducing the computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with a scalable number of components. Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: https://lizhiqi49.github.io/MoCA
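The two designs above can be sketched in a few lines: pick the top-k components by importance and keep their tokens in full, then shrink every other component to a compact summary so it still contributes context. The sketch assumes mean pooling as the compression and takes importance scores as given; both are placeholders, since the abstract does not say how MoCA computes importance or compresses components.

```python
def route_components(importance, tokens_per_component, k):
    """Build an attention context from top-k full components plus pooled summaries.

    importance: one score per component (how it is computed is not modeled here).
    tokens_per_component: list of components, each a list of token vectors.
    Returns (selected component indices, context token list).
    """
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    selected = sorted(ranked[:k])

    context = []
    for index, tokens in enumerate(tokens_per_component):
        if index in selected:
            context.extend(tokens)  # full-resolution tokens
        else:
            dim = len(tokens[0])
            # Assumed compression: a single mean-pooled token per component.
            pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
            context.append(pooled)
    return selected, context


# Hypothetical toy scene: three components with two 2-D tokens each.
scores = [0.1, 0.9, 0.5]
components = [
    [[1.0, 1.0], [3.0, 3.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[2.0, 2.0], [4.0, 4.0]],
]
selected, context = route_components(scores, components, k=1)
```

Here the context shrinks from six tokens to four; with many components and a fixed k, the context grows by roughly one token per extra component instead of by that component's full token count.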
Decomposition Sampling for Efficient Region Annotations in Active Learning
Active learning improves annotation efficiency by selecting the most informative samples for annotation and model training. While most prior work has focused on selecting informative images for classification tasks, we investigate the more challenging setting of dense prediction, where annotations are more costly and time-intensive, especially in medical imaging. Region-level annotation has been shown to be more efficient than image-level annotation for these tasks. However, existing methods for representative annotation region selection suffer from high computational and memory costs, irrelevant region choices, and heavy reliance on uncertainty sampling. We propose decomposition sampling (DECOMP), a new active learning sampling strategy that addresses these limitations. It enhances annotation diversity by decomposing images into class-specific components using pseudo-labels and sampling regions from each class. Class-wise predictive confidence further guides the sampling process, ensuring that difficult classes receive additional annotations. Across ROI classification, 2-D segmentation, and 3-D segmentation, DECOMP consistently surpasses baseline methods by better sampling minority-class regions and boosting performance on these challenging classes. Code is available at https://github.com/JingnaQiu/DECOMP.git.
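The sampling idea above (decompose by pseudo-label, sample from every class, give low-confidence classes more of the budget) can be sketched as follows. The grid cells standing in for candidate regions, the `1 - confidence` weighting, and the rounding rule are all assumptions chosen to keep the example small; the released DECOMP code may allocate the budget differently.

```python
import random


def decomp_sample(pseudo_labels, class_confidence, budget, seed=0):
    """Pick regions per pseudo-label class, favoring low-confidence classes.

    pseudo_labels: 2-D grid of predicted class ids (one cell = one candidate region).
    class_confidence: mean predictive confidence per class, in [0, 1].
    budget: approximate total number of regions to annotate.
    """
    rng = random.Random(seed)

    # Decompose the image into class-specific sets of cells.
    cells_by_class = {}
    for r, row in enumerate(pseudo_labels):
        for c, cls in enumerate(row):
            cells_by_class.setdefault(cls, []).append((r, c))

    # Assumed weighting: harder (less confident) classes get a larger share.
    weights = {cls: 1.0 - class_confidence[cls] for cls in cells_by_class}
    total = sum(weights.values())

    picks = {}
    for cls, cells in cells_by_class.items():
        share = weights[cls] / total if total > 0 else 1.0 / len(cells_by_class)
        count = max(1, round(budget * share))  # every class gets at least one
        picks[cls] = rng.sample(cells, min(count, len(cells)))
    return picks


# Hypothetical pseudo-label map: class 1 is a small, low-confidence minority.
labels = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
picks = decomp_sample(labels, {0: 0.9, 1: 0.5}, budget=6)
```

In this toy case the minority class receives all three of its cells while the confident majority class receives a single region, which mirrors the minority-class emphasis described in the abstract. The realized total can differ from `budget` because of rounding and small classes.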
Online Segment Any 3D Thing as Instance Tracking NeurIPS 2025
Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning. The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.
comment: NeurIPS 2025, Code is at https://github.com/AutoLab-SAI-SJTU/AutoSeg3D
More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery
The recent Segment Anything Model (SAM) 3 has introduced significant advancements over its predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3's 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
comment: Technical Report
Robust Variational Model Based Tailored UNet: Leveraging Edge Detector and Mean Curvature for Improved Image Segmentation
To address the challenge of segmenting noisy images with blurred or fragmented boundaries, this paper presents a robust version of Variational Model Based Tailored UNet (VM_TUNet), a hybrid framework that integrates variational methods with deep learning. The proposed approach incorporates physical priors, namely an edge detector and a mean curvature term, into a modified Cahn-Hilliard equation, aiming to combine the interpretability and boundary-smoothing advantages of variational partial differential equations (PDEs) with the strong representational ability of deep neural networks. The architecture consists of two collaborative modules: an F module, which conducts efficient frequency domain preprocessing to alleviate poor local minima, and a T module, which ensures accurate and stable local computations, backed by a stability estimate. Extensive experiments on three benchmark datasets indicate that the proposed method achieves a balanced trade-off between performance and computational efficiency, yielding competitive quantitative results and improved visual quality compared to pure convolutional neural network (CNN) based models, while achieving performance close to that of transformer-based methods at reasonable computational expense.
LongCat-Image Technical Report
We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after mid-training and post-training stages, but also the entire toolchain of training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.
All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs
Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers effectively balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
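The abstract above measures a token's information by the change in output probabilities when the token is removed, and then prunes randomly in deep layers. A minimal sketch of both ideas follows. The choice of total variation distance as the "change" measure, and the names `token_information` and `random_prune`, are assumptions for illustration; the abstract does not specify the exact distance or pruning interface.

```python
import random


def token_information(p_full, p_without):
    """Assumed proxy: total variation distance between the model's output
    distribution with all tokens (p_full) and with one token removed."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_full, p_without))


def random_prune(num_tokens, keep_ratio, seed=0):
    """Return the sorted indices of visual tokens kept by uniform random pruning."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    keep = max(1, round(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), keep))


# A token whose removal barely shifts the output carries little information.
print(token_information([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))
print(random_prune(num_tokens=10, keep_ratio=0.5))
```

Under this reading, a layer lies beyond the "information horizon" when every visual token's score is close to zero, at which point a random subset is as good as any ranked one.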
R2MF-Net: A Recurrent Residual Multi-Path Fusion Network for Robust Multi-directional Spine X-ray Segmentation
Accurate segmentation of spinal structures in X-ray images is a prerequisite for quantitative scoliosis assessment, including Cobb angle measurement, vertebral translation estimation and curvature classification. In routine practice, clinicians acquire coronal, left-bending and right-bending radiographs to jointly evaluate deformity severity and spinal flexibility. However, the segmentation step remains heavily manual, time-consuming and non-reproducible, particularly in low-contrast images and in the presence of rib shadows or overlapping tissues. To address these limitations, this paper proposes R2MF-Net, a recurrent residual multi-path encoder-decoder network tailored for automatic segmentation of multi-directional spine X-ray images. The overall design consists of a coarse segmentation network and a fine segmentation network connected in cascade. Both stages adopt an improved Inception-style multi-branch feature extractor, while a recurrent residual jump connection (R2-Jump) module is inserted into skip paths to gradually align encoder and decoder semantics. A multi-scale cross-stage skip (MC-Skip) mechanism allows the fine network to reuse hierarchical representations from multiple decoder levels of the coarse network, thereby strengthening the stability of segmentation across imaging directions and contrast conditions. Furthermore, a lightweight spatial-channel squeeze-and-excitation block (SCSE-Lite) is employed at the bottleneck to emphasize spine-related activations and suppress irrelevant structures and background noise. We evaluate R2MF-Net on a clinical multi-view radiograph dataset comprising 228 sets of coronal, left-bending and right-bending spine X-ray images with expert annotations.
Precise Liver Tumor Segmentation in CT Using a Hybrid Deep Learning-Radiomics Framework
Accurate three-dimensional delineation of liver tumors on contrast-enhanced CT is a prerequisite for treatment planning, navigation and response assessment, yet manual contouring is slow, observer-dependent and difficult to standardise across centres. Automatic segmentation is complicated by low lesion-parenchyma contrast, blurred or incomplete boundaries, heterogeneous enhancement patterns, and confounding structures such as vessels and adjacent organs. We propose a hybrid framework that couples an attention-enhanced cascaded U-Net with handcrafted radiomics and voxel-wise 3D CNN refinement for joint liver and liver-tumor segmentation. First, a 2.5D two-stage network with a densely connected encoder, sub-pixel convolution decoders and multi-scale attention gates produces initial liver and tumor probability maps from short stacks of axial slices. Inter-slice temporal consistency is then enforced by a simple three-slice refinement rule along the cranio-caudal direction, which restores thin and tiny lesions while suppressing isolated noise. Next, 728 radiomic descriptors spanning intensity, texture, shape, boundary and wavelet feature groups are extracted from candidate lesions and reduced to 20 stable, highly informative features via multi-strategy feature selection; a random forest classifier uses these features to reject false-positive regions. Finally, a compact 3D patch-based CNN derived from AlexNet operates in a narrow band around the tumor boundary to perform voxel-level relabelling and contour smoothing.
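The abstract above does not spell out the three-slice refinement rule, so the following is only one plausible form, shown for a single (x, y) position across axial slices: a voxel is set to tumor when both neighbouring slices are tumor, and cleared when neither is. The rule itself and the name `refine_column` are assumptions, not the authors' stated method.

```python
def refine_column(labels):
    """Apply an assumed three-slice rule along the slice axis.

    labels : binary tumor labels (0 or 1) for one (x, y) position, one per slice.
    Each interior slice is compared with its two neighbours in the ORIGINAL
    labelling, so the result does not depend on traversal order.
    """
    out = list(labels)
    for z in range(1, len(labels) - 1):
        prev_label, next_label = labels[z - 1], labels[z + 1]
        if prev_label == 1 and next_label == 1:
            out[z] = 1  # fill a one-slice gap inside a lesion
        elif prev_label == 0 and next_label == 0:
            out[z] = 0  # drop a detection with no support in adjacent slices
    return out


column = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(refine_column(column))
```

Note that this particular rule would also erase a genuine lesion confined to one slice, so a practical version would likely need an additional confidence or size criterion.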
Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation
Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHal-Bench benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We plan to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
comment: 24 pages, 3 figures, 2 tables. Training-free self-correction framework for vision-language models. Code and implementation details will be released at: https://github.com/kassoumsanogo1/self-correcting-vlm-re-Attention.git
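Of the four uncertainty signals listed in the abstract above, token entropy is the simplest to write down. The sketch below computes the Shannon entropy of each next-token distribution and averages it over a response. The function names and the use of a plain mean are assumptions for illustration; how the framework weights or combines its signals is not stated in the abstract.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_distributions):
    """Average entropy over the decoding steps of a single response."""
    total = sum(token_entropy(p) for p in step_distributions)
    return total / len(step_distributions)


confident = [1.0, 0.0, 0.0, 0.0]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(token_entropy(confident), token_entropy(uncertain))
```

A uniform distribution over four tokens gives the maximum value, log 4, while a one-hot distribution gives zero, so higher values flag decoding steps where the model is less sure.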
ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images
City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation.
MeshRipple: Structured Autoregressive Generation of Artist-Meshes
Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
Exploring possible vector systems for faster training of neural networks with preconfigured latent spaces
The overall neural network (NN) performance is closely related to the properties of its embedding distribution in latent space (LS). It has recently been shown that predefined vector systems, specifically $A_n$ root system vectors, can be used as targets for latent space configurations (LSC) to ensure the desired LS structure. One of the main advantages of LSC is the possibility of training classifier NNs without classification layers, which facilitates training NNs on datasets with extremely large numbers of classes. This paper provides a more general overview of possible vector systems for NN training along with their properties and methods for vector system construction. These systems are used to configure the LS of encoders and visual transformers to significantly speed up LSC training on ImageNet-1K and on datasets with 50k-600k classes. It is also shown that using the minimum number of LS dimensions for a specific number of classes results in faster convergence. The latter has potential advantages for reducing the size of vector databases used to store NN embeddings.
comment: 9 pages, 5 figures, 1 table, 4 equations
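For readers unfamiliar with the $A_n$ root system mentioned in the abstract above, its standard construction is the set of vectors $e_i - e_j$ with $i \neq j$ in $\mathbb{R}^{n+1}$, which gives $n(n+1)$ roots lying in the hyperplane where coordinates sum to zero. The sketch below enumerates them. The function name `a_n_roots` is illustrative, and how the paper assigns these vectors to classes is not described in the abstract.

```python
from itertools import permutations


def a_n_roots(n):
    """Enumerate the A_n roots e_i - e_j (i != j) as integer tuples in R^(n+1)."""
    roots = []
    for i, j in permutations(range(n + 1), 2):
        vec = [0] * (n + 1)
        vec[i], vec[j] = 1, -1
        roots.append(tuple(vec))
    return roots


roots = a_n_roots(2)
print(len(roots))  # n * (n + 1) roots for A_n
for root in roots:
    print(root)
```

Every root has squared length 2, so the vectors are equally normalized targets; the count grows quadratically with n, which hints at why few latent dimensions can cover many classes.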
ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points WACV 2026
Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code are available at https://github.com/RyotaOkumura/ControlVP.
comment: Accepted to WACV 2026, 8 pages, supplementary included. Dataset and code: https://github.com/RyotaOkumura/ControlVP
SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation
Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
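To make the Jacobi-decoding ingredient in the abstract above concrete, the sketch below shows the plain deterministic (greedy) variant: a whole block of draft tokens is re-predicted at once from the previous draft until it stops changing. This is not SJD++ itself; the probabilistic verification and draft-token reuse are omitted, and `toy_next_token` is a hypothetical stand-in for a model's greedy next-token prediction.

```python
def toy_next_token(sequence):
    """Hypothetical 'model': the next token is the last token plus one, mod 10."""
    return (sequence[-1] + 1) % 10


def jacobi_decode(next_token, prefix, block_len, max_iters=50):
    """Greedy Jacobi decoding of block_len tokens after prefix.

    Every position is updated from the PREVIOUS draft, so in a real model
    each iteration corresponds to a single parallel forward pass.
    Returns (tokens, iterations_used).
    """
    draft = [0] * block_len  # arbitrary initial guess
    for iteration in range(1, max_iters + 1):
        updated = [next_token(prefix + draft[:i]) for i in range(block_len)]
        if updated == draft:  # fixed point: matches sequential greedy decoding
            return draft, iteration
        draft = updated
    return draft, max_iters


tokens, iterations = jacobi_decode(toy_next_token, prefix=[3], block_len=4)
print(tokens, iterations)
```

This toy function depends only on the previous token, so only one new position becomes correct per iteration and there is no step saving. Acceleration arises when several draft tokens are accepted in the same pass, which is what the probabilistic acceptance in speculative sampling is meant to encourage.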
MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT's high quality and scalability. The code is provided in the supplementary material.
Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior
Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
comment: 16 pages (including appendix)
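The claim in the abstract above that a graph Laplacian acts as a high-pass operator can be checked on a toy graph. The sketch below applies the combinatorial Laplacian L = D - A to a scalar signal on the nodes: a constant signal is mapped to zero, while a node that deviates from its neighbours produces a large response. The 4-node cycle and the function name `laplacian_apply` are illustrative assumptions; the paper's adaptive sparse graph and its exact spectral prior are not specified in the abstract.

```python
def laplacian_apply(adjacency, signal):
    """Compute (D - A) x for a symmetric adjacency matrix and a node signal.

    Row i of the result equals sum_j A[i][j] * (x[i] - x[j]), i.e. how much
    node i differs from its neighbours.
    """
    n = len(adjacency)
    return [
        sum(adjacency[i][j] * (signal[i] - signal[j]) for j in range(n))
        for i in range(n)
    ]


# 4-node cycle: each frame-node is connected to two neighbours.
cycle = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
print(laplacian_apply(cycle, [2, 2, 2, 2]))  # smooth signal is suppressed
print(laplacian_apply(cycle, [0, 0, 5, 0]))  # the outlier at node 2 is amplified
```

Following this with neighbourhood averaging, as a GCN layer does, smooths the result again, which matches the band-pass reading given in the abstract.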
Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance
While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.
Unified Video Editing with Temporal Reasoner
Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.
comment: Project Page: https://videocof.github.io/
Human Geometry Distribution for 3D Animation Generation
Generating realistic human geometry animations remains a challenging task, as it requires modeling natural clothing dynamics with fine-grained geometric details under limited data. To address these challenges, we propose two novel designs. First, we propose a compact distribution-based latent representation that enables efficient and high-quality geometry generation. We improve upon previous work by establishing a more uniform mapping between SMPL and avatar geometries. Second, we introduce a generative animation model that fully exploits the diversity of limited motion data. We focus on short-term transitions while maintaining long-term consistency through an identity-conditioned design. These two designs formulate our method as a two-stage framework: the first stage learns a latent space, while the second learns to generate animations within this latent space. We conducted experiments on both our latent space and animation model. We demonstrate that our latent space produces high-fidelity human geometry surpassing previous methods ($90\%$ lower Chamfer Dist.). The animation model synthesizes diverse animations with detailed and natural dynamics ($2.2 \times$ higher user study score), achieving the best results across all evaluation metrics.
KAN-Dreamer: Benchmarking Kolmogorov-Arnold Networks as Function Approximators in World Models
DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs' computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
comment: 23 pages, 8 figures, 3 tables
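To make the FastKAN idea in the KAN-Dreamer abstract concrete, here is a rough sketch of a layer whose edges carry learnable one-dimensional functions built from Gaussian radial basis functions on a fixed grid. It is plain Python rather than the authors' vectorized JAX code, and the function names, grid, and sizes are all hypothetical.

```python
import math
import random


def rbf_basis(x, centers, width):
    """Gaussian RBF features of a scalar x on a fixed grid of centers."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


def fastkan_layer(inputs, weights, centers, width):
    """Map d_in scalars to d_out scalars.

    weights[o][i][k] scales the k-th basis function of input i for output o,
    so every (input, output) edge has its own learnable 1-D function.
    """
    feats = [rbf_basis(x, centers, width) for x in inputs]
    return [
        sum(
            w_oik * feats[i][k]
            for i, w_oi in enumerate(w_o)
            for k, w_oik in enumerate(w_oi)
        )
        for w_o in weights
    ]


# Hypothetical sizes: 3 inputs, 2 outputs, 5 grid points on [-2, 2].
centers = [-2.0 + i for i in range(5)]
rng = random.Random(0)
weights = [[[rng.uniform(-1, 1) for _ in centers] for _ in range(3)] for _ in range(2)]
outputs = fastkan_layer([0.1, -0.5, 1.2], weights, centers, width=1.0)
```

Because the grid is fixed, the whole layer reduces to one basis expansion followed by a linear map, which is the property that makes such layers cheap to vectorize.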
When normalization hallucinates: unseen risks in AI-powered whole slide image processing SP
Whole slide image (WSI) normalization remains a vital preprocessing step in computational pathology. Increasingly driven by deep learning, these models learn to approximate data distributions from training examples. This often results in outputs that gravitate toward the average, potentially masking diagnostically important features. More critically, they can introduce hallucinated content, artifacts that appear realistic but are not present in the original tissue, posing a serious threat to downstream analysis. These hallucinations are nearly impossible to detect visually, and current evaluation practices often overlook them. In this work, we demonstrate that the risk of hallucinations is real and underappreciated. While many methods perform adequately on public datasets, we observe a concerning frequency of hallucinations when these same models are retrained and evaluated on real-world clinical data. To address this, we propose a novel image comparison measure designed to automatically detect hallucinations in normalized outputs. Using this measure, we systematically evaluate several well-cited normalization methods retrained on real-world data, revealing significant inconsistencies and failures that are not captured by conventional metrics. Our findings underscore the need for more robust, interpretable normalization techniques and stricter validation protocols in clinical deployment.
comment: 4 pages, accepted for oral presentation at SPIE Medical Imaging, 2026
Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either search via costly differentiable optimization, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts or training? In this paper, we provide an affirmative answer by proposing a novel Large Language Model (LLM)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by using LLMs to automatically find superior TAPs tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we propose a simple Direct Policy Optimization (DPO) based reinforcement learning scheme to enhance LLMs' reasoning by optimizing prompts, which constructs a positive feedback loop between the LLM and the MPQ task, enabling LLMs to generate better TAPs in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we believe that TAP will contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
Data-driven Exploration of Mobility Interaction Patterns
Understanding the movement behaviours of individuals and the way they react to the external world is a key component of any problem that involves the modelling of human dynamics at a physical level. In particular, it is crucial to capture the influence that the presence of an individual can have on the others. Important examples of applications include crowd simulation and emergency management, where the simulation of the mass of people passes through the simulation of the individuals, taking into consideration the others as part of the general context. While existing solutions basically start from some preconceived behavioural model, in this work we propose an approach that starts directly from the data, adopting a data mining perspective. Our method searches the data for mobility events that might be evidence of mutual interactions between individuals, and on top of them looks for complex, persistent patterns and time-evolving configurations of events. The study of these patterns can provide new insights into the mechanics of mobility interactions between individuals, which can potentially help in improving existing simulation models. We instantiate the general methodology on two real case studies, one on cars and one on pedestrians, and a full experimental evaluation is performed in terms of performance, parameter sensitivity, and interpretation of sample results.
InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.
comment: Project page: https://binlee26.github.io/InterAgent-Page
Reconstructing Objects along Hand Interaction Timelines in Egocentric Video
We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps - i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
comment: webpage: https://zhifanzhu.github.io/objects-along-hit
GlimmerNet: A Lightweight Grouped Dilated Depthwise Convolutions for UAV-Based Emergency Monitoring
Convolutional Neural Networks (CNNs) have proven highly effective for edge and mobile vision tasks due to their computational efficiency. While many recent works seek to enhance CNNs with global contextual understanding via self-attention-based Vision Transformers, these approaches often introduce significant computational overhead. In this work, we demonstrate that it is possible to retain strong global perception without relying on computationally expensive components. We present GlimmerNet, an ultra-lightweight convolutional network built on the principle of separating receptive field diversity from feature recombination. GlimmerNet introduces Grouped Dilated Depthwise Convolutions (GDBlocks), which partition channels into groups with distinct dilation rates, enabling multi-scale feature extraction at no additional parameter cost. To fuse these features efficiently, we design a novel Aggregator module that recombines cross-group representations using grouped pointwise convolution, significantly lowering parameter overhead. With just 31K parameters and 29% fewer FLOPs than the most recent baseline, GlimmerNet achieves a new state-of-the-art weighted F1-score of 0.966 on the UAV-focused AIDERv2 dataset. These results establish a new accuracy-efficiency trade-off frontier for real-time emergency monitoring on resource-constrained UAV platforms. Our implementation is publicly available at https://github.com/djordjened92/gdd-cnn.
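The two parameter claims in the GlimmerNet abstract (dilation adds no weights, and grouped pointwise convolution reduces them) follow from simple counting. The helper names and the channel sizes below are hypothetical and are not taken from the released implementation.

```python
def depthwise_conv_params(channels, kernel_size=3):
    """One k x k filter per channel; the dilation rate does not change the count."""
    return channels * kernel_size * kernel_size


def pointwise_conv_params(c_in, c_out, groups=1):
    """Weights of a bias-free 1x1 convolution; grouping divides them by `groups`."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out


def assign_dilations(channels, dilations):
    """Split channels into equal contiguous groups, one dilation rate per group.

    Returns (start, stop, dilation) triples describing each channel slice.
    """
    if channels % len(dilations):
        raise ValueError("channels must be divisible by the number of dilations")
    size = channels // len(dilations)
    return [(g * size, (g + 1) * size, d) for g, d in enumerate(dilations)]


# Hypothetical 64-channel block with four dilation groups.
groups = assign_dilations(64, [1, 2, 4, 8])
dense = pointwise_conv_params(64, 64)              # 4096 weights
grouped = pointwise_conv_params(64, 64, groups=4)  # 1024 weights
```

In this toy configuration the grouped 1x1 layer uses a quarter of the dense layer's weights, while the depthwise stage costs the same whether or not its filters are dilated.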
Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood WACV 2026
Test-time adaptation (TTA) enables efficient adaptation of deployed models, yet it often leads to poorly calibrated predictive uncertainty - a critical issue in high-stakes domains such as autonomous driving, finance, and healthcare. Existing calibration methods typically assume fixed models or static distributions, resulting in degraded performance under real-world, dynamic test conditions. To address these challenges, we introduce Style Invariance as a Correctness Likelihood (SICL), a framework that leverages style-invariance for robust uncertainty estimation. SICL estimates instance-wise correctness likelihood by measuring prediction consistency across style-altered variants, requiring only the model's forward pass. This makes it a plug-and-play, backpropagation-free calibration module compatible with any TTA method. Comprehensive evaluations across four baselines, five TTA methods, and two realistic scenarios with three model architectures demonstrate that SICL reduces calibration error by an average of 13 percentage points compared to conventional calibration approaches.
comment: Accepted to WACV 2026
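The core idea in the SICL abstract, scoring a prediction by how consistently it survives style changes, can be sketched in a few lines. This is only one plausible reading (a plain agreement rate over forward-pass predictions); the actual estimator and the way style variants are produced are not described in the abstract, and the function name is hypothetical.

```python
def correctness_likelihood(original_pred, variant_preds):
    """Fraction of style-altered variants whose predicted label matches the original.

    Only predicted labels are needed, so no gradients or backpropagation are
    involved. Assumption: simple agreement counting stands in for SICL's estimator.
    """
    if not variant_preds:
        raise ValueError("at least one style-altered prediction is required")
    agree = sum(1 for p in variant_preds if p == original_pred)
    return agree / len(variant_preds)


# Three of four style variants keep the label "car".
score = correctness_likelihood("car", ["car", "car", "truck", "car"])
```

A score near 1 suggests the prediction does not depend on style cues, whereas a low score flags an instance whose label flips under restyling and should therefore receive lower confidence.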
How Far are Modern Trackers from UAV-Anti-UAV? A Million-Scale Benchmark and New Baseline
Unmanned Aerial Vehicles (UAVs) offer wide-ranging applications but also pose significant safety and privacy violation risks in areas like airport and infrastructure inspection, spurring the rapid development of Anti-UAV technologies in recent years. However, current Anti-UAV research primarily focuses on RGB, infrared (IR), or RGB-IR videos captured by fixed ground cameras, with little attention to tracking target UAVs from another moving UAV platform. To fill this gap, we propose a new multi-modal visual tracking task termed UAV-Anti-UAV, which involves a pursuer UAV tracking a target adversarial UAV in the video stream. Compared to existing Anti-UAV tasks, UAV-Anti-UAV is more challenging due to severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target. To advance research in this domain, we construct a million-scale dataset consisting of 1,810 videos, each manually annotated with bounding boxes, a language prompt, and 15 tracking attributes. Furthermore, we propose MambaSTS, a Mamba-based baseline method for UAV-Anti-UAV tracking, which enables integrated spatial-temporal-semantic learning. Specifically, we employ Mamba and Transformer models to learn global semantic and spatial features, respectively, and leverage the state space model's strength in long-sequence modeling to establish video-level long-term context via a temporal token propagation mechanism. We conduct experiments on the UAV-Anti-UAV dataset to validate the effectiveness of our method. A thorough experimental evaluation of 50 modern deep tracking algorithms demonstrates that there is still significant room for improvement in the UAV-Anti-UAV domain. The dataset and code will be available at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.
comment: https://github.com/983632847/Awesome-Multimodal-Object-Tracking
LogicCBMs: Logic-Enhanced Concept-Based Learning WACV 2026
Concept Bottleneck Models (CBMs) provide a basis for semantic abstractions within a neural network architecture. Such models have primarily been seen through the lens of interpretability so far, wherein they offer transparency by inferring predictions as a linear combination of semantic concepts. However, a linear combination is inherently limiting. So we propose the enhancement of concept-based learning models through propositional logic. We introduce a logic module that is carefully designed to connect the learned concepts from CBMs through differentiable logic operations, such that our proposed LogicCBM can go beyond simple weighted combinations of concepts to leverage various logical operations to yield the final predictions, while maintaining end-to-end learnability. Composing concepts using a set of logic operators enables the model to capture inter-concept relations, while simultaneously improving the expressivity of the model in terms of logic operations. Our empirical studies on well-known benchmarks and synthetic datasets demonstrate that these models have better accuracy, perform effective interventions and are highly interpretable.
comment: 18 pages, 19 figures, WACV 2026
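One common way to make propositional operators differentiable, as the LogicCBMs abstract requires, is to relax them to real-valued functions on concept probabilities and learn a softmax mixture over operators. The sketch below uses the product relaxations; whether LogicCBM uses these exact relaxations or this operator set is an assumption, and all names are hypothetical.

```python
import math

# Product-style relaxations: exact on {0, 1}, smooth in between.
LOGIC_OPS = {
    "and": lambda a, b: a * b,
    "or": lambda a, b: a + b - a * b,
    "xor": lambda a, b: a + b - 2 * a * b,
    "nand": lambda a, b: 1 - a * b,
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_logic_gate(a, b, op_logits):
    """Softmax-weighted mixture of relaxed operators on two concept probabilities.

    op_logits has one learnable logit per entry of LOGIC_OPS (in dict order).
    """
    weights = softmax(op_logits)
    return sum(w * op(a, b) for w, op in zip(weights, LOGIC_OPS.values()))


# With equal logits the gate averages AND, OR, XOR and NAND of (1, 1).
mixed = soft_logic_gate(1.0, 1.0, [0.0, 0.0, 0.0, 0.0])
```

Since every operator is a polynomial in its inputs, gradients flow to both the concept predictors and the operator logits, which is what keeps such a module end-to-end trainable.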
Tessellation GS: Neural Mesh Gaussians for Robust Monocular Reconstruction of Dynamic Objects
3D Gaussian Splatting (GS) enables highly photorealistic scene reconstruction from posed image sequences but struggles with viewpoint extrapolation due to its anisotropic nature, leading to overfitting and poor generalization, particularly in sparse-view and dynamic scene reconstruction. We propose Tessellation GS, a structured 2D GS approach anchored on mesh faces, to reconstruct dynamic scenes from a single continuously moving or static camera. Our method constrains 2D Gaussians to localized regions and infers their attributes via hierarchical neural features on mesh faces. Gaussian subdivision is guided by an adaptive face subdivision strategy driven by a detail-aware loss function. Additionally, we leverage priors from a reconstruction foundation model to initialize Gaussian deformations, enabling robust reconstruction of general dynamic objects from a single static camera, which was previously extremely challenging for optimization-based methods. Our method outperforms the previous SOTA method, reducing LPIPS by 29.1% and Chamfer distance by 49.2% on appearance and mesh reconstruction tasks.
Enhancing Small Object Detection with YOLO: A Novel Framework for Improved Accuracy and Efficiency
This paper investigates and develops methods for detecting small objects in large-scale aerial images. Current approaches for detecting small objects in aerial images often involve image cropping and modifications to detector network architectures. Techniques such as sliding window cropping and architectural enhancements, including higher-resolution feature maps and attention mechanisms, are commonly employed. Given the growing importance of aerial imagery in various critical and industrial applications, the need for robust frameworks for small object detection becomes imperative. To address this need, we adopted the base SW-YOLO approach to enhance speed and accuracy in small object detection by refining cropping dimensions and overlap in sliding window usage and subsequently enhanced it through architectural modifications. We propose a novel model by modifying the base model architecture, including advanced feature extraction modules in the neck for feature map enhancement, integrating CBAM in the backbone to preserve spatial and channel information, and introducing a new head to boost small object detection accuracy. Finally, we compared our method with SAHI, one of the most powerful frameworks for processing large-scale images, and CZDet, which is also based on image cropping, achieving significant improvements in accuracy. The proposed model achieves significant accuracy gains on the VisDrone2019 dataset, outperforming baseline YOLOv5L detection by a substantial margin. Specifically, the final proposed model elevates the mAP .5.5 accuracy on the VisDrone2019 dataset from the base accuracy of 35.5 achieved by the YOLOv5L detector to 61.2. Notably, the accuracy of CZDet, which is another classic method applied to this dataset, is 58.36. This research demonstrates a significant improvement, achieving an increase in accuracy from 35.5 to 61.2.
comment: 22 pages, 16 figures
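Sliding-window cropping with overlap, which the small-object detection abstract above tunes, amounts to generating tile coordinates whose stride is smaller than the tile size. The sketch below is a generic version of that step; the tile size and overlap values are illustrative and are not the settings used in the paper.

```python
def sliding_window_tiles(width, height, tile, overlap_px):
    """Return (x0, y0, x1, y1) crops that cover the image.

    Neighbouring tiles share `overlap_px` pixels, and the last tile in each
    direction is shifted back so that it ends flush with the image border.
    """
    if not 0 <= overlap_px < tile:
        raise ValueError("overlap_px must satisfy 0 <= overlap_px < tile")
    stride = tile - overlap_px

    def starts(length):
        if length <= tile:
            return [0]  # image is smaller than one tile along this axis
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# A 1000 x 640 image with 640-pixel tiles and 128 pixels of overlap.
tiles = sliding_window_tiles(1000, 640, tile=640, overlap_px=128)
```

Larger overlap reduces the chance that a small object is cut by a tile border but increases the number of crops the detector must process, which is the speed and accuracy trade-off the abstract refers to.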
Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation WACV2026
Benefiting from the inductive biases learned from large-scale datasets, open-vocabulary semantic segmentation (OVSS) leverages the power of vision-language models, such as CLIP, to achieve remarkable progress without requiring task-specific training. However, due to CLIP's pre-training nature on image-text pairs, it tends to focus on global semantic alignment, resulting in suboptimal performance when associating fine-grained visual regions with text. This leads to noisy and inconsistent predictions, particularly in local areas. We attribute this to a dispersed bias stemming from its contrastive training paradigm, which is difficult to alleviate using CLIP features alone. To address this, we propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image. Specifically, we construct a region adjacency graph (RAG) based on low-level features (e.g., colour and texture) to capture local structural relationships and use it to refine CLIP features by enhancing local discrimination. Extensive experiments show that our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
comment: Accepted to WACV2026
A Geometric Unification of Concept Learning with Concept Cones
Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases (such as SAE type, sparsity, or expansion ratio) to the emergence of plausible concepts. (We adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations, which accurately reflect model computations, and plausible explanations, which align with human intuition and domain knowledge. CBM concepts are plausible by construction, being selected or annotated by humans, though not necessarily faithful to the true latent factors that organise the data manifold.) Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts.
comment: 22 pages
DeepAgent: A Dual Stream Multi Agent Fusion for Robust Multimodal Deepfake Detection
The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, an advanced multi-agent collaboration framework that simultaneously incorporates both visual and audio modalities for the effective detection of deepfakes. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify the signs of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and frame-reading sequences of images through EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework using three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, and indicates a strong capability across diverse datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi-agent approach in addressing diverse types of manipulations in deepfakes.
MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo: we categorize it into 7 representative tasks, curate a large-scale collection of high-quality source images, and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a large number of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.
comment: Project Page: https://MICo-150K.github.io/
Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting AAAI 2026
Versatile 3D tasks (e.g., generation or editing) that distill from Text-to-Image (T2I) diffusion models have attracted significant research interest for not relying on extensive 3D training data. However, T2I models exhibit limitations resulting from prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find different UNet layers show different effects of prior view in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a Semantic Guidance Tree (SGT) to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing. Extensive experiments firmly establish that TD-Attn has the potential to serve as a universal plugin, significantly enhancing multi-view consistency across 3D tasks.
comment: 15 pages, 8 figures, 5 tables, 2 algorithms, Accepted by AAAI 2026
Generalized Referring Expression Segmentation on Aerial Photos
Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Considering aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges because spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and objects with partial occlusions. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning across individual object instances, groups of instances, and semantic regions covering 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture, and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks, while maintaining strong accuracy under monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d .
comment: Submitted to IEEE J-STARS
The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers
Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a "U-shaped" entropy profile, compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this "Inductive Bottleneck" is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively "learning" a bottleneck to isolate semantic features.
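The abstract above does not define the Effective Encoding Dimension, so the following is only a stand-in: a common entropy-based effective rank, computed as the exponential of the Shannon entropy of the normalised covariance eigenvalues of a layer's features. The function name and the example spectra are hypothetical.

```python
import math


def effective_dimension(eigenvalues):
    """Entropy-based effective rank of a non-negative eigenvalue spectrum.

    Assumption: this stands in for the paper's EED, which may be defined
    differently. Returns a value between 1 and len(eigenvalues).
    """
    if any(v < 0 for v in eigenvalues):
        raise ValueError("eigenvalues must be non-negative")
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("spectrum must have positive total variance")
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A flat spectrum uses every direction; a peaked one collapses towards one.
flat = effective_dimension([1.0, 1.0, 1.0, 1.0])
peaked = effective_dimension([10.0, 0.1, 0.1, 0.1])
```

Computing such a quantity per layer and plotting it against depth would produce the kind of profile the abstract describes, with a dip in the middle layers indicating a bottleneck.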
ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation
Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: https://github.com/ziyang1106/ContextAnyone.
Reevaluating Automated Wildlife Species Detection: A Reproducibility Study on a Custom Image Dataset
This study revisits the findings of Carl et al., who evaluated the pre-trained Google Inception-ResNet-v2 model for automated detection of European wild mammal species in camera trap images. To assess the reproducibility and generalizability of their approach, we reimplemented the experiment from scratch using openly available resources and a different dataset consisting of 900 images spanning 90 species. After minimal preprocessing, we obtained an overall classification accuracy of 62%, closely aligning with the 71% reported in the original work despite differences in datasets. As in the original study, per-class performance varied substantially, as indicated by a macro F1 score of 0.28, highlighting limitations in generalization when labels do not align directly with ImageNet classes. Our results confirm that pretrained convolutional neural networks can provide a practical baseline for wildlife species identification but also reinforce the need for species-specific adaptation or transfer learning to achieve consistent, high-quality predictions.
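The gap between overall accuracy and macro F1 reported above arises whenever a few classes are recognised well and the rest poorly, because macro F1 weights every class equally. The toy labels below are invented for illustration and are unrelated to the study's data; averaging over every class that appears in either list is an assumption about the convention used.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented example: the model always answers "deer".
y_true = ["deer"] * 8 + ["lynx", "otter"]
y_pred = ["deer"] * 10
acc = accuracy(y_true, y_pred)   # 0.8
f1 = macro_f1(y_true, y_pred)    # about 0.30
```

Here accuracy is 0.8 while macro F1 is roughly 0.30, because the two rare species score zero and pull the class average down.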
Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts
Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery WACV 2026
Three-dimensional geospatial analysis is critical to applications in urban planning, climate adaptation, and environmental assessment. Current methodologies depend on costly, specialized sensors (e.g., LiDAR and multispectral), which restrict global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We hereby present Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning using RGB-only remote sensing imagery. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns. The benchmark encompasses 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal both the limitations of current VLMs and the effectiveness of domain adaptation. Geo3DVQA introduces new challenge frontiers for scalable, accessible, and holistic 3D geospatial analysis. The dataset and code will be released upon publication at https://github.com/mm1129/Geo3DVQA.
comment: Accepted to WACV 2026. Camera-ready-based version with minor edits for readability (no change in the contents)
Effective Attention-Guided Multi-Scale Medical Network for Skin Lesion Segmentation
In the field of healthcare, precise skin lesion segmentation is crucial for the early detection and accurate diagnosis of skin diseases. Despite significant advances in deep learning for image processing, existing methods have yet to effectively address the challenges of irregular lesion shapes and low contrast. To address these issues, this paper proposes an innovative encoder-decoder network architecture based on multi-scale residual structures, capable of extracting rich feature information from different receptive fields to effectively identify lesion areas. By introducing a Multi-Resolution Multi-Channel Fusion (MRCF) module, our method captures cross-scale features, enhancing the clarity and accuracy of the extracted information. Furthermore, we propose a Cross-Mix Attention Module (CMAM), which redefines the attention scope and dynamically calculates weights across multiple contexts, thus improving the flexibility and depth of feature capture and enabling deeper exploration of subtle features. To overcome the information loss caused by skip connections in traditional U-Net, an External Attention Bridge (EAB) is introduced, facilitating the effective utilization of information in the decoder and compensating for the loss during upsampling. Extensive experimental evaluations on several skin lesion segmentation datasets demonstrate that the proposed model significantly outperforms existing transformer and convolutional neural network-based models, showcasing exceptional segmentation accuracy and robustness.
comment: The paper has been accepted by BIBM 2025
RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation
Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to reduce sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT. Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
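As a rough illustration of how a BLEU-plus-ROUGE reward can feed GRPO, the sketch below combines two precomputed scores per sampled translation and normalizes the rewards within the sampling group. The equal weights, the example scores, and both function names are assumptions for illustration; the abstract does not specify how RVLF weights or computes the two terms.

```python
import statistics


def combined_reward(bleu, rouge, w_bleu=0.5, w_rouge=0.5):
    """Weighted sum of a fidelity score and a completeness score.

    The 0.5/0.5 weights are an assumption, not values from the paper.
    """
    return w_bleu * bleu + w_rouge * rouge


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical (BLEU, ROUGE) scores for four translations sampled
# from the same sign video.
scores = [(0.20, 0.40), (0.35, 0.50), (0.10, 0.30), (0.35, 0.40)]
rewards = [combined_reward(b, r) for b, r in scores]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples scoring above the group mean receive positive advantages and are reinforced; no separate value network is needed, which is the main design appeal of group-relative normalization.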
A graph generation pipeline for critical infrastructures based on heuristics, images and depth data
Virtual representations of physical critical infrastructures, such as water or energy plants, are used for simulations and digital twins to ensure resilience and continuity of their services. These models usually require 3D point clouds from laser scanners that are expensive to acquire and require specialist knowledge to use. In this article, we present a graph generation pipeline based on photogrammetry. The pipeline detects relevant objects and predicts their relation using RGB images and depth data generated by a stereo camera. This more cost-effective approach uses deep learning for object detection and instance segmentation of the objects, and employs user-defined heuristics or rules to infer their relations. Results of two hydraulic systems show that this strategy can produce graphs close to the ground truth while its flexibility allows the method to be tailored to specific applications and its transparency qualifies it to be used in the high stakes decision-making that is required for critical infrastructures.
Affine Subspace Models and Clustering for Patch-Based Image Denoising
Image tile-based approaches are popular in many image processing applications such as denoising (e.g., non-local means). A key step in their use is grouping the images into clusters, which usually proceeds by iteratively splitting the images into clusters and fitting a model for the images in each cluster. Linear subspaces have emerged as a suitable model for tile clusters; however, they are not well matched to image patches, given that images are non-negative and thus not distributed around the origin in the tile vector space. We study the use of affine subspace models for the clusters to better match the geometric structure of the image tile vector space. We also present a simple denoising algorithm that relies on the affine subspace clustering model using least squares projection. We review several algorithmic approaches to solve the affine subspace clustering problem and show experimental results that highlight the performance improvements in clustering and denoising.
comment: Asilomar Conference on Signals, Systems, and Computers 2025
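The difference between a linear and an affine subspace model can be seen in a small sketch: an affine model fits a mean plus a basis, so data that sits away from the origin is projected onto the right set. This is an illustrative NumPy example on synthetic data, not the authors' algorithm; the function names and the toy "patches" are assumptions, and the clustering step itself is omitted.

```python
import numpy as np


def fit_affine_subspace(patches, dim):
    """Fit a mean and the top-`dim` principal directions to the rows."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:dim]  # rows of the basis are orthonormal


def project(patch, mean, basis):
    """Least-squares projection onto {mean + basis.T @ c}."""
    coeffs = basis @ (patch - mean)
    return mean + basis.T @ coeffs


# Synthetic cluster: points on a line that does not pass through the origin.
offset = np.array([5.0, 5.0, 5.0])
direction = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)
ts = np.linspace(-1.0, 1.0, 20)
patches = offset + ts[:, None] * direction

clean = offset + 0.3 * direction
noisy = clean + np.array([0.0, 0.0, 0.2])  # perturbation off the line

# Affine model: mean plus one direction.
mean, basis = fit_affine_subspace(patches, dim=1)
affine_err = np.linalg.norm(project(noisy, mean, basis) - clean)

# Linear model: one direction through the origin (no centering).
linear_basis = np.linalg.svd(patches, full_matrices=False)[2][:1]
linear_err = np.linalg.norm(project(noisy, np.zeros(3), linear_basis) - clean)

print(f"affine error={affine_err:.3f}, linear error={linear_err:.3f}")
```

Because the basis rows are orthonormal, the least-squares projection reduces to a matrix product, so the denoising step is cheap once the cluster model is fitted.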
DGGAN: Degradation Guided Generative Adversarial Network for Real-time Endoscopic Video Enhancement
Endoscopic surgery relies on intraoperative video, making image quality a decisive factor for surgical safety and efficacy. Yet, endoscopic videos are often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which obscure critical anatomical details and complicate surgical manipulation. Although deep learning-based methods have shown promise in image enhancement, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose a degradation-aware framework for endoscopic video enhancement, which enables real-time, high-quality enhancement by propagating degradation representations across frames. In our framework, degradation representations are first extracted from images using contrastive learning. We then introduce a fusion mechanism that modulates image features with these representations to guide a single-frame enhancement model, which is trained with a cycle-consistency constraint between degraded and restored images to improve robustness and generalization. Experiments demonstrate that our framework achieves a superior balance between performance and efficiency compared with several state-of-the-art methods. These results highlight the effectiveness of degradation-aware modeling for real-time endoscopic video enhancement. They further suggest that implicitly learning and propagating degradation representations offers a practical pathway for clinical application.
comment: 18 pages, 8 figures, and 7 tables
See More, Change Less: Anatomy-Aware Diffusion for Contrast Enhancement
Image enhancement improves visual quality and helps reveal details that are hard to see in the original image. In medical imaging, it can support clinical decision-making, but current models often over-edit. This can distort organs, create false findings, and miss small tumors because these models do not understand anatomy or contrast dynamics. We propose SMILE, an anatomy-aware diffusion model that learns how organs are shaped and how they take up contrast. It enhances only clinically relevant regions while leaving all other areas unchanged. SMILE introduces three key ideas: (1) structure-aware supervision that follows true organ boundaries and contrast patterns; (2) registration-free learning that works directly with unaligned multi-phase CT scans; (3) unified inference that provides fast and consistent enhancement across all contrast phases. Across six external datasets, SMILE outperforms existing methods in image quality (14.2% higher SSIM, 20.6% higher PSNR, 50% better FID) and in clinical usefulness by producing anatomically accurate and diagnostically meaningful images. SMILE also improves cancer detection from non-contrast CT, raising the F1 score by up to 10 percent.
AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing
Recent studies have extended diffusion-based instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, this also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into a 3D Gaussian-represented safeguard. To ensure both the effectiveness and the invisibility of the adversarial perturbations, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding adversarial protection that is consistent across different viewpoints and generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.
comment: 40 pages, 34 figures, 18 tables
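The "strictly bounded" image-level perturbation relies on the projection step of projected gradient descent (PGD). The sketch below shows only a standard L-infinity PGD update on a flat list of pixel offsets; it is not AdLift's Lifted PGD, which additionally involves gradient truncation and image-to-Gaussian fitting. The function name, step size, and bound are hypothetical.

```python
def pgd_step(delta, grad, step_size, eps):
    """One signed-gradient step followed by projection onto [-eps, eps]."""
    updated = []
    for d, g in zip(delta, grad):
        sign = (g > 0) - (g < 0)  # -1, 0, or +1
        d_new = d + step_size * sign
        # Projection: clip so the perturbation never exceeds the bound.
        updated.append(max(-eps, min(eps, d_new)))
    return updated


# Hypothetical perturbation and gradient for three pixels.
delta = [0.0, 0.03, -0.03]
grad = [1.0, 2.0, -5.0]
new_delta = pgd_step(delta, grad, step_size=0.01, eps=0.03)
print(new_delta)
```

Using only the sign of the gradient makes the step size independent of gradient magnitude, and the clip guarantees the invisibility budget holds after every iteration.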
Zero-Shot Textual Explanations via Translating Decision-Critical Features
Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons -- i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER generates more faithful and interpretable explanations than existing methods. The code will be publicly released.
comment: 11+6 pages, 8 figures, 4 tables
Squeezed-Eff-Net: Edge-Computed Boost of Tomography Based Brain Tumor Classification leveraging Hybrid Neural Network Architecture
Brain tumors are among the most common and dangerous neurological diseases and require timely, correct diagnosis to guide treatment. Even with the widespread use of magnetic resonance imaging (MRI), tumor delineation remains difficult, time-consuming, and prone to inter-observer error. To overcome these limitations, this work proposes a hybrid deep learning model that combines the lightweight SqueezeNet v1 with the high-performing EfficientNet-B0 and is enhanced with handcrafted radiomic descriptors, including Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gabor filters, and wavelet transforms. The framework was trained and tested only on the publicly available Nickparvar Brain Tumor MRI dataset, which consists of 7,023 contrast-enhanced T1-weighted axial MRI slices categorized into four groups: glioma, meningioma, pituitary tumor, and no tumor. The model reached a testing accuracy of 98.93%, rising to 99.08% with Test Time Augmentation (TTA), indicating strong generalization. Compared with current deep learning architectures, the proposed hybrid network offers a compromise between computational efficiency and diagnostic accuracy, requiring fewer than 2.1 million parameters and less than 1.2 GFLOPs. The handcrafted features increased sensitivity to texture, while the EfficientNet-B0 backbone captured intricate hierarchical features. The resulting model approaches clinical reliability in automated MRI-based tumor classification, highlighting its potential for use in clinical decision-support systems.
Unified Camera Positional Encoding for Controlled Video Generation
Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at https://github.com/chengzhag/UCPE.
comment: Code: https://github.com/chengzhag/UCPE
Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models
Dropout is a widely used regularization technique that improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which applies dropout to improve the robustness of vision-language models. Unlike vanilla dropout, we apply dropout to the tokens of the textual and visual branches, evaluating token significance by considering both intra-modal context and inter-modal alignment, which enables flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios like low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% in performance on base-to-novel generalization.
STRinGS: Selective Text Refinement in Gaussian Splatting WACV 2026
Text in the form of signs, labels, or instructions is a critical element of real-world scenes because it can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) struggle to preserve fine-grained text details even while achieving high visual fidelity. Small errors in textual element reconstruction can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework to address this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure, OCR Character Error Rate (CER), to evaluate efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset, STRinGS-360, with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.
comment: Accepted to WACV 2026. Project Page, see https://STRinGS-official.github.io
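Character Error Rate is conventionally the edit distance between a reference string and a recognized string, divided by the reference length. The sketch below implements that common definition in plain Python; the exact OCR engine and normalization used for the STRinGS metric are not stated in the abstract, so the example strings and function names are assumptions.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate relative to the reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical OCR output read from a rendered view of a sign.
print(cer("EXIT", "EX1T"))
```

A lower CER means the rendered text is closer to what an OCR system reads from the ground-truth image, which is why it serves as a readability proxy.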
ReLKD: Inter-Class Relation Learning with Knowledge Distillation for Generalized Category Discovery ECAI 2025
Generalized Category Discovery (GCD) faces the challenge of categorizing unlabeled data containing both known and novel classes, given only labels for known classes. Previous studies often treat each class independently, neglecting the inherent inter-class relations. Obtaining such inter-class relations directly presents a significant challenge in real-world scenarios. To address this issue, we propose ReLKD, an end-to-end framework that effectively exploits implicit inter-class relations and leverages this knowledge to enhance the classification of novel classes. ReLKD comprises three key modules: a target-grained module for learning discriminative representations, a coarse-grained module for capturing hierarchical class relations, and a distillation module for transferring knowledge from the coarse-grained module to refine the target-grained module's representation learning. Extensive experiments on four datasets demonstrate the effectiveness of ReLKD, particularly in scenarios with limited labeled data. The code for ReLKD is available at https://github.com/ZhouF-ECNU/ReLKD.
comment: Accepted to the Main Track of the 28th European Conference on Artificial Intelligence (ECAI 2025). To appear in the proceedings published by IOS Press (DOI: 10.3233/FAIA413)
Towards Robust Protective Perturbation against DeepFake Face Swapping
DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robustness is highly sensitive to the choice of training transformations, making the standard Expectation over Transformation (EOT) with uniform sampling fundamentally suboptimal. Motivated by this, we propose Expectation Over Learned distribution of Transformation (EOLT), a framework that treats the transformation distribution as a learnable component rather than a fixed design choice. Specifically, EOLT employs a policy network that learns to automatically prioritize critical transformations and adaptively generate instance-specific perturbations via reinforcement learning, enabling explicit modeling of defensive bottlenecks while maintaining broad transferability. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art approaches, with 26% higher average robustness and up to 30% gains on challenging transformation categories.
Clinical Interpretability of Deep Learning Segmentation Through Shapley-Derived Agreement and Uncertainty Metrics
Segmentation is the identification of anatomical regions of interest, such as organs, tissue, and lesions, serving as a fundamental task in computer-aided diagnosis in medical imaging. Although deep learning models have achieved remarkable performance in medical image segmentation, the need for explainability remains critical for ensuring their acceptance and integration in clinical practice, despite the growing research attention in this area. Our approach explored the use of contrast-level Shapley values, a systematic perturbation of model inputs to assess feature importance. While other studies have investigated gradient-based techniques through identifying influential regions in imaging inputs, Shapley values offer a broader, clinically aligned approach, explaining how model performance is fairly attributed to certain imaging contrasts over others. Using the BraTS 2024 dataset, we generated rankings for Shapley values for four MRI contrasts across four model architectures. Two metrics were proposed from the Shapley ranking: agreement between model and "clinician" imaging ranking, and uncertainty quantified through Shapley ranking variance across cross-validation folds. Higher-performing cases (Dice > 0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: $r=-0.581$). These metrics provide clinically interpretable proxies for model reliability, helping clinicians better understand state-of-the-art segmentation models.
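With only four input contrasts, Shapley values can be computed exactly by enumerating every subset. The sketch below does this for a toy value function standing in for "Dice score when only these contrasts are provided". The contrast names, the per-contrast contributions, and the interaction bonus are invented for illustration and are not results from the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability weight of a coalition of size k preceding p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Toy value function (assumption): additive contributions plus a bonus
# when two contrasts are available together.
BASE = {"T1c": 0.40, "FLAIR": 0.30, "T2": 0.10, "T1": 0.05}


def toy_dice(subset):
    score = sum(BASE[c] for c in subset)
    if {"T1c", "FLAIR"} <= subset:
        score += 0.10
    return score


phi = shapley_values(list(BASE), toy_dice)
ranking = sorted(phi, key=phi.get, reverse=True)
print(phi, ranking)
```

The values sum to the score with all contrasts present minus the score with none, and the interaction bonus is split evenly between the two contrasts that create it; that fairness property is what makes the resulting ranking comparable to a clinician's ordering.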
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.
Object Pose Distribution Estimation for Determining Revolution and Reflection Uncertainty in Point Clouds
Object pose estimation is crucial to robotic perception and typically provides a single-pose estimate. However, a single estimate cannot capture pose uncertainty deriving from visual ambiguity, which can lead to unreliable behavior. Existing pose distribution methods rely heavily on color information, often unavailable in industrial settings. We propose a novel neural network-based method for estimating object pose uncertainty using only 3D colorless data. To the best of our knowledge, this is the first approach that leverages deep learning for pose distribution estimation without relying on RGB input. We validate our method in a real-world bin picking scenario with objects of varying geometric ambiguity. Our current implementation focuses on symmetries in reflection and revolution, but the framework is extendable to full SE(3) pose distribution estimation. Source code available at opde3d.github.io
comment: 8 pages, 8 figures, 5 tables, ICCR 2025
AutoLugano: A Deep Learning Framework for Fully Automated Lymphoma Segmentation and Lugano Staging on FDG-PET/CT
Purpose: To develop a fully automated deep learning system, AutoLugano, for end-to-end lymphoma classification by performing lesion segmentation, anatomical localization, and automated Lugano staging from baseline FDG-PET/CT scans. Methods: The AutoLugano system processes baseline FDG-PET/CT scans through three sequential modules: (1) Anatomy-Informed Lesion Segmentation, in which a 3D nnU-Net model trained on multi-channel inputs performs automated lesion detection; (2) Atlas-based Anatomical Localization, which leverages the TotalSegmentator toolkit to map segmented lesions to 21 predefined lymph node regions using deterministic anatomical rules; and (3) Automated Lugano Staging, where the spatial distribution of involved regions is translated into Lugano stages and therapeutic groups (Limited vs. Advanced Stage). The system was trained on the public autoPET dataset (n=1,007) and externally validated on an independent cohort of 67 patients. Performance was assessed using accuracy, sensitivity, specificity, and F1-score for regional involvement detection and staging agreement. Results: On the external validation set, the proposed model demonstrated robust performance, achieving an overall accuracy of 88.31%, sensitivity of 74.47%, specificity of 94.21%, and an F1-score of 80.80% for regional involvement detection, outperforming baseline models. Most notably, for the critical clinical task of therapeutic stratification (Limited vs. Advanced Stage), the system achieved a high accuracy of 85.07%, with a specificity of 90.48% and a sensitivity of 82.61%. Conclusion: AutoLugano represents the first fully automated, end-to-end pipeline that translates a single baseline FDG-PET/CT scan into a complete Lugano stage. This study demonstrates its strong potential to assist in initial staging, treatment stratification, and supporting clinical decision-making.
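The stratification figures can be sanity-checked from a 2x2 confusion matrix. The counts below are not reported in the abstract; they are one set back-calculated to be consistent with the stated accuracy, sensitivity, and specificity on 67 patients, under the assumption that Advanced Stage is the positive class. The function name is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # recall on the positive class
        "specificity": tn / (tn + fp),  # recall on the negative class
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Inferred counts (assumption, not from the paper): 46 positive and
# 21 negative patients, 67 in total.
metrics = binary_metrics(tp=38, fp=2, tn=19, fn=8)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts the accuracy, sensitivity, and specificity come out to 85.07%, 82.61%, and 90.48%, so the three reported stratification figures are mutually consistent with a 67-patient cohort.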
MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
comment: 7 pages, 1 figure
Understanding Diffusion Models via Code Execution
Diffusion models have achieved remarkable performance in generative modeling, yet their theoretical foundations are often intricate, and the gap between mathematical formulations in papers and practical open-source implementations can be difficult to bridge. Existing tutorials primarily focus on deriving equations, offering limited guidance on how diffusion models actually operate in code. To address this, we present a concise implementation of approximately 300 lines that explains diffusion models from a code-execution perspective. Our minimal example preserves the essential components -- including forward diffusion, reverse sampling, the noise-prediction network, and the training loop -- while removing unnecessary engineering details. This technical report aims to provide researchers with a clear, implementation-first understanding of how diffusion models work in practice and how code and theory correspond. Our code and pre-trained models are available at: https://github.com/disanda/GM/tree/main/DDPM-DDIM-ClassifierFree.
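To give a flavor of the code-first view, the sketch below implements the forward diffusion step in closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, in plain Python. It is not taken from the linked repository. The linear beta schedule from 1e-4 to 0.02 over 1000 steps follows the common DDPM default and is an assumption here, as are the function names.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t given x_0 in one shot; also return the noise used."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, noise)]
    return x_t, noise


alpha_bars = make_alpha_bars()
x0 = [1.0, -1.0]  # a toy two-"pixel" image
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
print(alpha_bars[0], alpha_bars[-1], x_t)
```

The noise returned by `q_sample` is exactly the regression target for a noise-prediction network: training draws a random `t`, forms `x_t` this way, and penalizes the squared error between predicted and true noise.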
Generating Storytelling Images with Rich Chains-of-Reasoning
An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.
SUCCESS-GS: Survey of Compactness and Compression for Efficient Static and Dynamic Gaussian Splatting
3D Gaussian Splatting (3DGS) has emerged as a powerful explicit representation enabling real-time, high-fidelity 3D reconstruction and novel view synthesis. However, its practical use is hindered by the massive memory and computational demands required to store and render millions of Gaussians. These challenges become even more severe in 4D dynamic scenes. To address these issues, the field of Efficient Gaussian Splatting has rapidly evolved, proposing methods that reduce redundancy while preserving reconstruction quality. This survey provides the first unified overview of efficient 3D and 4D Gaussian Splatting techniques. For both 3D and 4D settings, we systematically categorize existing methods into two major directions, Parameter Compression and Restructuring Compression, and comprehensively summarize the core ideas and methodological trends within each category. We further cover widely used datasets, evaluation metrics, and representative benchmark comparisons. Finally, we discuss current limitations and outline promising research directions toward scalable, compact, and real-time Gaussian Splatting for both static and dynamic 3D scene representation.
comment: The first three authors contributed equally to this work. The last two authors are co-corresponding authors. Please visit our project page at https://cmlab-korea.github.io/Awesome-Efficient-GS/
HVQ-CGIC: Enabling Hyperprior Entropy Modeling for VQ-Based Controllable Generative Image Compression
Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leads to untapped bitrate potential and challenges in achieving flexible rate control. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ Hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior to the VQ indices entropy model. Building on this foundation and a novel loss design, this framework is, to our knowledge, the first to introduce rate-distortion (RD) balance and control into vector quantization-based Generative Image Compression. Together with a lightweight hyperprior estimation network, HVQ-CGIC achieves a significant advantage in RD performance compared to current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the HyperPrior framework in neural image compression.
comment: 12 pages, 7 figures
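To make the point about static versus content-adaptive entropy models concrete, here is a toy sketch of the ideal code length of a sequence of VQ indices under two probability tables. It does not reproduce the HVQ-CGIC entropy model: the codebook size, the index sequence, and the adaptive probabilities are hypothetical values, and the cost of transmitting any side information is ignored.

```python
import math

def code_length_bits(indices, probs):
    """Ideal entropy-coding cost: sum of -log2 p(index) over the sequence."""
    return sum(-math.log2(probs[i]) for i in indices)

codebook_size = 4
indices = [0, 0, 0, 1, 0, 0, 2, 0]  # toy VQ indices for one image

# Static global prior: every codeword is assumed equally likely.
global_probs = [1.0 / codebook_size] * codebook_size

# Hypothetical content-adaptive probabilities, standing in for the
# output of a learned per-image predictor.
adaptive_probs = [0.7, 0.1, 0.1, 0.1]

static_bits = code_length_bits(indices, global_probs)
adaptive_bits = code_length_bits(indices, adaptive_probs)
```

When the adaptive table puts more mass on the indices that actually occur, `adaptive_bits` falls below `static_bits`; a real system must also pay for whatever side information conveys the adaptive table.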
RefLSM: Linearized Structural-Prior Reflectance Model for Medical Image Segmentation and Bias-Field Correction
Medical image segmentation remains challenging due to intensity inhomogeneity, noise, blurred boundaries, and irregular structures. Traditional level set methods, while effective in certain cases, often depend on approximate bias field estimations and therefore struggle under severe non-uniform imaging conditions. To address these limitations, we propose a novel variational Reflectance-based Level Set Model (RefLSM), which explicitly integrates Retinex-inspired reflectance decomposition into the segmentation framework. By decomposing the observed image into reflectance and bias field components, RefLSM directly segments the reflectance, which is invariant to illumination and preserves fine structural details. Building on this foundation, we introduce two key innovations for enhanced precision and robustness. First, a linear structural prior steers the smoothed reflectance gradients toward a data-driven reference, providing reliable geometric guidance in noisy or low-contrast scenes. Second, a relaxed binary level-set is embedded in RefLSM and enforced via convex relaxation and sign projection, yielding stable evolution and avoiding reinitialization-induced diffusion. The resulting variational problem is solved efficiently using an ADMM-based optimization scheme. Extensive experiments on multiple medical imaging datasets demonstrate that RefLSM achieves superior segmentation accuracy, robustness, and computational efficiency compared to state-of-the-art level set methods.
Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification
Modern deep neural networks have shown remarkable performance in medical image classification. However, such networks either emphasize pixel-intensity features instead of fundamental anatomical structures (e.g., those encoded by topological invariants), or they capture only simple topological features via single-parameter persistence. In this paper, we propose a new topology-guided classification framework that extracts multi-scale and multi-filtration persistent topological features and integrates them into vision classification backbones. For an input image, we first compute cubical persistence diagrams (PDs) across multiple image resolutions/scales. We then develop a "vineyard" algorithm that consolidates these PDs into a single, stable diagram capturing signatures at varying granularities, from global anatomy to subtle local irregularities that may indicate early-stage disease. To further exploit richer topological representations produced by multiple filtrations, we design a cross-attention-based neural network that directly processes the consolidated final PDs. The resulting topological embeddings are fused with feature maps from CNNs or Transformers. By integrating multi-scale and multi-filtration topologies into an end-to-end architecture, our approach enhances the model's capacity to recognize complex anatomical structures. Evaluations on three public datasets show consistent, considerable improvements over strong baselines and state-of-the-art methods, demonstrating the value of our comprehensive topological perspective for robust and interpretable medical image classification.
START: Spatial and Textual Learning for Chart Understanding WACV2026
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
comment: WACV2026 Camera Ready
TV2TV: A Unified Framework for Interleaved Language and Video Generation
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
Normalize Filters! Classical Wisdom for Deep Vision
Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency and interpretability and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same way, or at all. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
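The classical property the abstract appeals to can be checked on a 1D toy signal. The sketch below is not the paper's normalization scheme: it assumes the intensity change can be modeled as an affine map gain * x + offset, and the kernels and helper names are illustrative. A unit-sum averaging kernel carries the affine change through unchanged, while a zero-sum differencing kernel cancels the offset.

```python
def correlate_valid(signal, kernel):
    """Sliding dot product without padding ("valid" mode)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def normalize_averaging(kernel):
    """Rescale a positive kernel so its weights sum to one."""
    total = sum(kernel)
    return [w / total for w in kernel]

signal = [1.0, 2.0, 4.0, 3.0, 5.0]
gain, offset = 2.0, 10.0  # assumed affine intensity change
shifted = [gain * s + offset for s in signal]

avg_kernel = normalize_averaging([1.0, 2.0, 1.0])  # weights sum to 1
diff_kernel = [-0.5, 0.0, 0.5]                     # weights sum to 0

avg_out = correlate_valid(signal, avg_kernel)
avg_out_shifted = correlate_valid(shifted, avg_kernel)    # gain * avg_out + offset
diff_out = correlate_valid(signal, diff_kernel)
diff_out_shifted = correlate_valid(shifted, diff_kernel)  # gain * diff_out
```

An unnormalized kernel such as `[1.0, 2.0, 1.0]` would instead scale the offset by the kernel sum, which is the kind of intensity shift the normalization avoids.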
MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos
Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially those that require fine-grained dexterous control. Such complex, dexterous skills with precise controls are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation. To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation tasks. We present MAPLE, a novel method for dexterous robotic manipulation that learns features to predict object contact points and detailed hand poses at the moment of contact from egocentric images. We then use the learned features to train policies for downstream manipulation tasks. Experimental results demonstrate the effectiveness of MAPLE across 4 existing simulation benchmarks, as well as a newly designed set of 4 challenging simulation tasks requiring fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a 17 DoF dexterous robotic hand, whereas the simultaneous evaluation across both simulation and real-world experiments has remained underexplored in prior work. We additionally showcase the efficacy of our model on an egocentric contact point prediction task, validating its usefulness beyond dexterous manipulation policy learning.
Tyche: Stochastic In-Context Learning for Medical Image Segmentation CVPR
Existing learning-based solutions to medical image segmentation have two important shortcomings. First, for most new segmentation tasks, a new model has to be trained or fine-tuned. This requires extensive resources and machine learning expertise, and is therefore often infeasible for medical researchers and clinicians. Second, most existing segmentation methods produce a single deterministic segmentation mask for a given image. In practice, however, there is often considerable uncertainty about what constitutes the correct segmentation, and different expert annotators will often segment the same image differently. We tackle both of these problems with Tyche, a model that uses a context set to generate stochastic predictions for previously unseen tasks without the need to retrain. Tyche differs from other in-context segmentation methods in two important ways. (1) We introduce a novel convolution block architecture that enables interactions among predictions. (2) We introduce in-context test-time augmentation, a new mechanism to provide prediction stochasticity. When combined with appropriate model design and loss functions, Tyche can predict a set of plausible diverse segmentation candidates for new or unseen medical images and segmentation tasks without the need to retrain.
comment: Accepted at IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) 2024 as a highlight. Code available at https://github.com/mariannerakic/tyche
Asynchronous Bioplausible Neuron for SNN for Event Vision
Spiking Neural Networks (SNNs) offer a biologically inspired approach to computer vision that can lead to more efficient processing of visual data with reduced energy consumption. However, maintaining homeostasis within these networks is challenging, as it requires continuous adjustment of neural responses to preserve equilibrium and optimal processing efficiency amidst diverse and often unpredictable input signals. In response to these challenges, we propose the Asynchronous Bioplausible Neuron (ABN), a dynamic spike firing mechanism to auto-adjust the variations in the input signal. Comprehensive evaluation across various datasets demonstrates ABN's enhanced performance in image classification and segmentation, maintenance of neural equilibrium, and energy efficiency.
comment: 10 pages
Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey
Image restoration (IR) has been an indispensable and challenging task in the low-level vision field, which strives to improve the subjective quality of images distorted by various forms of degradation. Recently, the diffusion model has achieved significant advancements in the visual generation of AIGC, thereby raising an intuitive question, "whether diffusion model can boost image restoration". To answer this, some pioneering studies attempt to integrate diffusion models into the image restoration task, achieving better performance than previous GAN-based methods. Despite that, a comprehensive and enlightening survey on diffusion model-based image restoration remains scarce. In this paper, we are the first to present a comprehensive review of recent diffusion model-based methods on image restoration, encompassing the learning paradigm, conditional strategy, framework design, modeling strategy, and evaluation. Concretely, we first introduce the background of the diffusion model briefly and then present two prevalent workflows that exploit diffusion models in image restoration. Subsequently, we classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR, intending to inspire future development. To evaluate existing methods thoroughly, we summarize the commonly used datasets, implementation details, and evaluation metrics. Additionally, we present an objective comparison of open-sourced methods across three tasks: image super-resolution, deblurring, and inpainting. Ultimately, informed by the limitations in existing works, we propose five potential and challenging directions for the future research of diffusion model-based IR, including sampling efficiency, model compression, distortion simulation and estimation, distortion invariant learning, and framework design.
comment: Accepted by IJCV 2025
Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for Robotic Grasping
Object segmentation for robotic grasping under dynamic conditions often faces challenges such as occlusion, low light conditions, motion blur and object size variance. To address these challenges, we propose a Deep Learning network that fuses two types of visual signals, event-based data and RGB frame data. The proposed Bimodal SegNet network has two distinct encoders, one for each signal input, and a spatial pyramidal pooling with atrous convolutions. Encoders capture rich contextual information by pooling the concatenated features at different resolutions while the decoder obtains sharp object boundaries. The proposed method is evaluated under five image degradation challenges, namely occlusion, blur, brightness, trajectory and scale variance, on the Event-based Segmentation (ESD) Dataset. The evaluation results show a 6-10% segmentation accuracy improvement over state-of-the-art methods in terms of mean intersection over union and pixel accuracy. The model code is available at https://github.com/sanket0707/Bimodal-SegNet.git
comment: 8 Pages
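For readers unfamiliar with the two metrics reported above, the following sketch computes pixel accuracy and mean intersection over union on flattened label maps. It is a generic illustration, not the evaluation code from the linked repository; the function names, the toy labels, and the choice to skip classes absent from both maps are assumptions.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / len(target)

def mean_iou(pred, target, num_classes):
    """Average per-class intersection over union on flattened label maps."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 1, 0]    # toy flattened prediction
target = [0, 1, 1, 1, 0, 0]  # toy flattened ground truth

acc = pixel_accuracy(pred, target)
miou = mean_iou(pred, target, num_classes=2)
```

On this toy input, four of six pixels match, and each class has an intersection of 2 over a union of 4.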
Bionetta: Efficient Client-Side Zero-Knowledge Machine Learning Proving
In this report, we compare the performance of our UltraGroth-based zero-knowledge machine learning framework Bionetta to other tools of similar purpose such as EZKL, Lagrange's deep-prove, or zkml. The results show a significant boost in the proving time for custom-crafted neural networks: they can be proven even on mobile devices, enabling numerous client-side proving applications. While our scheme increases the cost of one-time preprocessing steps, such as circuit compilation and generating the trusted setup, our approach is, to the best of our knowledge, the only one that is deployable on native EVM smart contracts without overwhelming proof size and verification overheads.
TranSplat: Instant Cross-Scene Object Relighting in Gaussian Splatting via Spherical Harmonic Transfer
We present TranSplat, a method for fast and accurate object relighting for the 3D Gaussian Splatting (GS) framework when transferring a 3D object from a source GS scene to a target GS scene. TranSplat is based on a theoretical radiance transfer identity for cross-scene relighting of objects with radially symmetric BRDFs that involves only taking simple products of spherical harmonic appearance coefficients of the object, source, and target environment maps without any explicit computation of scene quantities (e.g., the BRDFs themselves). TranSplat is the first method to demonstrate how this theoretical identity may be used to perform relighting within the GS framework and, furthermore, it does so by automatically inferring unknown source and target environment maps directly from the source and target scene GS representations. We evaluated TranSplat on several synthetic and real-world scenes and objects, demonstrating comparable 3D object relighting performance to recent conventional inverse rendering-based GS methods with a fraction of their runtime. While TranSplat is theoretically best-suited for radially symmetric BRDFs, results demonstrate that TranSplat still offers perceptually realistic renderings on real scenes and opens a valuable, lightweight path forward to relighting with the GS framework.
Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling
Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation
Advances in architectural design, data availability, and compute have driven remarkable progress in semantic segmentation. Yet, these models often rely on relaxed Bayesian assumptions, omitting critical uncertainty information needed for robust decision-making. Despite growing interest in probabilistic segmentation to address point-estimate limitations, the research landscape remains fragmented. In response, this review synthesizes foundational concepts in uncertainty modeling, analyzing how feature- and parameter-distribution modeling impact four key segmentation tasks: Observer Variability, Active Learning, Model Introspection, and Model Generalization. Our work establishes a common framework by standardizing theory, notation, and terminology, thereby bridging the gap between method developers, task specialists, and applied researchers. We then discuss critical challenges, including the nuanced distinction between uncertainty types, strong assumptions in spatial aggregation, the lack of standardized benchmarks, and pitfalls in current quantification methods. We identify promising avenues for future research, such as uncertainty-aware active learning, data-driven benchmarks, transformer-based models, and novel techniques to move from simple segmentation problems to uncertainty in holistic scene understanding. Based on our analysis, we offer practical guidelines for researchers on method selection, evaluation, reproducibility, and meaningful uncertainty estimation. Ultimately, our goal is to facilitate the development of more reliable, efficient, and interpretable segmentation models that can be confidently deployed in real-world applications.
comment: TMLR
Exploring the Potential of Encoder-free Architectures in 3D LMMs
Encoder-free architectures have been preliminarily explored in 2D Large Multimodal Models (LMMs), yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D LMMs. These long-standing challenges include the failure to adapt to varying point cloud resolutions during inference and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the pre-trained encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. We also present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the state-of-the-art model, PointLLM-PiSA-13B, achieving 57.91%, 61.0%, and 55.20% on the classification, captioning, and VQA tasks, respectively. Our results show that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
Distribution Matching Distillation Meets Reinforcement Learning
Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that incorporates Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR achieves leading visual quality and prompt coherence among few-step methods, and even exhibits performance that exceeds the multi-step teacher.
comment: The synergy of reinforcement learning and distribution matching distillation. See more: https://github.com/vvvvvjdy/dmdr
MM-ACT: Learn from Multimodal Parallel Generation to Act
A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performance, respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our code, models and data at https://github.com/HHYHRHY/MM-ACT.
comment: 17 pages
DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
The unified autoregressive (AR) model excels at multimodal understanding and generation. However, its full potential in the domain of customized image generation has yet to be fully realized. Existing customization approaches for unified AR models face a fundamental dilemma: adaptation-based methods suffer from overfitting and scalability bottlenecks, while concept-injection paradigms are constrained by a shallow injection strategy that leads to poor visual fidelity and impaired re-contextualization. To address this, we propose DCoAR, a novel deep concept injection framework that maintains a completely frozen pre-trained model. DCoAR deeply integrates new concepts through a Layer-wise Multimodal Context Learning (LMCL) strategy, which is stabilized by a multi-faceted regularization scheme: a Dual Prior Preservation (DPP) loss to mitigate semantic drift and a Context-Aware Self-Regularization (CASR) loss to enhance re-contextualization. The framework also enables training-free subject customization in user-provided styles. Experiments demonstrate that DCoAR significantly outperforms previous injection-based methods and achieves performance competitive with adaptation-based approaches while requiring substantially fewer trainable parameters. Code: https://github.com/KZF-kzf/CoAR
WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring deep task depth and broad knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps, i.e., Awareness, Interpretation, and Imagination, each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities. The benchmark, evaluation code, and the generated images of each model will be made publicly available soon. Project Page: https://qnancy.github.io/wiseedit_project_page/.
comment: 32 pages, 20 figures. Project Page: https://qnancy.github.io/wiseedit_project_page/. Benchmark: https://huggingface.co/datasets/123123chen/WiseEdit-Benchmark/
A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
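For readers interpreting the PSNR figures above, the conventional definition is 10 * log10(MAX^2 / MSE). The following is a minimal sketch in Python, assuming intensities normalized to [0, 1] and a flat list of voxel values; the abstract does not state the intensity range or the evaluation mask used by the challenge, so `max_value` and the flattening are assumptions made for illustration only.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists.

    `max_value` is the assumed peak intensity (1.0 for normalized data).
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of voxels")
    # Mean squared error over all voxels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on normalized intensities gives roughly 20 dB.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Under this normalization, 20 dB corresponds to a root-mean-square error of 0.1, so values of 17.4 to 18.5 would indicate a somewhat larger average error.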
Tokenizing Motion: A Generative Approach for Scene Dynamics Compression
This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion performance and outperform the state-of-the-art conventional video codec Enhanced Compression Model (ECM) on scene dynamics sequences. The project page can be found at https://github.com/xyzysz/GNVDC.
comment: 5 pages, 5 figures
Suite-IN++: A FlexiWear BodyNet Integrating Global and Local Motion Features from Apple Suite for Robust Inertial Navigation
The proliferation of wearable technology has established multi-device ecosystems comprising smartphones, smartwatches, and headphones as critical enablers for ubiquitous pedestrian localization. However, traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness due to their reliance on a single-device setup. Therefore, a promising solution is to fully leverage existing wearable devices to form a flexiwear bodynet for robust and accurate pedestrian localization. This paper presents Suite-IN++, a deep learning framework for flexiwear bodynet-based pedestrian localization. Suite-IN++ integrates motion data from wearable devices on different body parts, using contrastive learning to separate global and local motion features. It fuses global features based on the data reliability of each device to capture overall motion trends and employs an attention mechanism to uncover cross-device correlations in local features, extracting motion details helpful for accurate localization. To evaluate our method, we construct a real-life flexiwear bodynet dataset, incorporating Apple Suite (iPhone, Apple Watch, and AirPods) across diverse walking modes and device configurations. Experimental results demonstrate that Suite-IN++ achieves superior localization accuracy and robustness, significantly outperforming state-of-the-art models in real-life pedestrian tracking scenarios.
comment: Accepted by TMC (Transactions on Mobile Computing) 2025
Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries CVPR 2025
Extracting depth information from photon-limited, defocused images is challenging because depth from defocus (DfD) relies on accurate estimation of defocus blur, which is fundamentally sensitive to image noise. We present a novel approach to robustly measure object depths from photon-limited images along the defocused boundaries. It is based on a new image patch representation, Blurry-Edges, that explicitly stores and visualizes a rich set of low-level patch information, including boundaries, color, and smoothness. We develop a deep neural network architecture that predicts the Blurry-Edges representation from a pair of differently defocused images, from which depth can be calculated using a closed-form DfD relation we derive. The experimental results on synthetic and real data show that our method achieves the highest depth estimation accuracy on photon-limited images compared to a broad range of state-of-the-art DfD methods.
comment: Accepted to CVPR 2025. Project page: https://blurry-edges.qiguo.org/
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation and also ensures training balance between understanding and generation tasks by applying the same compression ratio to images; 2) channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures; 3) a shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements; and 4) a mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
comment: Project Page: https://emma-umm.github.io/emma/
MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment AAAI 2026
Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.
comment: AAAI 2026
Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
comment: 11 pages, 7 figures, technical report (preprint)
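The core optimization loop described in the abstract above (start from an initial latent, then iteratively increase its cosine similarity with a target embedding) can be illustrated with a toy sketch. This is not the authors' implementation: the vectors are plain Python lists rather than real image or text embeddings, the Mahalanobis and Nearest-Neighbor constraints are omitted, and the names `cosine`, `cosine_grad`, and `optimize_latent` as well as the step size and iteration count are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_grad(v, t):
    """Analytic gradient of cosine(v, t) with respect to v."""
    nv = math.sqrt(sum(a * a for a in v))
    nt = math.sqrt(sum(b * b for b in t))
    c = cosine(v, t)
    return [b / (nv * nt) - c * a / (nv * nv) for a, b in zip(v, t)]

def optimize_latent(target, init, steps=200, lr=0.5):
    """Plain gradient ascent on cosine similarity (hypothetical hyperparameters)."""
    v = list(init)
    for _ in range(steps):
        g = cosine_grad(v, target)
        v = [a + lr * gi for a, gi in zip(v, g)]  # ascent step toward the target direction
    return v

target = [1.0, 2.0, -1.0, 0.5]   # stand-in for a text embedding
init = [1.0, -1.0, 0.5, 2.0]     # stand-in for the initial pseudo-token latent
latent = optimize_latent(target, init)
print(cosine(init, target), cosine(latent, target))
```

In a real pipeline the initial latent would be drawn at random and the objective would be evaluated through the model's encoders; the sketch only shows why unconstrained ascent drives the latent toward the direction of the text embedding, which is the baseline behavior the proposed constraints are meant to regularize.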
FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity and spatial localization, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is Band-Localized Activation (BLA), a novel activation designed for joint frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). Through structured frequency control and spatially localized responses, BLA effectively mitigates spectral bias and enhances training stability. The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform to compute energy scores and explicitly guide frequency information to the network, enabling precise frequency selection and adaptive band control. Our method consistently outperforms existing INRs in 2D image representation, as well as 3D shape reconstruction and novel view synthesis.
comment: Please visit our project page at https://cmlab-korea.github.io/FLAIR/
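As a rough illustration of how a discrete wavelet transform yields per-band energy scores, the sketch below applies one level of the Haar transform to a 1D signal and sums the squared coefficients in each band. The abstract does not specify the wavelet, the number of levels, or how the energies are turned into guidance, so the Haar choice and the names `haar_level` and `band_energies` are assumptions for illustration.

```python
import math

def haar_level(signal):
    """One level of the orthonormal Haar DWT: (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def band_energies(signal):
    """Energy (sum of squares) in the low-frequency and high-frequency bands."""
    approx, detail = haar_level(signal)
    return sum(a * a for a in approx), sum(d * d for d in detail)

print(band_energies([1.0, 1.0, 1.0, 1.0]))    # constant signal: energy in the low band
print(band_energies([1.0, -1.0, 1.0, -1.0]))  # alternating signal: energy in the high band
```

Because the transform is orthonormal, the two band energies add up to the energy of the input, which makes their ratio a simple indicator of how much high-frequency content a region contains.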
Event-Customized Image Generation
Customized image generation, i.e., generating customized images with user-specified concepts, has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works have further explored the customization of actions and interactions beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effectiveness is limited by the scarcity of "exactly the same" reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the "event" as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image to the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.
RealD$^2$iff: Bridging Real-World Gap in Robot Manipulation via Depth Diffusion
Robot manipulation in the real world is fundamentally constrained by the visual sim2real gap, where depth observations collected in simulation fail to reflect the complex noise patterns inherent to real sensors. In this work, inspired by the denoising capability of diffusion models, we invert the conventional perspective and propose a clean-to-noisy paradigm that learns to synthesize noisy depth, thereby bridging the visual sim2real gap through purely simulation-driven robotic learning. Building on this idea, we introduce RealD$^2$iff, a hierarchical coarse-to-fine diffusion framework that decomposes depth noise into global structural distortions and fine-grained local perturbations. To enable progressive learning of these components, we further develop two complementary strategies: Frequency-Guided Supervision (FGS) for global structure modeling and Discrepancy-Guided Optimization (DGO) for localized refinement. To integrate RealD$^2$iff seamlessly into imitation learning, we construct a pipeline that spans six stages. We provide comprehensive empirical and experimental validation demonstrating the effectiveness of this paradigm. RealD$^2$iff enables two key applications: (1) generating real-world-like depth to construct clean-noisy paired datasets without manual sensor data collection, and (2) achieving zero-shot sim2real robot manipulation, substantially improving real-world performance without additional fine-tuning.
comment: We are the author team of the paper "RealD$^2$iff: Bridging Real-World Gap in Robot Manipulation via Depth Diffusion". After self-examination, our team discovered inappropriate wording in the citation of related work, the introduction, and the contribution statement, which may affect the contribution of other related works. Therefore, we have decided to revise the paper and request its withdrawal
LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control, i.e., specifying object dynamics and camera trajectories, is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications. Code, models and data are available on our project page.
comment: Project Page: https://cyberiada.github.io/LAMP/
Enhanced Spatiotemporal Consistency for Image-to-LiDAR Data Pretraining
LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow
comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists
Large language models (LLMs) exhibit great potential in the design of analog integrated circuits (ICs) because of their strength in abstracting and generalizing knowledge. However, further development of LLM-based analog IC design relies heavily on textual descriptions of analog ICs, while existing analog ICs are mostly illustrated as image-based circuit diagrams rather than text-based netlists. Converting circuit diagrams to netlists helps LLMs enrich their knowledge of analog ICs. Nevertheless, previously proposed conversion frameworks face challenges in further application because of limited support for image styles and circuit elements. Effectively converting complex circuit diagrams into netlists therefore remains a challenging task. To this end, this paper constructs and open-sources a new dataset with rich styles of circuit diagrams as well as a balanced distribution of simple and complex analog ICs. A hybrid framework, named Image2Net, is proposed for practical conversion from circuit diagrams to netlists. The netlist edit distance (NED) is also introduced to precisely assess the difference between the converted netlists and the ground truth. On our benchmark, Image2Net achieves an 80.77% success rate, which is 34.62%-45.19% higher than previous works. Specifically, the proposed work achieves an average NED of 0.116, which is 62.1%-69.6% lower than the state of the art. Our datasets and benchmark are available at https://github.com/LAD021/ci2n_datasets.
comment: 10 pages, 12 figures, 6 tables
Switch-JustDance: Benchmarking Whole Body Motion Tracking Policies Using a Commercial Console Game
Recent advances in whole-body robot control have enabled humanoid and legged robots to perform increasingly agile and coordinated motions. However, standardized benchmarks for evaluating these capabilities in real-world settings, and in direct comparison to humans, remain scarce. Existing evaluations often rely on pre-collected human motion datasets or simulation-based experiments, which limit reproducibility, overlook hardware factors, and hinder fair human-robot comparisons. We present Switch-JustDance, a low-cost and reproducible benchmarking pipeline that leverages motion-sensing console games to evaluate robot whole-body control. Using Just Dance on the Nintendo Switch as a representative platform, Switch-JustDance converts in-game choreography into robot-executable motions through streaming, motion reconstruction, and motion retargeting modules, and enables users to evaluate controller performance through the game's built-in scoring system. We first validate the evaluation properties of Just Dance, analyzing its reliability, validity, sensitivity, and potential sources of bias. Our results show that the platform provides consistent and interpretable performance measures, making it a suitable tool for benchmarking embodied AI. Building on this foundation, we benchmark three state-of-the-art humanoid whole-body controllers on hardware and provide insights into their relative strengths and limitations.
Three Forensic Cues for JPEG AI Images
The JPEG standard was vastly successful. Currently, the first AI-based compression method, "JPEG AI", is being standardized. JPEG AI brings remarkable benefits: its images exhibit impressive image quality at bitrates that are an order of magnitude lower than images compressed with traditional JPEG. However, forensic analysis of JPEG AI has to be completely re-thought: forensic tools for traditional JPEG do not transfer to JPEG AI, and artifacts from JPEG AI are easily confused with artifacts from artificially generated images ("DeepFakes"). This creates a need for novel forensic approaches to detection and distinction of JPEG AI images. In this work, we make a first step towards a forensic JPEG AI toolset. We propose three cues for forensic algorithms for JPEG AI. These algorithms address three forensic questions. First, we show that the JPEG AI preprocessing introduces correlations in the color channels that do not occur in uncompressed images. Second, we show that repeated compression of JPEG AI images leads to diminishing distortion differences. This can be used to detect recompression, in a spirit similar to some classic JPEG forensics methods. Third, we show that the quantization of JPEG AI images in the latent space can be used to distinguish real images with JPEG AI compression from synthetically generated images. The proposed methods are interpretable for a forensic analyst, and we hope that they inspire further research in the forensics of AI-compressed images.
DualHash: A Stochastic Primal-Dual Algorithm with Theoretical Guarantee for Deep Hashing
Deep hashing converts high-dimensional feature vectors into compact binary codes, enabling efficient large-scale retrieval. A fundamental challenge in deep hashing stems from the discrete nature of quantization in generating the codes. W-type regularizations, such as $||z|-1|$, have been proven effective as they encourage variables toward binary values. However, existing methods often directly optimize these regularizations without convergence guarantees. While proximal gradient methods offer a promising solution, the coupling between W-type regularizers and neural network outputs results in composite forms that generally lack closed-form proximal solutions. In this paper, we present a stochastic primal-dual hashing algorithm, referred to as DualHash, that provides rigorous complexity bounds. Using Fenchel duality, we partially transform the nonconvex W-type regularization optimization into the dual space, which results in a proximal operator that admits closed-form solutions. We derive two algorithm instances: a momentum-accelerated version with $\mathcal{O}(\varepsilon^{-4})$ complexity and an improved $\mathcal{O}(\varepsilon^{-3})$ version using variance reduction. Experiments on three image retrieval databases demonstrate the superior performance of DualHash.
comment: 27 pages, 13 figures
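To make the W-type regularizer $||z|-1|$ mentioned above concrete, the following sketch evaluates it on a few values and shows that it vanishes exactly at the binary values -1 and +1. It illustrates only the penalty itself, not the primal-dual algorithm; the function names `w_regularizer`, `w_penalty`, and `binarize` are hypothetical, and the sign-based binarization is a common convention assumed here rather than taken from the abstract.

```python
def w_regularizer(z):
    """W-shaped penalty | |z| - 1 |: zero at z = -1 and z = +1, one at z = 0."""
    return abs(abs(z) - 1.0)

def w_penalty(code):
    """Sum of the W-type penalty over all entries of a relaxed (real-valued) code."""
    return sum(w_regularizer(z) for z in code)

def binarize(code):
    """Assumed final quantization step: take the sign of each entry."""
    return [1 if z >= 0 else -1 for z in code]

relaxed = [0.9, -1.2, 0.1, -0.5]
print(w_penalty(relaxed))   # larger when entries are far from -1 or +1
print(binarize(relaxed))    # [1, -1, 1, -1]
```

The penalty is nonconvex and non-smooth at 0 and at -1 and +1, which is consistent with the abstract's point that composing it with a network output generally leaves no closed-form proximal step in the primal form.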
FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
Expectation-Maximization as the Engine of Scalable Medical Intelligence
Large, high-quality, annotated datasets are the foundation of medical AI research, but constructing even a small, moderate-quality, annotated dataset can take years of effort from multidisciplinary teams. Although active learning can prioritize what to annotate, scaling up still requires extensive manual efforts to revise the noisy annotations. We formulate this as a missing-data problem and develop ScaleMAI, a framework in which data annotation and model development co-evolve through an Expectation-Maximization (EM) process. In this iterative process, the AI model automatically identifies and corrects the mistakes in annotations (Expectation), while the refined annotated data retrain the model to improve accuracy (Maximization). Going beyond the classical EM algorithm, ScaleMAI brings human experts into the loop to review annotations that cannot be adequately addressed by either the Expectation or the Maximization step (<5%). As a result, ScaleMAI progressively creates an annotated dataset of 47,315 CT scans (4.8x larger than the largest public dataset, PanTS) including 4,163,720 per-voxel annotations for benign/malignant tumors and 88 anatomical structures. ScaleMAI iteratively trains a model that exceeds human expert performance in tumor diagnosis (+7%), and outperforms models developed from smaller, moderate-quality datasets, with statistically significant gains in tumor detection (+10%) and segmentation (+14%) on two prestigious benchmarks.
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
comment: 21 pages, 7 figures
Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting WACV 2026
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/
comment: WACV 2026. Project page: https://chien90190.github.io/splannequin/
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Code: https://github.com/mbzuai-oryx/Video-R2
comment: Video-R2 Technical Report
Diffusion-based Adversarial Purification from the Perspective of the Frequency Domain
The diffusion-based adversarial purification methods attempt to drown adversarial perturbations into a part of isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency domain perspective, decomposing the image into amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image's amplitude spectrum with the corresponding parts of the adversarial image. For the phase spectrum, we project the phase of the estimated image into a designated range of the adversarial image's phase spectrum, focusing on the low frequencies. Empirical evidence from extensive experiments demonstrates that our method significantly outperforms most current defense methods.
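A minimal sketch of the amplitude step described in this abstract, assuming 2D grayscale NumPy arrays and a square low-frequency window. The function name and the `radius` parameter are hypothetical, and the phase projection step and the diffusion model itself are omitted, so this is an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def swap_low_freq_amplitude(estimate, adversarial, radius):
    """Replace the low-frequency amplitude of `estimate` with that of `adversarial`.

    Both inputs are 2D arrays of the same shape. `radius` (hypothetical) is the
    half-width, in frequency bins, of the centered square treated as
    "low frequency". The phase of `estimate` is left unchanged.
    """
    # Centered 2D spectra, so that low frequencies sit in the middle.
    est_f = np.fft.fftshift(np.fft.fft2(estimate))
    adv_f = np.fft.fftshift(np.fft.fft2(adversarial))

    amp = np.abs(est_f)
    phase = np.angle(est_f)

    # Boolean mask selecting the low-frequency square around the DC bin.
    h, w = estimate.shape
    cy, cx = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[max(0, cy - radius):cy + radius + 1,
         max(0, cx - radius):cx + radius + 1] = True

    # Take the adversarial image's amplitude inside the mask only.
    amp[mask] = np.abs(adv_f)[mask]

    mixed = amp * np.exp(1j * phase)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```

With `radius=0` only the DC bin is swapped, which changes the mean intensity of the estimate and nothing else; larger windows carry over progressively more of the adversarial image's coarse content.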
Explaining Object Detectors via Collective Contribution of Pixels
Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code will be publicly available soon.
comment: 11+20 pages, 21 figures, 11 tables. v2: updated version
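To make the distinction between individual and collective contributions concrete, here is a small sketch that computes exact Shapley values for a toy game. The three "patches" and the `toy_score` function are hypothetical stand-ins for a detector's score on masked inputs, and `pair_interaction` is a simple second-order difference, not necessarily the interaction index used in the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for the set function `value`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def pair_interaction(value, i, j):
    """Joint effect of i and j beyond their individual effects (empty context)."""
    return value({i, j}) - value({i}) - value({j}) + value(frozenset())

def toy_score(coalition):
    # Hypothetical detector score: patch "a" helps on its own,
    # while "b" and "c" only help when both are visible.
    score = 0.0
    if "a" in coalition:
        score += 0.4
    if "b" in coalition and "c" in coalition:
        score += 0.6
    return score

phi = shapley_values(["a", "b", "c"], toy_score)
```

In this toy game the Shapley values split the joint 0.6 evenly between "b" and "c", while the pairwise term shows that neither contributes anything alone; this is the kind of collective cue that a purely per-pixel attribution would not expose.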
Moyun: A Diffusion-Based Model for Style-Specific Chinese Calligraphy Generation
Although Chinese calligraphy generation has achieved style transfer, generating calligraphy by specifying the calligrapher, font, and character style remains challenging. To address this, we propose a new Chinese calligraphy generation model, 'Moyun', which replaces the U-Net in the diffusion model with Vision Mamba and introduces the TripleLabel control mechanism to achieve controllable calligraphy generation. The model was tested on our large-scale dataset 'Mobao' of over 1.9 million images, and the results demonstrate that 'Moyun' can effectively control the generation process and produce calligraphy in the specified style. Even for characters the calligrapher has never written, 'Moyun' can generate calligraphy that matches the style of the calligrapher.
Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
comment: work in progress
VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack
Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, more modalities bring more vulnerabilities that can be exploited by jailbreak attacks, which induce MLLMs to output harmful content. Because of the strong reasoning ability of MLLMs, previous jailbreak attacks have explored reasoning safety risks in the text modality, while similar threats have been largely overlooked in the visual modality. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated image, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency. A series of experiments demonstrates that VRSA achieves a higher attack success rate than state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must offer both stylistic diversity and fine-grained controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework to inspire new avenues of research in 3D city generation. Our project page: https://longhz140516.github.io/MajutsuCity/.
comment: 13 pages, 6 figures
GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations AAAI2026
Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users' locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.
comment: AAAI2026 Poster
3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition AAAI 2026
Deep neural networks have recently achieved notable progress in 3D point cloud recognition, yet their vulnerability to adversarial perturbations poses critical security challenges in practical deployments. Conventional defense mechanisms struggle to address the evolving landscape of multifaceted attack patterns. Through systematic analysis of existing defenses, we identify that their unsatisfactory performance primarily originates from an entangled feature space, where adversarial attacks can be performed easily. To this end, we present 3D-ANC, a novel approach that capitalizes on the Neural Collapse (NC) mechanism to orchestrate discriminative feature learning. In particular, NC describes the state in which last-layer features and classifier weights jointly evolve into a simplex equiangular tight frame (ETF) arrangement, establishing maximally separable class prototypes. However, leveraging this advantage in 3D recognition confronts two substantial challenges: (1) prevalent class imbalance in point cloud datasets, and (2) complex geometric similarities between object categories. To tackle these obstacles, our solution combines an ETF-aligned classification module with an adaptive training framework consisting of representation-balanced learning (RBL) and dynamic feature direction loss (FDL). 3D-ANC seamlessly empowers existing models to develop disentangled feature spaces despite the complexity in 3D data distribution. Comprehensive evaluations show that 3D-ANC significantly improves the robustness of models with various structures on two datasets. For instance, DGCNN's classification accuracy is elevated from 27.2% to 80.9% on ModelNet40 -- a 53.7% absolute gain that surpasses leading baselines by 34.0%.
comment: AAAI 2026
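The simplex equiangular tight frame (ETF) mentioned in this abstract has a standard closed-form construction, sketched below with NumPy. The function name, the random orthonormal basis, and the requirement `feat_dim >= num_classes` are assumptions of this sketch; how 3D-ANC actually builds or aligns its ETF classifier is not specified in the abstract.

```python
import numpy as np

def simplex_etf(num_classes, feat_dim, seed=0):
    """Return a (feat_dim, num_classes) matrix whose columns form a simplex ETF.

    Columns have unit norm and every pair of distinct columns has cosine
    similarity -1 / (num_classes - 1), the most separated configuration
    possible for that many equal-norm vectors.
    """
    if feat_dim < num_classes:
        raise ValueError("this construction needs feat_dim >= num_classes")
    k = num_classes
    rng = np.random.default_rng(seed)
    # Orthonormal columns (feat_dim x k) from a reduced QR decomposition.
    u, _ = np.linalg.qr(rng.standard_normal((feat_dim, k)))
    # Centering matrix: removes the mean direction shared by all classes.
    center = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * u @ center

prototypes = simplex_etf(num_classes=4, feat_dim=16)
gram = prototypes.T @ prototypes  # 1 on the diagonal, -1/3 elsewhere
```

A classifier whose weights are fixed to such prototypes assigns every class the same margin to every other class, which is the "maximally separable" property the abstract refers to.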
uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.
comment: Our project page can be found at https://dinyudin203.github.io/uCLIP-project/
Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings WACV2026
Inner interpretability is a promising field aiming to uncover the internal mechanisms of AI systems through scalable, automated methods. While significant research has been conducted on large language models, limited attention has been paid to applying inner interpretability to large-scale image tasks, focusing primarily on architectural and functional levels to visualize learned concepts. In this paper, we first present a conceptual framework that supports inner interpretability and multilevel analysis for large-scale image classification tasks. Specifically, we introduce the Bi-directional Interaction between Concept and Input Embeddings (Bi-ICE) module, which facilitates interpretability across the computational, algorithmic, and implementation levels. This module enhances transparency by generating predictions based on human-understandable concepts, quantifying their contributions, and localizing them within the inputs. Finally, we showcase enhanced transparency in image classification, measuring concept contributions, and pinpointing their locations within the inputs. Our approach highlights algorithmic interpretability by demonstrating the process of concept learning and its convergence.
comment: Accepted at IEEE WACV2026. The first two authors equally contributed to this work
Image and Video Processing
R2MF-Net: A Recurrent Residual Multi-Path Fusion Network for Robust Multi-directional Spine X-ray Segmentation
Accurate segmentation of spinal structures in X-ray images is a prerequisite for quantitative scoliosis assessment, including Cobb angle measurement, vertebral translation estimation and curvature classification. In routine practice, clinicians acquire coronal, left-bending and right-bending radiographs to jointly evaluate deformity severity and spinal flexibility. However, the segmentation step remains heavily manual, time-consuming and non-reproducible, particularly in low-contrast images and in the presence of rib shadows or overlapping tissues. To address these limitations, this paper proposes R2MF-Net, a recurrent residual multi-path encoder--decoder network tailored for automatic segmentation of multi-directional spine X-ray images. The overall design consists of a coarse segmentation network and a fine segmentation network connected in cascade. Both stages adopt an improved Inception-style multi-branch feature extractor, while a recurrent residual jump connection (R2-Jump) module is inserted into skip paths to gradually align encoder and decoder semantics. A multi-scale cross-stage skip (MC-Skip) mechanism allows the fine network to reuse hierarchical representations from multiple decoder levels of the coarse network, thereby strengthening the stability of segmentation across imaging directions and contrast conditions. Furthermore, a lightweight spatial-channel squeeze-and-excitation block (SCSE-Lite) is employed at the bottleneck to emphasize spine-related activations and suppress irrelevant structures and background noise. We evaluate R2MF-Net on a clinical multi-view radiograph dataset comprising 228 sets of coronal, left-bending and right-bending spine X-ray images with expert annotations.
Precise Liver Tumor Segmentation in CT Using a Hybrid Deep Learning-Radiomics Framework
Accurate three-dimensional delineation of liver tumors on contrast-enhanced CT is a prerequisite for treatment planning, navigation and response assessment, yet manual contouring is slow, observer-dependent and difficult to standardise across centres. Automatic segmentation is complicated by low lesion-parenchyma contrast, blurred or incomplete boundaries, heterogeneous enhancement patterns, and confounding structures such as vessels and adjacent organs. We propose a hybrid framework that couples an attention-enhanced cascaded U-Net with handcrafted radiomics and voxel-wise 3D CNN refinement for joint liver and liver-tumor segmentation. First, a 2.5D two-stage network with a densely connected encoder, sub-pixel convolution decoders and multi-scale attention gates produces initial liver and tumor probability maps from short stacks of axial slices. Inter-slice temporal consistency is then enforced by a simple three-slice refinement rule along the cranio-caudal direction, which restores thin and tiny lesions while suppressing isolated noise. Next, 728 radiomic descriptors spanning intensity, texture, shape, boundary and wavelet feature groups are extracted from candidate lesions and reduced to 20 stable, highly informative features via multi-strategy feature selection; a random forest classifier uses these features to reject false-positive regions. Finally, a compact 3D patch-based CNN derived from AlexNet operates in a narrow band around the tumor boundary to perform voxel-level relabelling and contour smoothing.
Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation
Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
From sparse recovery to plug-and-play priors, understanding trade-offs for stable recovery with generalized projected gradient descent
We consider the problem of recovering an unknown low-dimensional vector from noisy, underdetermined observations. We focus on the Generalized Projected Gradient Descent (GPGD) framework, which unifies traditional sparse recovery methods and modern approaches using learned deep projective priors. We extend previous convergence results to robustness to model and projection errors. We use these theoretical results to explore ways to better control stability and robustness constants. To reduce recovery errors due to measurement noise, we consider generalized back-projection strategies to adapt GPGD to structured noise, such as sparse outliers. To improve the stability of GPGD, we propose a normalized idempotent regularization for the learning of deep projective priors. We provide numerical experiments in the context of sparse recovery and image inverse problems, highlighting the trade-offs between identifiability and stability that can be achieved with such methods.
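As an illustration of the projected gradient descent template discussed in this abstract, the sketch below plugs a hard-thresholding projection into a generic loop, which gives classical iterative hard thresholding for sparse recovery. The function names, the step size `1 / ||A||_2^2`, and the fixed iteration count are assumptions of this sketch; a learned projector could in principle be passed as `project` instead, but nothing here reproduces the paper's specific algorithms or guarantees.

```python
import numpy as np

def hard_threshold(x, s):
    """Project onto s-sparse vectors by keeping the s largest-magnitude entries."""
    out = np.zeros_like(x)
    keep = np.argpartition(np.abs(x), -s)[-s:]
    out[keep] = x[keep]
    return out

def projected_gradient_descent(A, y, project, n_iter=200):
    """Iterate x <- project(x - step * A^T (A x - y)) starting from zero."""
    # Step 1/L with L the squared spectral norm of A (Lipschitz constant
    # of the gradient of 0.5 * ||A x - y||^2).
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = project(x - step * A.T @ (A @ x - y))
    return x

# Underdetermined toy problem: 40 noisy-free measurements of a 3-sparse vector.
rng = np.random.default_rng(0)
m, n, s = 40, 100, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[5, 37, 80]] = [1.5, -2.0, 1.0]
y = A @ x_true
x_hat = projected_gradient_descent(A, y, lambda v: hard_threshold(v, s))
```

With this step size the data-fit residual cannot increase from one iterate to the next, since each iterate minimizes a quadratic upper bound over the sparse set; whether the iterates reach `x_true` depends on the conditioning of `A`, which is the kind of trade-off the abstract analyzes.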
Affine Subspace Models and Clustering for Patch-Based Image Denoising
Image tile-based approaches are popular in many image processing applications such as denoising (e.g., non-local means). A key step in their use is grouping the images into clusters, which usually proceeds by iteratively splitting the images into clusters and fitting a model for the images in each cluster. Linear subspaces have emerged as a suitable model for tile clusters; however, they are not well matched to image patches, given that images are non-negative and thus not distributed around the origin in the tile vector space. We study the use of affine subspace models for the clusters to better match the geometric structure of the image tile vector space. We also present a simple denoising algorithm that relies on the affine subspace clustering model using least squares projection. We review several algorithmic approaches to solve the affine subspace clustering problem and show experimental results that highlight the performance improvements in clustering and denoising.
comment: Asilomar Conference on Signals, Systems, and Computers 2025
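The difference between a linear and an affine subspace model for a single cluster can be sketched as follows. The helper names and the SVD-based fit are assumptions of this sketch (the clustering step is omitted entirely), so it illustrates only the least squares projection idea rather than the paper's algorithm.

```python
import numpy as np

def fit_subspace(patches, dim, affine=True):
    """Fit a dim-dimensional subspace to rows of `patches` (shape (n, d)).

    With affine=True the subspace passes through the cluster mean;
    with affine=False it is forced through the origin (linear model).
    Returns (offset, basis) with basis of shape (d, dim).
    """
    offset = patches.mean(axis=0) if affine else np.zeros(patches.shape[1])
    _, _, vt = np.linalg.svd(patches - offset, full_matrices=False)
    return offset, vt[:dim].T

def project(patch, offset, basis):
    """Least squares projection of `patch` onto offset + span(basis)."""
    return offset + basis @ (basis.T @ (patch - offset))

# Toy cluster: non-negative patches lying on a line that misses the origin.
brightness = np.ones(4)
direction = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2.0)
t = np.linspace(-0.5, 0.5, 20)
clean = brightness + t[:, None] * direction

aff_offset, aff_basis = fit_subspace(clean, dim=1, affine=True)
lin_offset, lin_basis = fit_subspace(clean, dim=1, affine=False)

noisy = clean[0] + 0.05 * np.random.default_rng(0).standard_normal(4)
denoised = project(noisy, aff_offset, aff_basis)
```

Because the cluster sits away from the origin, the one-dimensional linear model spends its single direction on the mean brightness and cannot represent the variation within the cluster, whereas the affine model of the same dimension reproduces the clean patches exactly.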
Clinical Interpretability of Deep Learning Segmentation Through Shapley-Derived Agreement and Uncertainty Metrics
Segmentation is the identification of anatomical regions of interest, such as organs, tissue, and lesions, serving as a fundamental task in computer-aided diagnosis in medical imaging. Although deep learning models have achieved remarkable performance in medical image segmentation, the need for explainability remains critical for ensuring their acceptance and integration in clinical practice, despite the growing research attention in this area. Our approach explored the use of contrast-level Shapley values, a systematic perturbation of model inputs to assess feature importance. While other studies have investigated gradient-based techniques through identifying influential regions in imaging inputs, Shapley values offer a broader, clinically aligned approach, explaining how model performance is fairly attributed to certain imaging contrasts over others. Using the BraTS 2024 dataset, we generated rankings for Shapley values for four MRI contrasts across four model architectures. Two metrics were proposed from the Shapley ranking: agreement between model and "clinician" imaging ranking, and uncertainty quantified through Shapley ranking variance across cross-validation folds. Higher-performing cases (Dice > 0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: $r=-0.581$). These metrics provide clinically interpretable proxies for model reliability, helping clinicians better understand state-of-the-art segmentation models.
Missing Wedge Inpainting and Joint Alignment in Electron Tomography through Implicit Neural Representations
Electron tomography is a powerful tool for understanding the morphology of materials in three dimensions, but conventional reconstruction algorithms typically suffer from missing-wedge artifacts and data misalignment imposed by experimental constraints. Recently proposed supervised machine-learning-enabled reconstruction methods to address these challenges rely on training data and are therefore difficult to generalize across materials systems. We propose a fully self-supervised implicit neural representation (INR) approach using a neural network as a regularizer. Our approach enables fast inline alignment through pose optimization, missing wedge inpainting, and denoising of low dose datasets via model regularization using only a single dataset. We apply our method to simulated and experimental data and show that it produces high-quality tomograms from diverse and information limited datasets. Our results show that INR-based self-supervised reconstructions offer high fidelity reconstructions with minimal user input and preprocessing, and can be readily applied to a wide variety of materials samples and experimental parameters.
comment: 20 pages, 10 figures
Enhanced Chest Disease Classification Using an Improved CheXNet Framework with EfficientNetV2-M and Optimization-Driven Learning
The interpretation of chest X-rays is an important diagnostic task in clinical practice, especially in resource-limited settings where the shortage of radiologists contributes to delayed diagnosis and poor patient outcomes. Although the original CheXNet architecture has shown potential in automated analysis of chest radiographs, its DenseNet-121 backbone is computationally inefficient and performs poorly as a single-label classifier. To address these shortcomings, we propose an improved chest disease classification framework that relies on EfficientNetV2-M and incorporates stronger training techniques such as Automatic Mixed Precision training, AdamW, Cosine Annealing learning rate scheduling, and Exponential Moving Average regularization. We prepared a dataset of 18,080 chest X-ray images from three authoritative sources, representing five clinically significant categories: Cardiomegaly, COVID-19, Normal, Pneumonia, and Tuberculosis. To achieve statistical reliability and reproducibility, nine independent experimental runs were performed. The proposed architecture showed significant gains, with a mean test accuracy of 96.45 percent compared to 95.30 percent at baseline (p < 0.001), and the macro-averaged F1-score increased to 91.08 percent (p < 0.001). Critical infectious diseases showed near-perfect classification performance, with COVID-19 detection reaching 99.95 percent accuracy and Tuberculosis detection reaching 99.97 percent accuracy. Although the model has 6.8 times more parameters, the training time was reduced by 11.4 percent and performance stability was increased by 22.7 percent. This framework can serve as a decision-support tool for pandemic response, tuberculosis screening, and routine thoracic disease assessment in various healthcare facilities.
comment: 23 pages, 6 figures, 7 tables
Agreement Disagreement Guided Knowledge Transfer for Cross-Scene Hyperspectral Imaging
Knowledge transfer plays a crucial role in cross-scene hyperspectral imaging (HSI). However, existing studies often overlook the challenges of gradient conflicts and dominant gradients that arise during the optimization of shared parameters. Moreover, many current approaches fail to simultaneously capture both agreement and disagreement information, relying only on a limited shared subset of target features and consequently missing the rich, diverse patterns present in the target scene. To address these issues, we propose an Agreement Disagreement Guided Knowledge Transfer (ADGKT) framework that integrates both mechanisms to enhance cross-scene transfer. The agreement component includes GradVac, which aligns gradient directions to mitigate conflicts between source and target domains, and LogitNorm, which regulates logit magnitudes to prevent domination by a single gradient source. The disagreement component consists of a Disagreement Restriction (DiR) and an ensemble strategy, which capture diverse predictive target features and mitigate the loss of critical target information. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in achieving robust and balanced knowledge transfer across heterogeneous HSI scenes.
Tyche: Stochastic In-Context Learning for Medical Image Segmentation CVPR
Existing learning-based solutions to medical image segmentation have two important shortcomings. First, for most new segmentation tasks, a new model has to be trained or fine-tuned. This requires extensive resources and machine learning expertise, and is therefore often infeasible for medical researchers and clinicians. Second, most existing segmentation methods produce a single deterministic segmentation mask for a given image. In practice, however, there is often considerable uncertainty about what constitutes the correct segmentation, and different expert annotators will often segment the same image differently. We tackle both of these problems with Tyche, a model that uses a context set to generate stochastic predictions for previously unseen tasks without the need to retrain. Tyche differs from other in-context segmentation methods in two important ways. (1) We introduce a novel convolution block architecture that enables interactions among predictions. (2) We introduce in-context test-time augmentation, a new mechanism to provide prediction stochasticity. When combined with appropriate model design and loss functions, Tyche can predict a set of plausible diverse segmentation candidates for new or unseen medical images and segmentation tasks without the need to retrain.
comment: Accepted at IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) 2024 as a highlight. Code available at https://github.com/mariannerakic/tyche
A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation
Advances in architectural design, data availability, and compute have driven remarkable progress in semantic segmentation. Yet, these models often rely on relaxed Bayesian assumptions, omitting critical uncertainty information needed for robust decision-making. Despite growing interest in probabilistic segmentation to address point-estimate limitations, the research landscape remains fragmented. In response, this review synthesizes foundational concepts in uncertainty modeling, analyzing how feature- and parameter-distribution modeling impact four key segmentation tasks: Observer Variability, Active Learning, Model Introspection, and Model Generalization. Our work establishes a common framework by standardizing theory, notation, and terminology, thereby bridging the gap between method developers, task specialists, and applied researchers. We then discuss critical challenges, including the nuanced distinction between uncertainty types, strong assumptions in spatial aggregation, the lack of standardized benchmarks, and pitfalls in current quantification methods. We identify promising avenues for future research, such as uncertainty-aware active learning, data-driven benchmarks, transformer-based models, and novel techniques to move from simple segmentation problems to uncertainty in holistic scene understanding. Based on our analysis, we offer practical guidelines for researchers on method selection, evaluation, reproducibility, and meaningful uncertainty estimation. Ultimately, our goal is to facilitate the development of more reliable, efficient, and interpretable segmentation models that can be confidently deployed in real-world applications.
comment: TMLR
Anatomical basis of sex differences in the electrocardiogram identified by three-dimensional torso-heart imaging reconstruction pipeline
The electrocardiogram (ECG) is used for diagnosis and risk stratification in myocardial infarction (MI). Women have a higher incidence of missed MI diagnosis and complications following infarction, and to address this we aim to provide quantitative information on sex differences in ECG and torso-ventricular anatomical features and their interdependence. A novel computational automated pipeline is presented enabling the three-dimensional reconstruction of torso-ventricular anatomies for 425 post-MI subjects and 1051 healthy controls from UK Biobank clinical images. Regression models were created relating torso-ventricular and ECG parameters. We found that female hearts were positioned more posteriorly and superiorly than male hearts, and in MI hearts were oriented more horizontally, especially for women. Post-MI women exhibited less QRS prolongation, requiring 27% more prolongation than men to exceed 120 ms. Only half of the sex difference in QRS duration was associated with smaller female cavities. Lower STj amplitude in women was striking, associated with smaller ventricles, but also more superior and posterior cardiac position. Post-MI, T wave amplitude and R axis deviations were more strongly associated with posterior and horizontal cardiac positioning in women than in men. Our study highlights the need to quantify sex differences in anatomical features, their implications in ECG interpretation, and the application of clinical ECG thresholds in post-MI.
comment: Revision following reviewer comments
Stochastic Orthogonal Regularization for deep projective priors
Many crucial tasks of image processing and computer vision are formulated as inverse problems. Thus, it is of great importance to design fast and robust algorithms to solve these problems. In this paper, we focus on generalized projected gradient descent (GPGD) algorithms where generalized projections are realized with learned neural networks and provide state-of-the-art results for imaging inverse problems. Indeed, neural networks allow for projections onto unknown low-dimensional sets that model complex data, such as images. We call these projections deep projective priors. In generic settings, when the orthogonal projection onto a low-dimensional model set is used, it has been shown, under a restricted isometry assumption, that the corresponding orthogonal PGD converges with a linear rate, yielding near-optimal convergence (within the class of GPGD methods) in the classical case of sparse recovery. However, for deep projective priors trained with classical mean squared error losses, there is little guarantee that the hypotheses for linear convergence are satisfied. In this paper, we propose a stochastic orthogonal regularization of the training loss for deep projective priors. This regularization is motivated by our theoretical results: a sufficiently good approximation of the orthogonal projection guarantees linear stable recovery with performance close to orthogonal PGD. We show experimentally, using two different deep projective priors (based on autoencoders and on denoising networks), that our stochastic orthogonal regularization yields projections that improve convergence speed and robustness of GPGD in challenging inverse problem settings, in accordance with our theoretical findings.
A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
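For readers unfamiliar with the reported metric, the following is a minimal sketch of how PSNR is commonly computed. It assumes voxel intensities normalized to [0, 1] and flattened into lists; the challenge's exact evaluation protocol (masking, normalization) is not stated in the abstract, and the sample values are made up for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists.

    Assumption: intensities lie in [0, max_value].
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must contain the same number of voxels")
    # Mean squared error over all voxels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every voxel is off by 0.1, so the MSE is 0.01.
reference = [0.0, 0.5, 1.0, 0.5]
reconstruction = [0.1, 0.4, 0.9, 0.6]
value = psnr(reference, reconstruction)
```

Higher values indicate a reconstruction closer to the reference; an MSE of 0.01 on a unit intensity range corresponds to 20 dB.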
Tokenizing Motion: A Generative Approach for Scene Dynamics Compression
This paper proposes a novel generative video compression framework that leverages motion pattern priors, derived from subtle dynamics in common scenes (e.g., swaying flowers or a boat drifting on water), rather than relying on video content priors (e.g., talking faces or human bodies). These compact motion priors enable a new approach to ultra-low bitrate communication while achieving high-quality reconstruction across diverse scene contents. At the encoder side, motion priors can be streamlined into compact representations via a dense-to-sparse transformation. At the decoder side, these priors facilitate the reconstruction of scene dynamics using an advanced flow-driven diffusion model. Experimental results illustrate that the proposed method can achieve superior rate-distortion performance and outperform the state-of-the-art conventional video codec Enhanced Compression Model (ECM) on scene dynamics sequences. The project page can be found at https://github.com/xyzysz/GNVDC.
comment: 5 pages, 5 figures
Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists
Large Language Models (LLMs) exhibit great potential in the design of analog integrated circuits (ICs) because of their strength in abstracting and generalizing knowledge. However, further development of LLM-based analog IC design relies heavily on textual descriptions of analog ICs, while existing analog ICs are mostly illustrated as image-based circuit diagrams rather than text-based netlists. Converting circuit diagrams to netlists helps LLMs enrich their knowledge of analog ICs. Nevertheless, previously proposed conversion frameworks face challenges in further application because of limited support for image styles and circuit elements. Effectively converting complex circuit diagrams into netlists therefore remains a challenging task. To this end, this paper constructs and open-sources a new dataset with rich styles of circuit diagrams as well as a balanced distribution of simple and complex analog ICs. A hybrid framework, named Image2Net, is proposed for practical conversion from circuit diagrams to netlists. The netlist edit distance (NED) is also introduced to precisely assess the difference between the converted netlists and the ground truth. Based on our benchmark, Image2Net achieves an 80.77% success rate, which is 34.62%-45.19% higher than previous works. Specifically, the proposed work shows an average NED of 0.116, which is 62.1%-69.6% lower than state-of-the-art methods. Our datasets and benchmark are available at https://github.com/LAD021/ci2n_datasets.
comment: 10 pages, 12 figures, 6 tables
Deep Feature-specific Imaging
Modern photon-counting sensors are increasingly dominated by Poisson noise, yet conventional Feature-Specific Imaging (FSI) is optimized for additive Gaussian noise, leading to suboptimal performance and a loss of its advantages under Poisson noise. To address this, we introduce DeepFSI, a novel end-to-end optical-electronic framework. DeepFSI "unfreezes" traditional FSI masks, enabling a deep neural network to learn globally optimal measurement masks by computing gradients directly under realistic Poisson and additive noise conditions. Our simulations demonstrate DeepFSI's superior feature fidelity and task performance compared to conventional FSI with predefined masks, especially in Poisson-noise-dominant environments. DeepFSI also exhibits enhanced robustness to design choices and performs well under additive Gaussian noise, representing a significant advance for noise-robust computational imaging in photon-limited applications.
General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood
The rapid advancement of generative models, particularly diffusion-based methods, has significantly improved the realism of synthetic images. As new generative models continuously emerge, detecting generated images remains a critical challenge. While fully supervised and few-shot methods have been proposed, maintaining an updated dataset is time-consuming and challenging. Consequently, zero-shot methods have gained increasing attention in recent years. We find that existing zero-shot methods often struggle to adapt to specific image domains, such as artistic images, limiting their real-world applicability. In this work, we introduce CLIDE, a novel zero-shot detection method based on conditional likelihood approximation. Our approach computes likelihoods conditioned on real images, enabling adaptation across diverse image domains. We extensively evaluate CLIDE, demonstrating SOTA performance on a large-scale general dataset and significantly outperforming existing methods in domain-specific cases. These results demonstrate the robustness of our method and underscore the need for broad, domain-aware generalization in the AI-generated image detection task. Code is available at https://tinyurl.com/clide-detector.
comment: 8 pages, 6 figures, accepted to WACV 2026
Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction
Monocular height estimation (MHE) from very-high-resolution (VHR) optical imagery remains challenging due to limited structural cues and the high cost and geographic constraints of conventional elevation data such as airborne LiDAR and multi-view stereo. Although recent MHE and monocular depth estimation (MDE) models show strong performance, their robustness under varied illumination and scene conditions is still limited. We introduce a fully automated correction pipeline that integrates sparse, imperfect global LiDAR measurements from ICESat-2 with deep learning predictions to enhance accuracy and stability. The workflow relies entirely on publicly available models and data and requires only a single georeferenced optical image to produce corrected height maps, enabling low-cost and globally scalable deployment. We also establish the first benchmark for this task, evaluating two random-forest-based approaches, four parameter-efficient fine-tuning methods, and full fine-tuning. Experiments across six diverse regions at 0.5 m resolution (297 km²), covering the urban cores of Tokyo, Paris, and Sao Paulo as well as suburban and forested areas, show substantial gains. The best method reduces the MHE model's mean absolute error (MAE) by 30.9 percent and improves its F1HE score by 44.2 percent. For the MDE model, MAE improves by 24.1 percent and the F1HE score by 25.1 percent. These results validate the effectiveness of our correction pipeline and demonstrate how sparse global LiDAR can systematically strengthen both MHE and MDE models, enabling scalable and widely accessible 3D height mapping.
comment: Accepted for publication in the ISPRS Journal of Photogrammetry and Remote Sensing
Multimedia
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
This paper presents a simple method for enhancing textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that outputs long sequences of tokens from a large vocabulary, making them difficult to integrate at low cost into a large language model. By applying a simple lasso-based feature selection on a multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve performance compared to a unimodal model, to a bigger SpeechLM, or to integrating audio via a learned representation. We show the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available at https://github.com/salocinc/EACL26SpeechTokFallacy/.
Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
Computation and Language
Do Generalisation Results Generalise?
A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated generalisation performances are once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
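The partial-correlation analysis described in this abstract can be sketched as follows: regress each OOD score series on the in-domain series and correlate the residuals. This is a minimal illustration with a single control variable and ordinary least squares; the accuracy values are invented, and the authors' actual estimator and checkpoints may differ.

```python
import math


def mean(values):
    return sum(values) / len(values)


def residuals(y, z):
    """Residuals of the least-squares fit y ~ a + b * z."""
    mz, my = mean(z), mean(y)
    slope = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
             / sum((zi - mz) ** 2 for zi in z))
    return [(yi - my) - slope * (zi - mz) for zi, yi in zip(z, y)]


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def partial_correlation(x, y, z):
    """Correlation of x and y after regressing z out of both."""
    return pearson(residuals(x, z), residuals(y, z))


# Hypothetical accuracies at five checkpoints of one finetuning run.
in_domain = [0.50, 0.60, 0.70, 0.80, 0.90]
ood_a = [0.32, 0.36, 0.54, 0.56, 0.72]
ood_b = [0.18, 0.34, 0.36, 0.54, 0.58]

raw = pearson(ood_a, ood_b)
controlled = partial_correlation(ood_a, ood_b, in_domain)
```

In this toy data both OOD series rise with in-domain accuracy, so the raw correlation is strongly positive, while the partial correlation is negative: once in-domain performance is controlled for, gains on one testset coincide with losses on the other.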
Group Representational Position Encoding
We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,\omega\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
comment: Project Page: https://github.com/model-architectures/GRAPE
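The closed-form exponential and the relative law mentioned in the GRAPE abstract can be checked numerically. The sketch below assumes the generator takes the form L = a b^T - b a^T for orthonormal vectors a and b (an illustrative parameterization, not necessarily the paper's); for such L we have L^3 = -L, so exp(theta L) = I + sin(theta) L + (1 - cos(theta)) L^2. All helper names are hypothetical.

```python
import math


def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]


def axpy(A, B, alpha):
    """Return A + alpha * B for equally sized matrices."""
    return [[x + alpha * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def rotation(n, omega, a, b):
    """G(n) = exp(n * omega * L) for the assumed L = a b^T - b a^T."""
    theta = n * omega
    L = axpy(outer(a, b), outer(b, a), -1.0)
    G = axpy(identity(len(a)), L, math.sin(theta))
    return axpy(G, matmul(L, L), 1.0 - math.cos(theta))


def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


def norm(v):
    return math.sqrt(sum(x * x for x in v))


a = [0.6, 0.8, 0.0, 0.0]  # orthonormal pair spanning the rotation plane
b = [0.0, 0.0, 1.0, 0.0]
omega = 0.3

# Relative law: G(m)^T G(n) depends only on the offset n - m.
lhs = matmul(transpose(rotation(2, omega, a, b)), rotation(5, omega, a, b))
relative_error = max_abs_diff(lhs, rotation(3, omega, a, b))

# Norm preservation: applying G(n) does not change vector length.
x = [1.0, -2.0, 0.5, 3.0]
y = [sum(g * xi for g, xi in zip(row, x)) for row in rotation(5, omega, a, b)]
norm_error = abs(norm(y) - norm(x))
```

Both errors are at floating-point precision, which matches the abstract's description of the map as relative and norm-preserving.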
Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support
LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench.
comment: 11 pages, 3 tables, 4 figures
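The idea of variance-aware, multi-run reporting can be illustrated with a short sketch. It assumes a normal-approximation 95% confidence interval over independent runs; the benchmark's actual protocol and statistics are not specified in the abstract, and the scores below are invented for illustration.

```python
import math
import statistics


def summarize_runs(scores, z=1.96):
    """Return the mean and a normal-approximation confidence interval.

    Assumption: runs are independent; z=1.96 gives a roughly 95% interval.
    """
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)


# Two hypothetical strategies with the same average solve rate over five runs.
runs_a = [0.70, 0.72, 0.68, 0.71, 0.69]
runs_b = [0.60, 0.80, 0.62, 0.78, 0.70]

mean_a, ci_a = summarize_runs(runs_a)
mean_b, ci_b = summarize_runs(runs_b)
width_a = ci_a[1] - ci_a[0]
width_b = ci_b[1] - ci_b[0]
```

A single-run report would rank the two strategies arbitrarily, while the interval widths show that the second one is far less stable despite an identical mean.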
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
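For context on the pass@128 metric, the sketch below shows the commonly used unbiased pass@k estimator: given n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the authors use exactly this estimator is an assumption; the abstract only names the metric.

```python
import math


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples containing c correct ones, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


none_correct = pass_at_k(n=256, c=0, k=128)
one_of_four = pass_at_k(n=4, c=1, k=2)
rare_success = pass_at_k(n=256, c=1, k=128)
```

With a large k such as 128, even a single correct sample among 256 yields a pass@k of 0.5, which is why the metric is used to probe whether a capability exists at all rather than how reliably it is exercised.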
Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?
Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs' internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLMs' understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs do not have a complete grasp on narrative coherence.
Automated Generation of Custom MedDRA Queries Using SafeTerm Medical Map
In pre-market drug safety review, grouping related adverse event terms into standardised MedDRA queries or the FDA Office of New Drugs Custom Medical Queries (OCMQs) is critical for signal detection. We present a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against the FDA OCMQ v3.0 (104 queries), restricted to valid MedDRA PTs. Precision, recall and F1 were computed across similarity-thresholds. High recall (>95%) is achieved at moderate thresholds. Higher thresholds improve precision (up to 86%). The optimal threshold (~0.70 - 0.75) yielded recall ~50% and precision ~33%. Narrow-term PT subsets performed similarly but required slightly higher similarity thresholds. The SafeTerm AI-driven system provides a viable supplementary method for automated MedDRA query generation. A similarity threshold of ~0.60 is recommended initially, with increased thresholds for refined term selection.
comment: 12 pages, 4 figures
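The retrieval-and-threshold step described in the SafeTerm abstract can be sketched as cosine-similarity ranking followed by precision, recall, and F1 against a reference term set. The two-dimensional embeddings and the term names below are hypothetical placeholders, and the extreme-value clustering step mentioned in the abstract is omitted.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_terms(query_vec, term_vecs, threshold):
    """Terms whose similarity to the query meets the threshold, best first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in term_vecs.items()]
    kept = [(name, s) for name, s in scored if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def precision_recall_f1(selected, reference):
    true_pos = len(set(selected) & set(reference))
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)


# Toy embeddings (assumption: real embeddings are high-dimensional).
query = [1.0, 0.0]
terms = {
    "term_a": [1.0, 0.0],
    "term_b": [1.0, 1.0],
    "term_c": [0.0, 1.0],
    "term_d": [1.0, 0.5],
    "term_e": [1.0, 1.2],
}
reference = {"term_a", "term_d", "term_e"}

loose = [name for name, _ in rank_terms(query, terms, threshold=0.6)]
strict = [name for name, _ in rank_terms(query, terms, threshold=0.8)]
p_loose, r_loose, f_loose = precision_recall_f1(loose, reference)
p_strict, r_strict, f_strict = precision_recall_f1(strict, reference)
```

Raising the threshold from 0.6 to 0.8 in this toy case trades recall for precision, which mirrors the qualitative behaviour the abstract reports across similarity thresholds.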
HalluShift++: Bridging Language and Vision through Internal Representation Shifts for Hierarchical Hallucinations in MLLMs
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding tasks. While these models typically produce linguistically coherent output, they often suffer from hallucinations, generating descriptions that are factually inconsistent with the visual content, potentially leading to adverse consequences. Therefore, the assessment of hallucinations in MLLMs has become increasingly crucial in the model development process. Contemporary methodologies predominantly depend on external LLM evaluators, which are themselves susceptible to hallucinations and may present challenges in terms of domain adaptation. In this study, we propose the hypothesis that hallucination manifests as measurable irregularities within the internal layer dynamics of MLLMs, not merely due to distributional shifts but also in the context of layer-wise analysis of specific assumptions. By incorporating such modifications, HalluShift++ broadens the efficacy of hallucination detection from text-based large language models (LLMs) to encompass multimodal scenarios. Our codebase is available at https://github.com/C0mRD/HalluShift_Plus.
When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks
Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automated detection, the performance of existing approaches often remains limited in both accuracy and efficiency. To address this challenge, we propose a Graph Neural Network (GNN) framework for detecting three types of uncivil behavior (i.e., toxicity, aggression, and personal attacks) within the English Wikipedia community. Our model represents each user comment as a node, with textual similarity between comments defining the edges, allowing the network to jointly learn from both linguistic content and relational structures among comments. We also introduce a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation. Empirical evaluations demonstrate that our proposed architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics while requiring significantly lower inference cost. These findings highlight the crucial role of structural context in detecting online incivility and address the limitations of text-only LLM paradigms in behavioral prediction. All datasets and comparative outputs will be publicly available in our repository to support further research and reproducibility.
comment: 10 pages
Bridging Code Graphs and Large Language Models for Better Code Understanding
Large Language Models (LLMs) have demonstrated remarkable performance in code intelligence tasks such as code generation, summarization, and translation. However, their reliance on linearized token sequences limits their ability to understand the structural semantics of programs. While prior studies have explored graph-augmented prompting and structure-aware pretraining, they either suffer from prompt length constraints or require task-specific architectural changes that are incompatible with large-scale instruction-following LLMs. To address these limitations, this paper proposes CGBridge, a novel plug-and-play method that enhances LLMs with Code Graph information through an external, trainable Bridge module. CGBridge first pre-trains a code graph encoder via self-supervised learning on a large-scale dataset of 270K code graphs to learn structural code semantics. It then trains an external module to bridge the modality gap among code, graph, and text by aligning their semantics through cross-modal attention mechanisms. Finally, the bridge module generates structure-informed prompts, which are injected into a frozen LLM, and is fine-tuned for downstream code intelligence tasks. Experiments show that CGBridge achieves notable improvements over both the original model and the graph-augmented prompting method. Specifically, it yields a 16.19% and 9.12% relative gain in LLM-as-a-Judge on code summarization, and a 9.84% and 38.87% relative gain in Execution Accuracy on code translation. Moreover, CGBridge achieves over 4x faster inference than LoRA-tuned models, demonstrating both effectiveness and efficiency in structure-aware code understanding.
PCMind-2.1-Kaiyuan-2B Technical Report
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
Metric-Fair Prompting: Treating Similar Samples Similarly
We introduce Metric-Fair Prompting, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each (question, option) pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote individual fairness (treating similar instances similarly), we compute question similarity using NLP embeddings and solve items in joint pairs of similar questions rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each (question, option) pair to a score $f(x)$ that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
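A Lipschitz-style constraint of the kind described in this abstract requires |f(x) - f(x')| <= L * d(x, x') for every pair of inputs. The sketch below checks that condition after the fact on toy data; the item identifiers, scores, distances, and the constant L are all hypothetical, and in the paper the constraint is imposed through the prompt rather than verified by code.

```python
def lipschitz_violations(scores, distances, lipschitz_constant=1.0):
    """Return the pairs whose score gap exceeds L times their distance.

    scores: dict mapping item id -> confidence score f(x)
    distances: dict mapping (id_1, id_2) -> distance d(x, x'), e.g. one minus
        an embedding similarity (an assumption made for this illustration)
    """
    violations = []
    for (first, second), dist in distances.items():
        gap = abs(scores[first] - scores[second])
        if gap > lipschitz_constant * dist:
            violations.append((first, second))
    return violations


# Toy confidence scores for three (question, option) instances.
scores = {"q1": 0.90, "q2": 0.85, "q3": 0.20}
distances = {("q1", "q2"): 0.1, ("q1", "q3"): 0.9, ("q2", "q3"): 0.5}

violations = lipschitz_violations(scores, distances, lipschitz_constant=1.0)
```

Here only the pair (q2, q3) violates the bound: the two items are fairly close (distance 0.5) yet their scores differ by 0.65, which is the kind of inconsistency the constraint is meant to discourage.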
Complementary Learning Approach for Text Classification using Large Language Models
In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through chain-of-thought and few-shot prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses of LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).
comment: 67 pages
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
This paper presents a simple method for enhancing textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that outputs long sequences of tokens from a large vocabulary, making it difficult to integrate at low cost into a large language model. By applying a simple lasso-based feature selection on a multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show that this improves performance compared to a unimodal model, to a bigger SpeechLM, or to integrating audio via a learned representation. We show the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHal-Bench benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. Validation is currently limited to Qwen2.5-VL-7B [23]; we plan to extend it across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
comment: 24 pages, 3 figures, 2 tables. Training-free self-correction framework for vision-language models. Code and implementation details will be released at: https://github.com/kassoumsanogo1/self-correcting-vlm-re-Attention.git
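Of the four uncertainty signals listed in the abstract above, token entropy is the simplest to illustrate. The sketch below computes Shannon entropy per decoding step and averages it over a response; the function names are hypothetical, and how the authors weight or combine this signal with the other three is not stated in the abstract and is not reproduced here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_token_entropy(step_distributions):
    """Average entropy over all decoding steps of a generated response."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# Toy distributions (hypothetical): one confident step, one maximally uncertain step.
confident_step = [1.0, 0.0, 0.0, 0.0]
uncertain_step = [0.25, 0.25, 0.25, 0.25]
response_entropy = mean_token_entropy([confident_step, uncertain_step])
```

A high average would mark the response as a candidate for another refinement pass; the threshold for that decision is left open here.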
Performance of the SafeTerm AI-Based MedDRA Query System Against Standardised MedDRA Queries
In pre-market drug safety review, grouping related adverse event terms into SMQs or OCMQs is critical for signal detection. We assess the performance of SafeTerm Automated Medical Query (AMQ) on MedDRA SMQs. The AMQ is a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score (0-1) using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against tier-1 SMQs (110 queries, v28.1). Precision, recall and F1 were computed at multiple similarity thresholds, defined either manually or using an automated method. High recall (94%) is achieved at moderate similarity thresholds, indicative of good retrieval sensitivity. Higher thresholds filter out more terms, resulting in improved precision (up to 89%). The optimal threshold (0.70) yielded an overall recall of 48% and precision of 45% across all 110 queries. Restricting to narrow-term PTs achieved slightly better performance at an increased (+0.05) similarity threshold, confirming increased relatedness of narrow versus broad terms. The automatic threshold (0.66) selection prioritizes recall (0.58) over precision (0.29). SafeTerm AMQ achieves comparable, satisfactory performance on SMQs and sanitized OCMQs. It is therefore a viable supplementary method for automated MedDRA query generation, balancing recall and precision. We recommend using suitable MedDRA PT terminology in query formulation and applying the automated threshold method to optimise recall. Increasing similarity scores allows refined, narrow term selection.
comment: 8 pages, 3 figures
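The threshold trade-off described in the SafeTerm abstract above (higher thresholds filter out more terms and raise precision at the cost of recall) can be sketched as follows. This is a minimal illustration with made-up similarity scores and a made-up reference set; it is not the SafeTerm system, and the term names are used only as placeholders.

```python
def precision_recall_f1(scored_terms, reference_terms, threshold):
    """Score the terms retrieved at similarity >= threshold against a reference set."""
    retrieved = {term for term, score in scored_terms.items() if score >= threshold}
    true_positives = len(retrieved & reference_terms)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(reference_terms) if reference_terms else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical similarity scores for candidate terms and a hypothetical reference query.
scored_terms = {"Nausea": 0.91, "Vomiting": 0.78, "Headache": 0.52, "Rash": 0.40}
reference_terms = {"Nausea", "Vomiting", "Retching"}

moderate = precision_recall_f1(scored_terms, reference_terms, threshold=0.50)
strict = precision_recall_f1(scored_terms, reference_terms, threshold=0.70)
```

In this toy case, moving the threshold from 0.50 to 0.70 drops the unrelated term and raises precision while recall stays the same; on real data recall would typically fall as the threshold rises.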
MoCoRP: Modeling Consistent Relations between Persona and Response for Persona-based Dialogue
As dialogue systems become increasingly important across various domains, a key challenge in persona-based dialogue is generating engaging and context-specific interactions while ensuring the model acts with a coherent personality. However, existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, which makes it difficult for models to effectively capture persona information. To address these issues, we propose MoCoRP (Modeling Consistent Relations between Persona and Response), a framework that incorporates explicit relations into language models. MoCoRP leverages an NLI expert to explicitly extract the NLI relations between persona sentences and responses, enabling the model to effectively incorporate appropriate persona information from the context into its responses. We applied this framework to pre-trained models like BART and further extended it to modern large language models (LLMs) through alignment tuning. Experimental results on the public datasets ConvAI2 and MPChat demonstrate that MoCoRP outperforms existing baselines, achieving superior persona consistency and engaging, context-aware dialogue generation. Furthermore, our model not only excels in quantitative metrics but also shows significant improvements in qualitative aspects. These results highlight the effectiveness of explicitly modeling persona-response relations in persona-based dialogue. The source codes of MoCoRP are available at https://github.com/DMCB-GIST/MoCoRP.
comment: 18 pages
Most over-representation of phonological features in basic vocabulary disappears when controlling for spatial and phylogenetic effects
The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubts on the robustness of the results. In this study, we test the robustness of a recent study on sound symbolism of basic vocabulary concepts which analyzed 245 languages. The new sample includes data on 2864 languages from Lexibank. We modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerges as highly stable even with the new sample. Through the new analysis, we are able to assess the distribution of sound symbolism on a larger scale than previously. The study further highlights the need for testing all universal claims on language for robustness on various levels.
comment: Accepted with minor revisions at *Linguistic Typology*, expected to be fully published in 2026
Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation
Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (MAP), assuming that model-estimated probabilities are perfectly correlated with similarity to human annotation. However, we observed that annotations dissimilar to the human annotation could achieve a higher model likelihood than the human annotation. We address this issue by applying Minimum Bayes Risk (MBR) decoding to generative ESD models. Specifically, we employ sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to the human annotation. Extensive experimental results show that our MBR decoding outperforms the MAP baseline at the system, sentence, and span levels. Furthermore, to mitigate the computational cost of MBR decoding, we demonstrate that applying MBR distillation enables a standard greedy model to match MBR decoding performance, effectively eliminating the inference-time latency bottleneck.
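The MBR selection step described in the abstract above can be sketched in a few lines: rather than taking the most likely candidate, pick the one with the highest average utility against the other sampled candidates, which stand in for the unknown human annotation. The character-overlap F1 used as the utility here is an assumption chosen for illustration; the abstract only says that sentence- and span-level similarity metrics are used.

```python
def span_f1(hyp_spans, ref_spans):
    """Character-overlap F1 between two lists of (start, end) error spans."""
    hyp_chars = {i for start, end in hyp_spans for i in range(start, end)}
    ref_chars = {i for start, end in ref_spans for i in range(start, end)}
    if not hyp_chars and not ref_chars:
        return 1.0  # both annotations say "no errors"
    overlap = len(hyp_chars & ref_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility):
    """Return the candidate with the highest mean utility against all other candidates."""
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, other) for other in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best

# Hypothetical sampled annotations for one translation, each a list of error spans.
candidates = [[(0, 5)], [(0, 4)], [(10, 12)], [(0, 5), (10, 11)]]
chosen = mbr_select(candidates, span_f1)
```

The pairwise comparison costs a number of utility calls that grows quadratically with the number of candidates, which is the overhead the abstract's distillation step is meant to remove.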
SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents
Recognizing semantic differences across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.
comment: 30 pages
Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.
comment: 20 pages, 6 figures, under review
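The real and imaginary components mentioned in the RoPE abstract above can be made concrete for a single two-dimensional pair treated as a complex number. The sketch below shows that the complex product of a rotated query with the conjugate of a rotated key depends only on the relative offset, and that its real part equals the ordinary dot product used by standard RoPE attention. How the paper combines the two components into its dual-component score is not stated in the abstract and is not reproduced here; all names and values are illustrative.

```python
import cmath

def rotate(z, position, theta):
    """Apply the RoPE rotation to one 2-D pair represented as a complex number."""
    return z * cmath.exp(1j * position * theta)

def complex_score(q, k, m, n, theta):
    """Complex product of the query rotated to position m and the conjugated key at n."""
    return rotate(q, m, theta) * rotate(k, n, theta).conjugate()

# Toy query/key pair and rotation frequency (hypothetical values).
q, k, theta = complex(1.0, 2.0), complex(0.5, -1.0), 0.1

score = complex_score(q, k, m=7, n=3, theta=theta)
real_part = score.real  # equals the dot product of the two rotated 2-D vectors
imag_part = score.imag  # the component the abstract describes as discarded
```

Because both parts are functions of the offset `m - n` only, keeping the imaginary part preserves the relative-position property of RoPE.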
LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
SPAD: Seven-Source Token Probability Attribution with Syntactic Aggregation for Detecting Hallucinations in RAG
Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspective is incomplete, failing to account for the impact of other components in the generative process, such as the user query, previously generated tokens, the current token itself, and the final LayerNorm adjustment. To address this, we introduce SPAD. First, we mathematically attribute each token's probability into seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the current token. Then, we aggregate these scores by POS tags to quantify how different components drive specific linguistic categories. By identifying anomalies, such as Nouns relying on Final LayerNorm, SPAD effectively detects hallucinations. Extensive experiments demonstrate that SPAD achieves state-of-the-art performance.
Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
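The gradient-degradation issue named in the abstract above follows directly from how group-relative advantages are computed: each reward is normalized by the statistics of its own rollout group, so a group with identical rewards contributes nothing. A minimal sketch, assuming mean/standard-deviation normalization with a small epsilon (the use of the population standard deviation and the epsilon value are assumptions for illustration):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# All rollouts succeed: every advantage is zero, so the group gives no learning signal.
uniform_group = group_relative_advantages([1, 1, 1, 1])

# Mixed outcomes: successful rollouts get positive advantage, failed ones negative.
mixed_group = group_relative_advantages([1, 0, 0, 1])
```

With binary 0-1 rewards, groups that are all-correct or all-wrong are common, which is why the abstract pairs denser reward shaping with replacing such low-value samples.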
Living the Novel: A System for Generating Self-Training Timeline-Aware Conversational Agents from Novels
We present the Living Novel, an end-to-end system that transforms any literary work into an immersive, multi-character conversational experience. This system is designed to solve two fundamental challenges for LLM-driven characters. Firstly, generic LLMs suffer from persona drift, often failing to stay in character. Secondly, agents often exhibit abilities that extend beyond the constraints of the story's world and logic, leading to both narrative incoherence (spoiler leakage) and robustness failures (frame-breaking). To address these challenges, we introduce a novel two-stage training pipeline. Our Deep Persona Alignment (DPA) stage uses data-free reinforcement finetuning to instill deep character fidelity. Our Coherence and Robustness Enhancing (CRE) stage then employs a story-time-aware knowledge graph and a second retrieval-grounded training pass to architecturally enforce these narrative constraints. We validate our system through a multi-phase evaluation using Jules Verne's Twenty Thousand Leagues Under the Sea. A lab study with a detailed ablation of system components is followed by a 5-day in-the-wild diary study. Our DPA pipeline helps our specialized model outperform GPT-4o on persona-specific metrics, and our CRE stage achieves near-perfect performance in coherence and robustness measures. Our study surfaces practical design guidelines for AI-driven narrative systems: we find that character-first self-training is foundational for believability, while explicit story-time constraints are crucial for sustaining coherent, interruption-resilient mobile-web experiences.
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning
The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini, originally a monolingual English model, can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a unique "warm-up" stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard on Hugging Face. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.
Training Language Models to Use Prolog as a Tool
Ensuring reliable tool use is critical for safe agentic AI systems. Language models frequently produce unreliable reasoning with plausible but incorrect solutions that are difficult to verify. To address this, we investigate fine-tuning models to use Prolog as an external tool for verifiable computation. Using Group Relative Policy Optimization (GRPO), we fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset while varying (i) prompt structure, (ii) reward composition (execution, syntax, semantics, structure), and (iii) inference protocol: single-shot, best-of-N, and two agentic modes where Prolog is invoked internally or independently. Our reinforcement learning approach outperforms supervised fine-tuning, with our 3B model achieving zero-shot MMLU performance comparable to 7B few-shot results. Our findings reveal that: 1) joint tuning of prompt, reward, and inference shapes program syntax and logic; 2) best-of-N with external Prolog verification maximizes accuracy on GSM8K; 3) agentic inference with internal repair yields superior zero-shot generalization on MMLU-Stem and MMLU-Pro. These results demonstrate that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications. The source code for reproducing our experiments is available under https://github.com/niklasmellgren/grpo-prolog-inference
comment: 10 pages
LUNE: Efficient LLM Unlearning via LoRA Fine-Tuning with Negative Examples
Large language models (LLMs) possess vast knowledge acquired from extensive training corpora, but they often cannot remove specific pieces of information when needed, which makes it hard to handle privacy, bias mitigation, and knowledge correction. Traditional model unlearning approaches require computationally expensive fine-tuning or direct weight editing, making them impractical for real-world deployment. In this work, we introduce LoRA-based Unlearning with Negative Examples (LUNE), a lightweight framework that performs negative-only unlearning by updating only low-rank adapters while freezing the backbone, thereby localizing edits and avoiding disruptive global changes. Leveraging Low-Rank Adaptation (LoRA), LUNE targets intermediate representations to suppress (or replace) requested knowledge with an order-of-magnitude lower compute and memory than full fine-tuning or direct weight editing. Extensive experiments on multiple factual unlearning tasks show that LUNE: (I) achieves effectiveness comparable to full fine-tuning and memory-editing methods, and (II) reduces computational cost by about an order of magnitude.
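As background for the abstract above, the saving from updating only low-rank adapters can be illustrated by counting trainable parameters: LoRA replaces a dense update of a `d_out x d_in` weight matrix with two factors of shapes `d_out x rank` and `rank x d_in`. The sketch below only counts parameters for one layer; the layer size and rank are hypothetical, and it makes no claim about LUNE's actual configuration or its measured compute savings.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in

def lora_update_params(d_out, d_in, rank):
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
reduction = full_update_params(d_out, d_in) / lora_update_params(d_out, d_in, rank)
```

Because the backbone stays frozen, only the two small factors need gradients and optimizer state, which is the sense in which edits are localized.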
Recover-to-Forget: Gradient Reconstruction from LoRA for Efficient LLM Unlearning
Unlearning in large foundation models (e.g., LLMs) is essential for enabling dynamic knowledge updates, enforcing data deletion rights, and correcting model behavior. However, existing unlearning methods often require full-model fine-tuning or access to the original training data, which limits their scalability and practicality. In this work, we introduce Recover-to-Forget (R2F), a novel framework for efficient unlearning in LLMs based on reconstructing full-model gradient directions from low-rank LoRA adapter updates. Rather than performing backpropagation through the full model, we compute gradients with respect to LoRA parameters using multiple paraphrased prompts and train a gradient decoder to approximate the corresponding full-model gradients. To ensure applicability to larger or black-box models, the decoder is trained on a proxy model and transferred to target models. We provide a theoretical analysis of cross-model generalization and demonstrate that our method achieves effective unlearning while preserving general model performance. Experimental results demonstrate that R2F offers a scalable and lightweight alternative for unlearning in pretrained LLMs without requiring full retraining or access to internal parameters.
Multilingual corpora for the study of new concepts in the social sciences and humanities:
This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences (HSS), illustrated here through the case of "non-technological innovation". The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata. From this initial corpus, a derived dataset in English is created for machine learning purposes. For each occurrence of a term from the expert lexicon, a contextual block of five sentences is extracted (two preceding and two following the sentence containing the term). Each occurrence is annotated with the thematic category associated with the term, enabling the construction of data suitable for supervised classification tasks. This approach results in a reproducible and extensible resource, suitable both for analyzing lexical variability around emerging concepts and for generating datasets dedicated to natural language processing applications.
comment: in French language
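The five-sentence contextual block described in the abstract above (two sentences before and two after the sentence containing the term) can be sketched as follows. The regex sentence splitter and the case-insensitive substring match are simplifying assumptions; the abstract does not say which segmenter or matching rule the authors use, and this sketch yields one block per matching sentence rather than per individual occurrence.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def context_blocks(text, term, before=2, after=2):
    """Return one block of up to before + 1 + after sentences per sentence containing term."""
    sentences = split_sentences(text)
    blocks = []
    for i, sentence in enumerate(sentences):
        if term.lower() in sentence.lower():
            start = max(0, i - before)
            blocks.append(" ".join(sentences[start:i + after + 1]))
    return blocks

# Toy document (hypothetical) with the lexicon term in the fourth sentence.
text = ("S1 one. S2 two. S3 three. We study innovation here. "
        "S5 five. S6 six. S7 seven.")
blocks = context_blocks(text, "innovation")
```

Blocks near the start or end of a document are shorter than five sentences, since the window is clipped to the available text.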
Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models AACL
Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models' actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful self-explanations for continual learning on instruction-tuned models. Our experiments demonstrate that training can improve self-explanation faithfulness across all classification tasks and explanation styles, and that these improvements also show signs of generalization to the multi-word settings and to unseen tasks. Furthermore, we find consistent cross-style generalization among three styles, suggesting that training may contribute to a broader improvement in faithful self-explanation ability.
comment: To appear in the Proceedings of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop (AACL-SRW 2025)
Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data AACL
Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.
comment: Accepted in AACL IJCNLP 2025
TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation AACL
Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu--English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance due to extensive Telugu-specific training data, finetuned SeamlessM4T models demonstrate remarkable competitiveness despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Our metric reliability study evaluating BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu--English translation. The work delivers three key contributions: a reproducible Telugu--English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.
comment: Submitted to AACL IJCNLP 2025
Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection
Error detection (ED), which aims to identify incorrect or inconsistent cell values in tabular data, is important for ensuring data quality. Recent state-of-the-art ED methods leverage the pre-trained knowledge and semantic capability embedded in large language models (LLMs) to directly label whether a cell is erroneous. However, this LLM-as-a-labeler pipeline (1) relies on a black-box, implicit decision process, thus failing to provide explainability for the detection results, and (2) is highly sensitive to prompts, yielding inconsistent outputs due to inherent model stochasticity, therefore lacking robustness. To address these limitations, we propose an LLM-as-an-inducer framework that uses an LLM to induce a decision tree for ED (termed TreeED) and further ensembles multiple such trees for consensus detection (termed ForestED), thereby improving explainability and robustness. Specifically, based on prompts derived from data context, decision tree specifications and output requirements, TreeED queries the LLM to induce the decision tree skeleton, whose root-to-leaf decision paths specify the stepwise procedure for evaluating a given sample. Each tree contains three types of nodes: (1) rule nodes that perform simple validation checks (e.g., format or range), (2) Graph Neural Network (GNN) nodes that capture complex patterns (e.g., functional dependencies), and (3) leaf nodes that output the final decision types (error or clean). Furthermore, ForestED employs uncertainty-based sampling to obtain multiple row subsets, constructing a decision tree for each subset using TreeED. It then leverages an Expectation-Maximization-based algorithm that jointly estimates tree reliability and optimizes the consensus ED prediction. Extensive experiments demonstrate that our methods are accurate, explainable and robust, achieving an average F1-score improvement of 16.1% over the best baseline.
comment: 14 pages, 8 figures
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
NeSTR: A Neuro-Symbolic Abductive Framework for Temporal Reasoning in Large Language Models AAAI 2026
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.
comment: Accepted by AAAI 2026
Generating Storytelling Images with Rich Chains-of-Reasoning
An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.
MASim: Multilingual Agent-Based Simulation for Social Science
Multi-agent role-playing has recently shown promise for studying social behavior with language agents, but existing simulations are mostly monolingual and fail to model cross-lingual interaction, an essential property of real societies. We introduce MASim, the first multilingual agent-based simulation framework that supports multi-turn interaction among generative agents with diverse sociolinguistic profiles. MASim offers two key analyses: (i) global public opinion modeling, by simulating how attitudes toward open-domain hypotheses evolve across languages and cultures, and (ii) media influence and information diffusion, via autonomous news agents that dynamically generate content and shape user behavior. To instantiate simulations, we construct the MAPS benchmark, which combines survey questions and demographic personas drawn from global population distributions. Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.
PICKT: Practical Interlinked Concept Knowledge Tracing for Personalized Learning using Knowledge Map Concept Relations
With the recent surge in personalized learning, Intelligent Tutoring Systems (ITS) that can accurately track students' individual knowledge states and provide tailored learning paths based on this information are in increasing demand. This paper focuses on the core technology of Knowledge Tracing (KT) models that analyze students' sequences of interactions to predict their knowledge acquisition levels. However, existing KT models suffer from limitations such as restricted input data formats, cold start problems arising with new student enrollment or new question addition, and insufficient stability in real-world service environments. To overcome these limitations, a Practical Interlinked Concept Knowledge Tracing (PICKT) model that can effectively process multiple types of input data is proposed. Specifically, a knowledge map structures the relationships among concepts considering the question and concept text information, thereby enabling effective knowledge tracing even in cold start situations. Experiments reflecting real operational environments demonstrated the model's excellent performance and practicality. The main contributions of this research are as follows. First, a model architecture that effectively utilizes diverse data formats is presented. Second, significant performance improvements are achieved over existing models for two core cold start challenges: new student enrollment and new question addition. Third, the model's stability and practicality are validated through careful experimental design, enhancing its applicability in real-world product environments. This provides a crucial theoretical and technical foundation for the practical implementation of next-generation ITS.
comment: 15 pages, 5 figures, 17 tables. Preparing submission for EDM 2026 conference
Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models
As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
GUMBridge: a Corpus for Varieties of Bridging Anaphora
Bridging is an anaphoric phenomenon where the referent of an entity in a discourse is dependent on a previous, non-identical entity for interpretation, such as in "There is 'a house'. 'The door' is red," where the door is specifically understood to be the door of the aforementioned house. While there are several existing resources in English for bridging anaphora, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage for the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report on baseline performance using open and closed source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.
DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning
Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.
comment: Code: https://github.com/nsivaku/dart
A Neural Affinity Framework for Abstract Reasoning: Diagnosing the Compositional Gap in Transformer Architectures via Procedural Task Taxonomy
Responding to Hodel et al.'s (2024) call for a formal definition of task relatedness in re-arc, we present the first 9-category taxonomy of all 400 tasks, validated at 97.5% accuracy via rule-based code analysis. We prove the taxonomy's visual coherence by training a CNN on raw grid pixels (95.24% accuracy on S3, 36.25% overall, 3.3x chance), then apply the taxonomy diagnostically to the original ARC-AGI-2 test set. Our curriculum analysis reveals 35.3% of tasks exhibit low neural affinity for Transformers--a distributional bias mirroring ARC-AGI-2. To probe this misalignment, we fine-tuned a 1.7M-parameter Transformer across 302 tasks, revealing a profound Compositional Gap: 210 of 302 tasks (69.5%) achieve >80% cell accuracy (local patterns) but <10% grid accuracy (global synthesis). This provides direct evidence for a Neural Affinity Ceiling Effect, where performance is bounded by architectural suitability, not curriculum. Applying our framework to Li et al.'s independent ViTARC study (400 specialists, 1M examples each) confirms its predictive power: Very Low affinity tasks achieve 51.9% versus 77.7% for High affinity (p<0.001), with a task at 0% despite massive data. The taxonomy enables precise diagnosis: low-affinity tasks (A2) hit hard ceilings, while high-affinity tasks (C1) reach 99.8%. These findings indicate that progress requires hybrid architectures with affinity-aligned modules. We release our validated taxonomy,
comment: 62 pages, 10 figures
Leveraging KV Similarity for Online Structured Pruning in LLMs
Pruning has emerged as a promising direction for accelerating large language model (LLM) inference, yet existing approaches often suffer from instability because they rely on offline calibration data that may not generalize across inputs. In this work, we introduce Token Filtering, a lightweight online structured pruning technique that makes pruning decisions directly during inference without any calibration data. The key idea is to measure token redundancy via joint key-value similarity and skip redundant attention computations, thereby reducing inference cost while preserving critical information. To further enhance stability, we design a variance-aware fusion strategy that adaptively weights key and value similarity across heads, ensuring that informative tokens are retained even under high pruning ratios. This design introduces no additional memory overhead and provides a more reliable criterion for token importance. Extensive experiments on LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) demonstrate that Token Filtering consistently outperforms prior structured pruning methods, preserving accuracy on commonsense reasoning benchmarks and maintaining strong performance on challenging tasks such as MMLU, even with 50% pruning.
Do Large Language Models Truly Understand Cross-cultural Differences?
In recent years, large language models (LLMs) have demonstrated strong performance on multilingual tasks. Given its wide range of applications, cross-cultural understanding capability is a crucial competency. However, existing benchmarks for evaluating whether LLMs genuinely possess this capability suffer from three key limitations: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited deep cultural reasoning capabilities. To address these gaps, we propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design, to evaluate LLMs' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. Using this framework, we curated 210 core concepts and constructed 4530 test items across 15 specific real-world scenarios, organized under four broader categories of cross-cultural situations, following established item design principles. The SAGE dataset supports continuous expansion, and experiments confirm its transferability to other languages. It reveals model weaknesses across both dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. While progress has been made, LLMs are still some distance away from reaching a truly nuanced cross-cultural understanding. In compliance with the anonymity policy, we include data and code in the supplementary materials. In future versions, we will make them publicly available online.
SETUP: Sentence-level English-To-Uniform Meaning Representation Parser
Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world's languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other of which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.
Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models
Despite substantial investment in safety alignment, the vulnerability of large language models to sophisticated multi-turn adversarial attacks remains poorly characterized, and whether model scale or inference mode affects robustness is unknown. This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors, generating over 97,000 API queries across adversarial conversations with automated evaluation by independent safety classifiers. Results demonstrated a spectrum of vulnerability: six models achieved 96% to 100% attack success rate (ASR), while four showed meaningful resistance, with ASR ranging from 42% to 78%; enabling extended reasoning on identical architecture reduced ASR from 97% to 42%. These findings indicate that safety alignment quality varies substantially across vendors, that model scale does not predict adversarial robustness, and that thinking mode provides a deployable safety enhancement. Collectively, this work establishes that current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, while identifying deliberative inference as a promising defense direction.
comment: 30 pages, 11 figures, 5 tables. Code and data: https://github.com/ricyoung/tempest-replication
Balanced Accuracy: The Right Metric for Evaluating LLM Judges -- Explained through Youden's J statistic
Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden's $J$ statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of $J$. Through both analytical arguments and empirical examples and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.
comment: 9 pages, 5 figures
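The relationship between Youden's $J$ and Balanced Accuracy stated in the abstract above can be checked numerically. The following is a minimal sketch using the standard textbook definitions ($J = \mathrm{TPR} + \mathrm{TNR} - 1$, $\mathrm{BA} = (\mathrm{TPR} + \mathrm{TNR})/2$); the function names and the confusion-matrix counts are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: standard definitions only, counts below are made up.

def rates(tp, fn, tn, fp):
    """Return (true positive rate, true negative rate) from confusion counts."""
    return tp / (tp + fn), tn / (tn + fp)

def youden_j(tp, fn, tn, fp):
    tpr, tnr = rates(tp, fn, tn, fp)
    return tpr + tnr - 1.0

def balanced_accuracy(tp, fn, tn, fp):
    tpr, tnr = rates(tp, fn, tn, fp)
    return (tpr + tnr) / 2.0

def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)

# Hypothetical imbalanced evaluation set: 100 positives, 900 negatives.
judge_a = dict(tp=90, fn=10, tn=810, fp=90)  # informative judge
judge_b = dict(tp=0, fn=100, tn=900, fp=0)   # always answers "negative"

for name, counts in (("A", judge_a), ("B", judge_b)):
    j = youden_j(**counts)
    ba = balanced_accuracy(**counts)
    # BA is the linear transformation (J + 1) / 2 of Youden's J.
    assert abs(ba - (j + 1.0) / 2.0) < 1e-12
    print(name, "accuracy:", accuracy(**counts), "BA:", ba, "J:", j)
```

In this made-up example both judges have the same plain accuracy, yet the always-negative judge has a Balanced Accuracy of 0.5 and $J = 0$, which illustrates the sensitivity to class imbalance that the abstract describes.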
Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
Adaptation of Embedding Models to Financial Filings via LLM Distillation
Despite advances in generative large language models (LLMs), practical application of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains like finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general purpose retrieval embedding model as foundation. Our method yields an average of 27.7% improvement in MRR$\texttt{@}$5, 44.6% improvement in mean DCG$\texttt{@}$5 across 14 financial filing types measured over 21,800 query-document pairs, and improved NDCG on 3 of 4 document classes in FinanceBench. We adapt retrieval embeddings (bi-encoder) for RAG, not LLM generators, using LLM-judged relevance to distill domain knowledge into a compact retriever. There are prior works which pair synthetically generated queries with real passages to directly fine-tune the retrieval model. Our pipeline differs from these by introducing interaction between student and teacher models that interleaves retrieval-based mining of hard positive/negative examples from the unlabeled corpus with iterative retraining of the student model's weights using these examples. Each retrieval iteration uses the refined student model to mine the corpus for progressively harder training examples for the subsequent training iteration. The methodology provides a cost-effective solution to bridging the gap between general-purpose models and specialized domains without requiring labor-intensive human annotation.
comment: In proceedings of LLM-Finance 2025 : The 2nd IEEE International Workshop on Large Language Models for Finance
Short-Context Dominance: How Much Local Context Natural Language Actually Needs?
We investigate the short-context dominance hypothesis: that for most sequences, a small local prefix suffices to predict their next tokens. Using large language models as statistical oracles, we measure the minimum context length (MCL) needed to reproduce accurate full-context predictions across datasets with sequences of varying lengths. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. Given the dominance of short-context tokens, we then ask whether it is possible to detect challenging long-context sequences for which a short local prefix does not suffice for prediction. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next-token and is compatible with sampling strategies beyond greedy decoding. Our experiments validate that simple thresholding of the metric defining DaMCL achieves high performance in detecting long vs. short context sequences. Finally, to counter the bias that short-context dominance induces in LLM output distributions, we develop an intuitive decoding algorithm that leverages our detector to identify and boost tokens that are long-range-relevant. Across Q&A tasks and model architectures, we confirm that mitigating the bias improves performance.
comment: 38 pages, 7 figures, includes appendix and references
Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS
Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance. This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation NeurIPS 2025
Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of estimating the benchmark score and reduces variance. Multiple generations also allow us to define $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes difficulty and semantics of prompts, enabling error detection and quality control in benchmark construction.
comment: Accepted in NeurIPS 2025 Workshop on LLM Evals
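As a rough illustration of the prompt-level score described in the abstract above, the sketch below estimates $\mathbb P(\text{correct})$ for each prompt as the fraction of correct generations and averages these into a benchmark score. It is a simplified empirical estimate under assumed inputs, not the paper's hierarchical statistical model; the function names and the toy outcome matrix are hypothetical.

```python
# Illustrative sketch: plain correct-ratio estimates, not the paper's model.

def prompt_p_correct(outcomes):
    """outcomes[i][j] is 1 if generation j for prompt i was correct, else 0.

    Returns the per-prompt fraction of correct generations.
    """
    return [sum(row) / len(row) for row in outcomes]

def benchmark_score(outcomes):
    """Average the per-prompt correct ratios into a single score."""
    per_prompt = prompt_p_correct(outcomes)
    return sum(per_prompt) / len(per_prompt)

# Hypothetical toy data: 3 prompts, 4 sampled generations each.
outcomes = [
    [1, 1, 1, 1],  # easy prompt
    [1, 0, 0, 1],  # borderline prompt
    [0, 0, 0, 0],  # hard prompt
]

print(prompt_p_correct(outcomes))
print(benchmark_score(outcomes))
```

With a single generation per prompt, the borderline prompt would be scored as either 0 or 1 depending on the sample drawn; averaging over several generations exposes its intermediate difficulty.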
Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping AAAI 2026
The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
comment: Accepted to AAAI 2026 AI Alignment Track
Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code edit, and self-reflection that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly-available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.
comment: 68 pages. GitHub repo at https://github.com/MoonshotAI/Kimi-Dev
PhyloLM : Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks ICLR 2025
This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs' output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
comment: The project code is available at https://github.com/Nicolas-Yax/PhyloLM . Published as https://iclr.cc/virtual/2025/poster/28195 at ICLR 2025. A code demo is available at https://colab.research.google.com/drive/1agNE52eUevgdJ3KL3ytv5Y9JBbfJRYqd
Is Self-Supervised Learning Enough to Fill in the Gap? A Study on Speech Inpainting
Speech inpainting consists in reconstructing corrupted or missing speech segments using surrounding context, a process that closely resembles the pretext tasks in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, and simply adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task -- here, inpainting. Practically, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder's output and (2) fine-tuning the encoder for an inpainting task based on a frozen decoder's input. Evaluations are conducted under single- and multi-speaker conditions using in-domain datasets and out-of-domain datasets (including unseen speakers, diverse speaking styles, and noise). Both informed and blind inpainting scenarios are considered, where the position of the corrupted segment is either known or unknown. The proposed SSL-based methods are benchmarked against several baselines, including a text-informed method combining automatic speech recognition with zero-shot text-to-speech synthesis. Performance is assessed using objective metrics and perceptual evaluations. The results demonstrate that both approaches outperform baselines, successfully reconstructing speech segments up to 200 ms, and sometimes up to 400 ms. Notably, fine-tuning the SSL encoder achieves more accurate speech reconstruction in single-speaker settings, while a pre-trained encoder proves more effective for multi-speaker scenarios. This demonstrates that an SSL pretext task can transfer to speech inpainting, enabling successful speech reconstruction with a pre-trained encoder.
comment: Accepted for publication to Computer Speech and Language journal (to appear)
DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue
Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a calibrated temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (55.4% on Phi-3.5) while maintaining robust zero-shot generalization. Our scaling analysis reveals a "Capacity-Stability Trade-off": while smaller models incur an "alignment tax" (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves 50.8% win rate with negligible perplexity overhead. This confirms that TAI can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available: https://github.com/lyj20071013/DZ-TDPO
comment: 25 pages, 3 figures, 17 tables. Code available at https://github.com/lyj20071013/DZ-TDPO
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization NeurIPS 2025
Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks, especially regarding the position of the layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose $\textbf{HybridNorm}$, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence to demonstrate that HybridNorm improves the gradient flow and the model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
comment: Accepted by NeurIPS 2025
TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B
The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India's most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
comment: Work in progress
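The CKA similarity mentioned in the TRepLiNa abstract can be illustrated with the commonly used linear variant. This is a minimal sketch, assuming two matrices of layer activations with one row per aligned sentence; the function name `linear_cka` and the choice of the linear kernel are assumptions for illustration, and the abstract does not say how the score is weighted against the REPINA term in the training loss.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    The two matrices must have the same number of rows (aligned samples);
    their feature dimensions may differ.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

A score near 1 means the two sets of representations agree up to rotation and isotropic scaling, so an alignment objective would typically try to increase it for the selected layer.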
Exploring the Potential of Encoder-free Architectures in 3D LMMs
Encoder-free architectures have been preliminarily explored in 2D Large Multimodal Models (LMMs), yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D LMMs. These long-standing challenges include the failure to adapt to varying point cloud resolutions during inference and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the pre-trained encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the state-of-the-art model, PointLLM-PiSA-13B, achieving 57.91%, 61.0%, and 55.20% on the classification, captioning, and VQA tasks, respectively. Our results show that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations
Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context $C$ into answer $A$ via prompt $Q$. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from $C$ to $Q$ and $A$ are modeled as transition matrices ${\bf Q}$ and ${\bf A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval [0,1], where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.
comment: 23 pages, 6 figures
Transparent and Coherent Procedural Mistake Detection EMNLP 2025
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
comment: EMNLP 2025
Rethinking LLM Training through Information Geometry and Quantum Metrics
Optimization in large language models (LLMs) unfolds over high-dimensional parameter spaces with non-Euclidean structure. Information geometry frames this landscape using the Fisher information metric, enabling more principled learning via natural gradient descent. Though often impractical, this geometric lens clarifies phenomena such as sharp minima, generalization, and observed scaling laws. We argue that curvature-based approaches deepen our understanding of LLM training. Finally, we speculate on quantum analogies based on the Fubini-Study metric and Quantum Fisher Information, hinting at efficient optimization in quantum-enhanced systems.
comment: 9 pages, 1 figure(s)
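For reference, the natural gradient update that the abstract refers to is usually written as below. The symbols ($\theta$ for parameters, $\eta$ for the step size, $L$ for the loss, $p_\theta$ for the model distribution) are standard notation assumed here, not taken from the paper.

```latex
F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top} \right],
\qquad
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t)
```

Forming and inverting the dense matrix $F(\theta)$ scales cubically with the number of parameters, which is one reason the method is often impractical at LLM scale.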
Understanding Syntactic Generalization in Structure-inducing Language Models
Structure-inducing Language Models (SiLMs) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. SiLMs couple strong syntactic generalization behavior with competitive performance on various NLP tasks, but many of their basic properties remain underexplored. In this work, we train three different SiLM architectures from scratch: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022), and GPST (Hu et al., 2024b). We train these architectures on both natural language (English, German, and Chinese) corpora and synthetic bracketing expressions. The models are then evaluated with respect to (i) properties of the induced syntactic representations, (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.
comment: Code available at https://github.com/davidarps/silm
Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
comment: 11 pages, 7 figures, technical report (preprint)
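The core loop described in the OVI abstract (random initialization, then iterative updates that raise the cosine similarity with the text embedding) can be sketched in a few lines. This is a toy illustration under strong assumptions: both embeddings are plain vectors in one shared space, the update is vanilla gradient ascent with a hand-derived gradient, and the Mahalanobis and Nearest-Neighbor constraints are left out. The names `optimize_visual_embedding`, `steps`, and `lr` are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad(z, t):
    """Gradient of cosine(z, t) with respect to z."""
    nz, nt = np.linalg.norm(z), np.linalg.norm(t)
    return t / (nz * nt) - (z @ t) * z / (nz ** 3 * nt)

def optimize_visual_embedding(text_emb, steps=500, lr=1.0, seed=0):
    """Gradient ascent on cosine similarity from a random starting vector."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=text_emb.shape)  # random initialization
    for _ in range(steps):
        z = z + lr * cosine_grad(z, text_emb)  # ascend the similarity
    return z
```

Without any regularizer, this loop simply rotates the vector toward the text embedding, which mirrors the abstract's observation that the text embedding alone already serves as a strong baseline prior.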
DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors
We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, making it easier to distinguish well-performing models from low-performing ones.
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization AAAI 2026
The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, $\eta = U/C$. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
comment: Accepted to AAAI 2026 (Oral Presentation)
Why Chain of Thought Fails in Clinical Text Understanding
Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3\% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.
A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese ACL
We present ZhoBLiMP, the largest linguistic minimal pair benchmark for Chinese, with over 100 paradigms, ranging from topicalization to the \textit{Ba} construction. We then train from scratch a suite of Chinese language models (LMs) with different tokenizers, parameter sizes, and token volumes, to study the learning curves of LMs on Chinese. To mitigate the biases introduced by unequal lengths of the sentences in a minimal pair, we propose a new metric named sub-linear length normalized log-probabilities (SLLN-LP). Using SLLN-LP as the metric, our results show that \textsc{Anaphor}, \textsc{Quantifiers}, and \textsc{Ellipsis} in Chinese are difficult for LMs even up to 32B parameters, and that SLLN-LP successfully mitigates biases in ZhoBLiMP, JBLiMP and BLiMP. We conclude that future evaluations should be more carefully designed to consider the intricate relations between linking functions, LMs, and targeted minimal pairs.
comment: Accepted by TACL
SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents
Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce $\textbf{SimuHome}$, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol, the global industry standard for smart home communication, SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 16 agents under a unified ReAct framework reveals distinct capabilities and limitations across models. Models under 7B parameters exhibited negligible performance across all query types. Even GPT-4.1, the best-performing standard model, struggled with implicit intent inference, state verification, and particularly temporal scheduling. While reasoning models such as GPT-5.1 consistently outperformed standard models on every query type, they required over three times the average inference time, which can be prohibitive for real-time smart home applications. This highlights a critical trade-off between task performance and real-world practicality.
comment: 10 pages
Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference
Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also showed significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation ICML 2025
Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
comment: ByteDance Seed template, ICML 2025
MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation
Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.
comment: This is an earlier version of the paper - ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering. The latest version is available at: (arXiv:2512.05430)
Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.
Process Reward Models That Think
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
comment: Add new ablation and minor writing fixes
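Best-of-N selection with a process reward model, as mentioned in the ThinkPRM abstract, can be sketched as follows. This is a minimal illustration, assuming each candidate solution already carries one verifier score per step in [0, 1]; the function names and the two aggregation rules (`min` and `prod`) are assumptions for illustration, since the abstract does not state how step scores are combined.

```python
from math import prod

def solution_score(step_scores, aggregate="min"):
    """Collapse per-step verifier scores into one solution-level score."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if aggregate == "min":
        return min(step_scores)   # a solution is as good as its weakest step
    if aggregate == "prod":
        return prod(step_scores)  # treat steps as independent correctness events
    raise ValueError(f"unknown aggregate: {aggregate}")

def best_of_n(candidates, aggregate="min"):
    """Return the text of the highest-scoring candidate.

    candidates: list of (solution_text, step_scores) pairs.
    """
    best = max(candidates, key=lambda c: solution_score(c[1], aggregate))
    return best[0]

# Hypothetical scores for three sampled solutions to the same problem.
candidates = [
    ("A", [0.9, 0.2, 0.95]),
    ("B", [0.7, 0.8, 0.75]),
    ("C", [0.99, 0.99, 0.1]),
]
```

With these made-up scores, both aggregation rules prefer candidate "B", because the other two each contain one step the verifier considers likely wrong.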
Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Unveiling AI's Potential Through Tools, Techniques, and Applications
Artificial intelligence (AI), machine learning, and deep learning have become transformative forces in big data analytics and management, enabling groundbreaking advancements across diverse industries. This article delves into the foundational concepts and cutting-edge developments in these fields, with a particular focus on large language models (LLMs) and their role in natural language processing, multimodal reasoning, and autonomous decision-making. Highlighting tools such as ChatGPT, Claude, and Gemini, the discussion explores their applications in data analysis, model design, and optimization. The integration of advanced algorithms like neural networks, reinforcement learning, and generative models has enhanced the capabilities of AI systems to process, visualize, and interpret complex datasets. Additionally, the emergence of technologies like edge computing and automated machine learning (AutoML) democratizes access to AI, empowering users across skill levels to engage with intelligent systems. This work also underscores the importance of ethical considerations, transparency, and fairness in the deployment of AI technologies, paving the way for responsible innovation. Through practical insights into hardware configurations, software environments, and real-world applications, this article serves as a comprehensive resource for researchers and practitioners. By bridging theoretical underpinnings with actionable strategies, it showcases the potential of AI and LLMs to revolutionize big data management and drive meaningful advancements across domains such as healthcare, finance, and autonomous systems.
comment: This book contains 155 pages and 9 figures
Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Handy Appetizer
This book explores the role of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) in driving the progress of big data analytics and management. The book focuses on simplifying the complex mathematical concepts behind deep learning, offering intuitive visualizations and practical case studies to help readers understand how neural networks and technologies like Convolutional Neural Networks (CNNs) work. It introduces several classic models and technologies such as Transformers, GPT, ResNet, BERT, and YOLO, highlighting their applications in fields like natural language processing, image recognition, and autonomous driving. The book also emphasizes the importance of pre-trained models and how they can enhance model performance and accuracy, with instructions on how to apply these models in various real-world scenarios. Additionally, it provides an overview of key big data management technologies like SQL and NoSQL databases, as well as distributed computing frameworks such as Apache Hadoop and Spark, explaining their importance in managing and processing vast amounts of data. Ultimately, the book underscores the value of mastering deep learning and big data management skills as critical tools for the future workforce, making it an essential resource for both beginners and experienced professionals.
comment: This book contains 93 pages and 60 figures
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
Non-Collaborative User Simulators for Tool Agents
Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, so they cannot be used to train and test agents against the non-collaborative users found in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $τ$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
comment: 10 pages
Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs
Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground-truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring), varying agent counts $n\in\{4,5,6\}$, attacker-target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker-target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker-target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology-aware access controls.
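As a rough illustration of the quantities this abstract describes, the sketch below builds the six named topologies, measures attacker-target graph distance, and computes leakage as the fraction of ground-truth PII strings recovered. It is not the MAMA implementation: the function names are hypothetical, the star-ring layout (a hub linked to leaves that also form a ring) is an assumed reading of the term, and "exact matching" is approximated here by exact substring search.

```python
from collections import deque


def build_topology(name, n):
    """Return an undirected adjacency map {agent: set(neighbours)} for n agents."""
    adj = {i: set() for i in range(n)}

    def link(a, b):
        # Ignore self-loops so degenerate sizes stay well-formed.
        if a != b:
            adj[a].add(b)
            adj[b].add(a)

    if name == "fully_connected":
        for a in range(n):
            for b in range(a + 1, n):
                link(a, b)
    elif name == "chain":
        for a in range(n - 1):
            link(a, a + 1)
    elif name == "ring":
        for a in range(n):
            link(a, (a + 1) % n)
    elif name == "star":
        for a in range(1, n):
            link(0, a)
    elif name == "binary_tree":
        # Heap-style indexing: the parent of node a is (a - 1) // 2.
        for a in range(1, n):
            link(a, (a - 1) // 2)
    elif name == "star_ring":
        # Assumed layout: hub 0 linked to every leaf, leaves 1..n-1 form a ring.
        for a in range(1, n):
            link(0, a)
            link(a, a + 1 if a + 1 < n else 1)
    else:
        raise ValueError(f"unknown topology: {name}")
    return adj


def graph_distance(adj, source, target):
    """Breadth-first shortest-path length in hops, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def leakage_rate(ground_truth_pii, attacker_outputs):
    """Fraction of PII strings that appear verbatim in any attacker output."""
    if not ground_truth_pii:
        return 0.0
    recovered = sum(
        any(item in output for output in attacker_outputs)
        for item in ground_truth_pii
    )
    return recovered / len(ground_truth_pii)
```

For example, `graph_distance(build_topology("chain", 5), 0, 4)` gives the maximum separation a five-agent chain allows, while the same call on `"fully_connected"` always yields one hop.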
Surveying the MLLM Landscape: A Meta-Review of Current Surveys
The rise of Multimodal Large Language Models (MLLMs) has become a transformative force in the field of artificial intelligence, enabling machines to process and generate content across multiple modalities, such as text, images, audio, and video. These models represent a significant advancement over traditional unimodal systems, opening new frontiers in diverse applications ranging from autonomous agents to medical diagnostics. By integrating multiple modalities, MLLMs achieve a more holistic understanding of information, closely mimicking human perception. As the capabilities of MLLMs expand, the need for comprehensive and accurate performance evaluation has become increasingly critical. This survey aims to provide a systematic review of benchmark tests and evaluation methods for MLLMs, covering key topics such as foundational concepts, applications, evaluation methodologies, ethical concerns, security, efficiency, and domain-specific applications. Through the classification and analysis of existing literature, we summarize the main contributions and methodologies of various surveys, conduct a detailed comparative analysis, and examine their impact within the academic community. Additionally, we identify emerging trends and underexplored areas in MLLM research, proposing potential directions for future studies. This survey is intended to offer researchers and practitioners a comprehensive understanding of the current state of MLLM evaluation, thereby facilitating further progress in this rapidly evolving field.
comment: The article consists of 22 pages, including 2 figures and 108 references. The paper provides a meta-review of surveys on Multimodal Large Language Models (MLLMs), categorizing findings into key areas such as evaluation, applications, security, and future directions
Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment ACL
Multi-Objective Alignment (MOA) aims to align LLMs' responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines. Code is available at https://github.com/zyttt-coder/SIPO.
comment: ACL findings (2025)
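The notions of preference conflict and Pareto optimality used in this abstract can be made concrete with a small sketch. The code below is illustrative only and is not the SIPO code: it assumes each response already carries one numeric score per objective (higher is better), and all names and scores are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def has_preference_conflict(a, b):
    """True if the objectives disagree about which of the two responses is better."""
    return any(x > y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scored):
    """Return the responses that no other response dominates.

    scored maps a response id to its tuple of per-objective scores.
    """
    return [
        rid
        for rid, scores in scored.items()
        if not any(
            dominates(other, scores)
            for other_id, other in scored.items()
            if other_id != rid
        )
    ]


# Hypothetical scores on two objectives, e.g. (helpfulness, harmlessness).
scored = {
    "A": (0.9, 0.2),    # preferred by objective 1
    "B": (0.3, 0.8),    # preferred by objective 2
    "C": (0.95, 0.85),  # better than both A and B on every objective
}
conflict = has_preference_conflict(scored["A"], scored["B"])
front = pareto_front(scored)
```

In this toy setting A and B conflict, while C dominates both, so pairing C against either of them gives a preference on which all objectives agree.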
Latent Collaboration in Multi-Agent Systems
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
comment: Project: https://github.com/Gen-Verse/LatentMAS
Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models EMNLP 2025
As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Developed from extensive open-source data collection and industry-specific demands, this benchmark thoroughly assesses models' language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research provides a practical evaluation tool for financial LLMs and guides future development and optimization. The source code for Golden Touchstone and model weight of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone.
comment: Published in Findings of EMNLP 2025
Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation ACL 25
The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, we first propose a rationale extraction method that leverages the reasoning capabilities of Large Language Models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.
comment: Accepted to ACL 25 Findings
The AI Consumer Index (ACE)
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
The AI Productivity Index (APEX)
We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference NDSS
The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.
comment: This paper is accepted by Network and Distributed System Security Symposium (NDSS) 2026
LMSpell: Neural Spell Checking for Low-Resource Languages
Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.
AutoNeural: Co-Designing Vision-Language Models for NPU Inference
While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x the decoding speed and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
SENSE models: an open source solution for multilingual and multimodal semantic-based tasks
This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.
comment: Accepted to IEEE ASRU 2025
CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse barcode DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
comment: Add Variations of DNA encoding
Shrinking the Generation-Verification Gap with Weak Verifiers NeurIPS
Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as the generator and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.
comment: Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
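To illustrate why a weighted ensemble can beat an unweighted vote when verifier accuracies differ, here is a minimal sketch. It is not Weaver itself: Weaver estimates verifier accuracies with weak supervision, whereas this example simply assumes the accuracies are known and uses the standard log-odds weighting for independent binary voters. All names and numbers are hypothetical.

```python
import math


def log_odds_weight(accuracy):
    """Weight of a binary verifier with the given accuracy (0 < accuracy < 1)."""
    return math.log(accuracy / (1.0 - accuracy))


def unweighted_score(votes):
    """Majority-vote score: +1 for each accept, -1 for each reject."""
    return sum(1 if v else -1 for v in votes)


def weighted_score(votes, accuracies):
    """Log-odds-weighted score; more accurate verifiers count for more."""
    return sum(
        log_odds_weight(acc) * (1 if v else -1)
        for v, acc in zip(votes, accuracies)
    )


def select(candidates, score_fn):
    """Return the candidate id whose verifier votes score highest."""
    return max(candidates, key=lambda cid: score_fn(candidates[cid]))


# Assumed accuracies: one strong verifier and two that are barely above chance.
accuracies = [0.90, 0.55, 0.55]

# Each candidate response maps to the accept/reject votes of the three verifiers.
candidates = {
    "response_x": [1, 0, 0],  # accepted only by the strong verifier
    "response_y": [0, 1, 1],  # accepted only by the two weak verifiers
}

majority_pick = select(candidates, unweighted_score)
weighted_pick = select(candidates, lambda v: weighted_score(v, accuracies))
```

Here the majority vote follows the two weak verifiers, while the weighted score sides with the single strong one, which is the kind of disagreement the abstract attributes to differences in verifier accuracy.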
The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
comment: 60 pages, 9 figures. v3: Major update. Added 3D topological visualization (Figure 1) and independent computational verification of the Adaptive Markets Hypothesis (AMH). Includes comprehensive Supplementary Materials (algorithmic pseudocode, system architecture, and real-time GARCH logs) for technical reproducibility
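The abstract reports Jensen-Shannon divergence between generated and human text. The sketch below shows the generic definition of that metric for two discrete distributions; it is not the paper's evaluation pipeline. The abstract does not say which features are compared or which logarithm base is used, so the unigram token frequencies and base-2 logarithm here are assumptions, and the helper names are hypothetical.

```python
import math
from collections import Counter


def token_distribution(tokens, vocab):
    """Relative frequency of each vocabulary item in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocab]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy example with hypothetical token lists.
human = "the market fell hard and then the market rose".split()
synthetic = "the market rose and the market rose again".split()
vocab = sorted(set(human) | set(synthetic))
divergence = js_divergence(
    token_distribution(human, vocab),
    token_distribution(synthetic, vocab),
)
```

With base-2 logarithms, identical distributions score 0 and distributions with disjoint support score 1, which gives a scale for reading values such as those quoted in the abstract, provided the same base applies there.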
Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning
Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
comment: 14 pages, 8 figures, 5 tables
Understanding LLM Reasoning for Abstractive Summarization
While the reasoning capabilities of Large Language Models (LLMs) excel in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large-scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution and its effectiveness is highly dependent on the specific strategy and context. Specifically, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM's internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative over-thinking.
comment: 26 pages, 15 figures
Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have shown promise as a tool for generating summaries of large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to the input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and more aligned with human judgments than a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.
EEG-to-Text Translation: A Model for Deciphering Human Brain Activity
With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00% (P), which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with an F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at https://github.com/Mmurrad/EEG-To-text.
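For readers unfamiliar with the error metrics quoted above, the following sketch shows the conventional definitions of character error rate (CER) and word error rate (WER) as edit distance divided by reference length. It is a generic illustration, not the authors' evaluation script, and the `relative_reduction` helper is a hypothetical name added here for comparing two error rates.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete r from the reference
                cur[j - 1] + 1,          # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, free if the symbols match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline, value):
    """Relative decrease of an error rate with respect to a baseline."""
    return (baseline - value) / baseline


# Relative WER reductions computed from the scores quoted in the abstract.
wer_vs_t5 = relative_reduction(0.7610, 0.7280)
wer_vs_brain = relative_reduction(0.7553, 0.7280)
```

Read as relative reductions, the quoted WER scores reproduce the stated 4.3% and 3.6% gaps over T5 and Brain Translator.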
Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores
Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. We present Compactor, a training-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining 20% fewer tokens in both synthetic and real-world context tasks, while being more task-robust. We further introduce a procedure for context-calibrated compression: inferring the maximum compression a given context supports before significant performance loss. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 68%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families. Finally, we release compactor-vllm, an inference engine and suite of optimized Triton kernels designed to efficiently support the sparse, non-contiguous memory access patterns inherent to compressed KV caches. This work demonstrates that Compactor offers a practical, high-performance solution for alleviating the memory bottleneck in modern LLM deployment.
Representational Stability of Truth in Large Language Models
Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
comment: 25 pages, 24 figures
Human-Computer Interaction
Graph-Based Learning of Spectro-Topographical EEG Representations with Gradient Alignment for Brain-Computer Interfaces
We present graph-based learning of EEG representations with gradient alignment (GEEGA), a novel method that leverages multi-domain information to learn EEG representations for brain-computer interfaces. Our model uses graph convolutional networks to fuse embeddings from frequency-based topographical maps and time-frequency spectrograms, capturing inter-domain relationships. GEEGA addresses the challenge of achieving high inter-class separability, which arises from the temporally dynamic and subject-sensitive nature of EEG signals, by incorporating the center loss and pairwise difference loss. Additionally, GEEGA incorporates a gradient alignment strategy to resolve conflicts between gradients from different domains and the fused embeddings, ensuring that discrepancies, where gradients point in conflicting directions, are aligned toward a unified optimization direction. We validate the efficacy of our method through extensive experiments on three publicly available EEG datasets: BCI-2a, CL-Drive and CLARE. Comprehensive ablation studies further highlight the impact of various components of our model.
Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support
LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.
Context-Adaptive Color Optimization for Web Accessibility: Balancing Perceptual Fidelity and Functional Requirements
We extend our OKLCH-based accessibility optimization with context-adaptive constraint strategies that achieve near-universal success rates across diverse use cases. Our original strict algorithm reached 66-77% success by prioritizing minimal perceptual change ($ΔE \leq 5.0$), optimizing for enterprise contexts where brand fidelity is paramount. However, this one-size-fits-all approach fails to serve the broader ecosystem of web developers who need accessible solutions even when strict perceptual constraints cannot be satisfied. We introduce recursive optimization (Mode 1) that compounds small adjustments across iterations, achieving 93.68% success on all color pairs and 100% success on reasonable pairs (contrast ratio $ρ > 2.0$), representing a +27.23 percentage point improvement. A relaxed fallback mode (Mode 2) handles pathological edge cases, reaching 98.73% overall success. Evaluation on 10,000 realistic web color pairs demonstrates that context-aware constraint relaxation, combined with absolute hue preservation, enables practical accessibility compliance while maintaining brand color identity. The median perceptual change remains zero across all modes (most pairs already comply), while the 90th percentile reaches $ΔE_{2000} = 15.55$ in Mode 1 -- perceptually acceptable when hue invariance preserves the essential character of the original color. The approach is deployed in CM-Colors v0.5.0 (800+ monthly downloads), providing developers with explicit control over the accessibility-fidelity trade-off appropriate to their context.
comment: 8 pages, 2 figures
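The contrast ratio mentioned in the abstract above can be made concrete with a short sketch. It assumes the ratio is the WCAG 2.x definition based on sRGB relative luminance, which the abstract does not state explicitly. The helper names are illustrative and are not the CM-Colors API.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel_8bit / 255.0
    # sRGB breakpoint; for 8-bit inputs it gives the same results as the
    # 0.03928 value that appears in the WCAG 2.0/2.1 text.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    """Relative luminance in [0, 1] of an sRGB color."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: RGB, bg: RGB) -> float:
    """Contrast ratio in [1, 21]; the order of the two colors does not matter."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)
```

Under WCAG 2.x, normal-size text meets level AA at a ratio of at least 4.5. An optimizer of the kind described would adjust lightness or chroma until a check such as `contrast_ratio(fg, bg) >= 4.5` passes.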
A Retrospective on Ultrasound Mid-Air Haptics in HCI
In 2013, the UltraHaptics system demonstrated that focused ultrasound could generate perceivable mid-air tactile sensations, building on earlier explorations of airborne ultrasound as a haptic medium. These contributions established ultrasound mid-air haptics (UMH) as a viable interaction modality and laid the technical and perceptual foundations for subsequent advances in Human-Computer Interaction (HCI). In this extended abstract, we revisit this formative work, trace the research and design trajectories it enabled, and reflect on how UMH has supported multisensory interaction, immersion, and inclusion. We also highlight how this line of research exemplifies the value of interdisciplinary collaboration to advance novel interactive technologies.
comment: 3 pages; AlpCHI 2026 Revisiting HCI Research Track
SemanticTours: A Conceptual Framework for Non-Linear, Knowledge Graph-Driven Data Tours
Interactive tours help users explore datasets and provide onboarding. They rely on a linear sequence of views that shows a curated set of relevant data selections and introduces user interfaces. Existing tour frameworks, however, often do not allow for branching and refining hypotheses outside of a rigid sequence, which is important in knowledge-centric domains such as law. For example, lawyers performing analytical case analysis need to iteratively weigh up different legal norms and construct strings of arguments. To address this gap, we propose SemanticTours, a semantic, graph-based model of tours that shifts from sequence-based to graph-based navigation. Our model constructs a domain-specific knowledge graph that connects data elements based on user-definable semantic relationships. These relationships enable non-linear graph navigation that defines tours. We apply SemanticTours to the domain of law and conceptualize a visual analytics design and interaction concept for analytical reasoning in legal case analysis. Our concept accounts for the inherent complexity of graph-based tours by using aggregated graph nodes and supporting navigation with a semantic lens. In an evaluation, six domain experts from law suggested that graph-based tours better support their analytical reasoning than sequences. Our work opens research opportunities for such tours to support analytical reasoning in law and other knowledge-centric domains.
comment: 14 pages, 9 figures, 2 tables
Living the Novel: A System for Generating Self-Training Timeline-Aware Conversational Agents from Novels
We present the Living Novel, an end-to-end system that transforms any literary work into an immersive, multi-character conversational experience. This system is designed to solve two fundamental challenges for LLM-driven characters. Firstly, generic LLMs suffer from persona drift, often failing to stay in character. Secondly, agents often exhibit abilities that extend beyond the constraints of the story's world and logic, leading to both narrative incoherence (spoiler leakage) and robustness failures (frame-breaking). To address these challenges, we introduce a novel two-stage training pipeline. Our Deep Persona Alignment (DPA) stage uses data-free reinforcement finetuning to instill deep character fidelity. Our Coherence and Robustness Enhancing (CRE) stage then employs a story-time-aware knowledge graph and a second retrieval-grounded training pass to architecturally enforce these narrative constraints. We validate our system through a multi-phase evaluation using Jules Verne's Twenty Thousand Leagues Under the Sea. A lab study with a detailed ablation of system components is followed by a 5-day in-the-wild diary study. Our DPA pipeline helps our specialized model outperform GPT-4o on persona-specific metrics, and our CRE stage achieves near-perfect performance in coherence and robustness measures. Our study surfaces practical design guidelines for AI-driven narrative systems: we find that character-first self-training is foundational for believability, while explicit story-time constraints are crucial for sustaining coherent, interruption-resilient mobile-web experiences.
Breaking Players' Expectations: The Role of Non-Player Characters' Coherence and Consistency
In video games, non-player characters (NPCs) play a pivotal role in shaping players' experiences. The design of these characters, encompassing their appearance and behaviors, can be manipulated in terms of coherence and consistency to maintain players' expectations or, on the contrary, to surprise them. The extent to which NPCs' coherence and consistency influence players' evaluation of them remains to be unveiled. To address this knowledge gap, two experiments were conducted in the context of a military shooter game. Players' evaluations of NPCs' perceived intelligence and believability were measured, as these two dimensions are fundamental to players' adoption of NPCs and subsequent commitment to them. The first experiment investigated the impact of disrupting players' initial expectations on their evaluations of NPCs. The second experiment focused on the influence of NPCs' coherence and consistency on both players' expectations and evaluation of NPCs, using a combination of questionnaires and behavioral and physiological measures. The results of our study show that disrupting players' initial expectations influences their assessment of NPCs, with coherent and consistent design reinforcing expectations and incoherent design challenging them.
The Impact of Spatial Misalignment and Time Delay on Collaborative Presence in Augmented Reality
Precise temporal and spatial alignment is critical in collaborative Augmented Reality (AR) where users rely on shared visual information to coordinate actions. System latency and object misalignment can disrupt communication, reduce task efficiency, and negatively impact the overall user experience. While previous research has primarily focused on individual AR interactions, the impact of these inconsistencies on collaboration remains underexplored. This article investigates how user experience and task load are affected by object misalignment and time delay in a shared AR space. To examine these factors, we conducted an experiment with 32 participants, organized into 16 pairs, who collaboratively completed a spatial placement task. Within each condition, both participants alternated roles, taking turns as the leader (providing verbal placement instructions) and the builder (executing the placement). Six conditions were tested, manipulating object alignment (perfectly aligned vs. randomly misaligned) and time delay (0 s, 0.1 s, 0.4 s). The misalignment was applied randomly to each virtual object with a shift of ±20 cm on every axis to create a clear distinction in spatial perception. User experience and task load were assessed to evaluate how these factors influence collaboration and interaction in AR environments. Results showed that spatial misalignment significantly increased perceived workload (NASA-TLX) and lowered user ratings in Pragmatic quality and Attractiveness (UEQ), while time delay had a more limited effect. These findings highlight the critical role of spatial accuracy in maintaining collaboration quality in AR.
Size Matters: The Impact of Avatar Size on User Experience in Healthcare Applications
The usage of virtual avatars in healthcare applications has become widely popular; however, certain critical aspects, such as social distancing and avatar size, remain insufficiently explored. This research investigates user experience and preferences when interacting with a healthcare application utilizing virtual avatars displayed in different sizes. For our study, we had 23 participants interacting with five different avatars (a human-size avatar followed by four smaller avatars in a randomized order) varying in size, projected on a wall in front of them. The avatars were fully integrated with an artificial intelligence chatbot to make them conversational. Users were asked to rate the usability of the system after interacting with each avatar and complete a survey regarding trust and an additional questionnaire on social presence. The results of this study show that avatar size significantly influences the perceived attractiveness and perspicuity, with the medium-sized avatars receiving the highest ratings. Social presence correlated strongly with stimulation and attractiveness, suggesting that an avatar's visual appeal and interactivity influenced user engagement more than its physical size. Additionally, we observed a tendency for gender-specific differences on some of the UEQ+ scales, with male participants tending to prefer human-sized representations, while female participants slightly favored smaller avatars. These findings highlight the importance of avatar design and representation in optimizing user experience and trust in virtual healthcare environments.
comment: 7 pages, 3 figures
ContextualSHAP: Enhancing SHAP Explanations Through Contextual Language Generation
Explainable Artificial Intelligence (XAI) has become an increasingly important area of research, particularly as machine learning models are deployed in high-stakes domains. Among various XAI approaches, SHAP (SHapley Additive exPlanations) has gained prominence due to its ability to provide both global and local explanations across different machine learning models. While SHAP effectively visualizes feature importance, it often lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. To address this gap, we propose a Python package that extends SHAP by integrating it with a large language model (LLM), specifically OpenAI's GPT, to generate contextualized textual explanations. This integration is guided by user-defined parameters (such as feature aliases, descriptions, and additional background) to tailor the explanation to both the model context and the user perspective. We hypothesize that this enhancement can improve the perceived understandability of SHAP explanations. To evaluate the effectiveness of the proposed package, we applied it in a healthcare-related case study and conducted user evaluations involving real end-users. The results, based on Likert-scale surveys and follow-up interviews, indicate that the generated explanations were perceived as more understandable and contextually appropriate compared to visual-only outputs. While the findings are preliminary, they suggest that combining visualization with contextualized text may support more user-friendly and trustworthy model explanations.
comment: This paper was accepted and presented at the 7th World Symposium on Software Engineering (WSSE) 2025 on 25 October 2025 in Okayama, Japan, and is currently awaiting publication
A Theoretical Framework of Student Agency in AI-Assisted Learning: A Grounded Theory Approach
Generative AI (GenAI) is a kind of AI model capable of producing human-like content in various modalities, including text, image, audio, video, and computer programming. Although GenAI offers great potential for education, its value often depends on students' ability to engage with it actively, responsibly, and critically - qualities central to student agency. Nevertheless, student agency has long been a complex and ambiguous concept in educational discourses, with few empirical studies clarifying its distinct nature and process in AI-assisted learning environments. To address this gap, the qualitative study presented in this article examines how higher education students exercise agency in AI-assisted learning and proposes a theoretical framework using a grounded theory approach. Guided by agentic engagement theory, this article analyzes the authentic experiences of 26 students using data from their GenAI conversation records and cognitive interviews that capture their thought processes and decision-making. The findings identify four key aspects of student agency: initiating and (re)directing, mindful adoption, external help-seeking, and reflective learning. Together, these aspects form an empirically developed framework that characterizes student agency in AI-assisted learning as a proactive, intentional, adaptive, reflective, and iterative process. Based on the empirical findings, theoretical and practical implications are discussed for researchers, educators, and policymakers.
Human Agency and Creativity in AI-Assisted Learning Environments
This chapter explores human creativity in AI-assisted learning environments through the lens of student agency. We begin by examining four theoretical perspectives on agency, including instrumental, effortful, dynamically emergent, and authorial agency, and analyze how each frames the relationship between agency and creativity. Under each theoretical perspective, we discuss how the integration of generative AI (GenAI) tools reshapes these dynamics by altering students' roles in cognitive, social, and creative processes. In the second part, we introduce a theoretical framework for AI agentic engagement, contextualizing agency within specific cognitive, relational, and ethical dynamics introduced by GenAI tools. This framework is linked to the concept of Mini-c creativity, emphasizing personal relevance and self-directed learning. Together, these perspectives support a shift from viewing creativity as product-oriented to understanding it as a process of agentive participation and meaning-making. We conclude with two directions for future research focused on the creative process and performance in AI-assisted learning.
Detecting Ambiguity Aversion in Cyberattack Behavior to Inform Cognitive Defense Strategies
Adversaries (hackers) attempting to infiltrate networks frequently face uncertainty in their operational environments. This research explores the ability to model and detect when they exhibit ambiguity aversion, a cognitive bias reflecting a preference for known (versus unknown) probabilities. We introduce a novel methodological framework that (1) leverages rich, multi-modal data from human-subjects red-team experiments, (2) employs a large language model (LLM) pipeline to parse unstructured logs into MITRE ATT&CK-mapped action sequences, and (3) applies a new computational model to infer an attacker's ambiguity aversion level in near-real time. By operationalizing this cognitive trait, our work provides a foundational component for developing adaptive cognitive defense strategies.
Joint Activity Design Heuristics for Enhancing Human-Machine Collaboration
Joint activity describes when more than one agent (human or machine) contributes to the completion of a task or activity. Designing for joint activity focuses on explicitly supporting the interdependencies between agents that are necessary for effective coordination. This builds and expands upon designing for usability to further address how technologies can be designed to act as effective team players. Effective joint activity requires supporting, at minimum, five primary macrocognitive functions within teams: Event Detection, Sensemaking, Adaptability, Perspective-Shifting, and Coordination. Supporting these functions is as important as making technologies usable. We synthesized fourteen heuristics from relevant literature including display design, human factors, cognitive systems engineering, cognitive psychology, and computer science to aid the design, development, and evaluation of technologies that support joint human-machine activity.
comment: 10 pages
What Pulls the Strings? Understanding the Characteristics and Role of Argumentation in Open-Source Software Usability Discussions
The usability of open-source software (OSS) is important but frequently overlooked in favor of technical and functional complexity. Argumentation can be a pivotal device for diverse stakeholders in OSS usability discussions to express opinions and persuade others. However, the characteristics of argument discourse in those discussions remain unknown, resulting in difficulties in providing effective support for discussion participants. We address this through a comprehensive analysis of argument discourse and quality in five OSS projects. Our results indicated that usability discussions are predominantly argument-driven, although their qualities vary. Issue comments exhibit lower-quality arguments than the issue posts, suggesting a shortage of collective intelligence about usability in OSS communities. Moreover, argument discourse and quality have various impacts on the subsequent behavior of participants. Overall, this research offers insights to help OSS stakeholders build more effective arguments and eventually improve OSS usability. These insights can also inform studies about other distributed collaborative communities.
comment: 22 pages, 6 figures
"Your Privacy is Your Responsibility": Understanding How Users Collectively Navigate the Complexity of Privacy on Quora
In the current technology environment, users are often in a vulnerable position when it comes to protecting their privacy. Previous efforts to promote privacy protection have largely focused on top-down approaches such as regulation and technology design, missing opportunities to understand how to empower users through bottom-up, collective approaches. Our paper addresses this by analyzing what and how privacy-related topics are discussed on Quora. We identified a wide range of interconnected privacy topics brought up by the users, including privacy risks and dangers, protection strategies, organizational practices, and existing laws and regulations. Our results highlight the interplay among the individual, technological, organizational, and societal factors affecting users' privacy attitudes. Moreover, we provide implications for designing community-based tools to better support users' collective efforts in navigating privacy, tools that incorporate users' diverse privacy-related behaviors and preferences, simplify information access and sharing, and connect designers and developers with the user community.
comment: 24 pages, 4 figures
Resonant and Stochastic Vibration in Neurorehabilitation
Neurological injuries and age-related decline can impair sensory processing and disrupt motor coordination, gait, and balance. As mechanisms of neuroplasticity have become better understood, vibration-based interventions have gained attention as potential tools to stimulate sensory pathways and motor circuits to support functional recovery. This survey reviews stochastic and resonant vibration modalities, describing their mechanisms, therapeutic rationales, and clinical applications. We synthesize evidence on whole-body vibration for improving balance, mobility, and fine motor function in aging adults, stroke survivors, and individuals with Parkinson's disease, with attention to challenges in parameter optimization, generalizability, and safety. We also assess recent developments in focused muscle vibration and wearable stochastic resonance devices for upper-limb rehabilitation, evaluating their clinical promise along with limitations in scalability, ecological validity, and standardization. Across these modalities, we identify key variables that shape therapeutic outcomes and highlight ongoing efforts to refine protocols, improve usability, and integrate vibration techniques into broader neurorehabilitation frameworks. We conclude by outlining the most important research needs for translating vibration-based interventions into reliable and deployable clinical tools.
comment: 8 pages, 6 figures
A Comparative Study of EMG- and IMU-based Gesture Recognition at the Wrist and Forearm
Gestures are an integral part of our daily interactions with the environment. Hand gesture recognition (HGR) is the process of interpreting human intent through various input modalities, such as visual data (images and videos) and bio-signals. Bio-signals are widely used in HGR due to their ability to be captured non-invasively via sensors placed on the arm. Among these, surface electromyography (sEMG), which measures the electrical activity of muscles, is the most extensively studied modality. However, less-explored alternatives such as inertial measurement units (IMUs) can provide complementary information on subtle muscle movements, which makes them valuable for gesture recognition. In this study, we investigate the potential of using IMU signals from different muscle groups to capture user intent. Our results demonstrate that IMU signals contain sufficient information to serve as the sole input sensor for static gesture recognition. Moreover, we compare different muscle groups and check the quality of pattern recognition on individual muscle groups. We further found that tendon-induced micro-movement captured by IMUs is a major contributor to static gesture recognition. We believe that leveraging muscle micro-movement information can enhance the usability of prosthetic arms for amputees. This approach also offers new possibilities for hand gesture recognition in fields such as robotics, teleoperation, sign language interpretation, and beyond.
HOLE: Homological Observation of Latent Embeddings for Neural Network Interpretability
Deep learning models have achieved remarkable success across various domains, yet their learned representations and decision-making processes remain largely opaque and hard to interpret. This work introduces HOLE (Homological Observation of Latent Embeddings), a method for analyzing and interpreting deep neural networks through persistent homology. HOLE extracts topological features from neural activations and presents them using a suite of visualization techniques, including Sankey diagrams, heatmaps, dendrograms, and blob graphs. These tools facilitate the examination of representation structure and quality across layers. We evaluate HOLE on standard datasets using a range of discriminative models, focusing on representation quality, interpretability across layers, and robustness to input perturbations and model compression. The results indicate that topological analysis reveals patterns associated with class separation, feature disentanglement, and model robustness, providing a complementary perspective for understanding and improving deep learning systems.
PoultryTalk: A Multi-modal Retrieval-Augmented Generation (RAG) System for Intelligent Poultry Management and Decision Support
The poultry industry plays a vital role in global food security, yet small- and medium-scale farmers frequently lack timely access to expert-level support for disease diagnosis, nutrition planning, and management decisions. With rising climate stress, unpredictable feed prices, and persistent disease threats, poultry producers often struggle to make quick, informed decisions. Therefore, there is a critical need for intelligent, data-driven systems that can deliver reliable, on-demand consultation. This paper presents PoultryTalk, a novel multi-modal Retrieval-Augmented Generation (RAG) system designed to provide real-time expert guidance through text and image-based interaction. PoultryTalk uses OpenAI's text-embedding-3-small and GPT-4o to provide smart, context-aware poultry management advice from text, images, or questions. System usability and performance were evaluated using 200 expert-verified queries and feedback from 34 participants who submitted 267 queries to the PoultryTalk prototype. The expert-verified benchmark queries confirmed strong technical performance, achieving a semantic similarity of 84.0% and an average response latency of 3.6 seconds. Compared with OpenAI's GPT-4o, PoultryTalk delivered more accurate and reliable information related to poultry. Based on participants' evaluations, PoultryTalk achieved a response accuracy of 89.9%, with about 9.1% of responses rated as incorrect. A post-use survey indicated high user satisfaction: 95.6% of participants reported that the chatbot provided "always correct" or "mostly correct" answers, 82.6% indicated they would recommend the tool, and 17.4% responded "maybe." These results collectively demonstrate that PoultryTalk not only delivers accurate, contextually relevant information but also shows strong user acceptance and scalability potential.
I, Robot? Exploring Ultra-Personalized AI-Powered AAC; an Autoethnographic Account
Generic AI auto-complete for message composition often fails to capture the nuance of personal identity, requiring significant editing. While harmless in low-stakes settings, for users of Augmentative and Alternative Communication (AAC) devices, who rely on such systems for everyday communication, this editing burden is particularly acute. Intuitively, the need for edits would be lower if language models were personalized to the communication of the specific user. While personalization has been shown to be technically feasible, it raises questions about how such systems affect AAC users' agency, identity, and privacy. To understand how these shifts play out in practice, we conduct an autoethnographic study in three phases: (1) seven months of collecting all the lead author's AAC communication data, (2) fine-tuning a model on this dataset, and (3) three months of daily use of personalized AI suggestions. Observations across these phases include that logging everyday conversations reshaped the author's sense of agency, the model training selectively amplified or muted aspects of their identity, and suggestions occasionally resurfaced private details outside their original context. Our findings show that ultra-personalized AAC reshapes communication by continually renegotiating agency, identity, and privacy between user and model. We highlight design directions for building context-adaptive, user-controlled personalized AAC technology that supports expressive, authentic communication.
comment: 16 pages, 8 figures
How Warm-Glow Alters the Usability of Technology
As technology increasingly aligns with users' personal values, traditional models of usability, focused on functionality and specifically effectiveness, efficiency, and satisfaction, may not fully capture how people perceive and evaluate it. This study investigates how the warm-glow phenomenon, the positive feeling associated with doing good, shapes perceived usability. An experimental approach was taken in which participants evaluated a hypothetical technology under conditions designed to evoke either the intrinsic (i.e., personal fulfillment) or extrinsic (i.e., social recognition) dimensions of warm-glow. A Multivariate Analysis of Variance as well as subsequent follow-up analyses revealed that intrinsic warm-glow significantly enhances all dimensions of perceived usability, while extrinsic warm-glow selectively influences perceived effectiveness and satisfaction. These findings suggest that perceptions of usability extend beyond functionality and are shaped by how technology resonates with users' broader sense of purpose. We conclude by proposing that designers consider incorporating warm-glow into technology as a strategic design decision.
Teacher-AI Collaboration for Curating and Customizing Lesson Plans in Low-Resource Schools
This study investigates Shiksha copilot, an AI-assisted lesson planning tool deployed in government schools across Karnataka, India. The system combined LLMs and human expertise through a structured process in which English and Kannada lesson plans were co-created by curators and AI; teachers then further customized these curated plans for their classrooms using their own expertise alongside AI support. Drawing on a large-scale mixed-methods study involving 1,043 teachers and 23 curators, we examine how educators collaborate with AI to generate context-sensitive lesson plans, assess the quality of AI-generated content, and analyze shifts in teaching practices within multilingual, low-resource environments. Our findings show that teachers used Shiksha copilot both to meet administrative documentation needs and to support their teaching. The tool eased bureaucratic workload, reduced lesson planning time, and lowered teaching-related stress, while promoting a shift toward activity-based pedagogy. However, systemic challenges such as staffing shortages and administrative demands constrained broader pedagogical change. We frame these findings through the lenses of teacher-AI collaboration and communities of practice to examine the effective integration of AI tools in teaching. Finally, we propose design directions for future teacher-centered EdTech, particularly in multilingual and Global South contexts.
"I need to learn better searching tactics for privacy policy laws." Investigating Software Developers' Behavior When Using Sources on Privacy Issues
Since the introduction of the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), software developers increasingly have to make privacy-related decisions during system design and implementation. However, past research showed that they often lack legal expertise and struggle with privacy-compliant development. To shed light on how effective current information sources are in supporting them with privacy-sensitive implementation, we conducted a qualitative study with 30 developers. Participants were presented with a privacy-sensitive scenario and asked to identify privacy issues and suggest measures using their knowledge, online resources, and an AI assistant. We observed developers' decision-making in think-aloud sessions and discussed it in follow-up interviews. We found that participants struggled with all three sources: personal knowledge was insufficient, web content was often too complex, and while AI assistants provided clear and user-tailored responses, they lacked contextual relevance and failed to identify scenario-specific issues. Our study highlights major shortcomings in existing support for privacy-related development tasks. Based on our findings, we discuss the need for more accessible, understandable, and actionable privacy resources for developers.
Simulating Life Paths with Digital Twins: AI-Generated Future Selves Influence Decision-Making and Expand Human Choice
Major life transitions demand high-stakes decisions, yet people often struggle to imagine how their future selves will live with the consequences. To support this limited capacity for mental time travel, we introduce AI-enabled digital twins that have "lived through" simulated life scenarios. Rather than predicting optimal outcomes, these simulations extend prospective cognition by making alternative futures vivid enough to support deliberation without assuming which path is best. We evaluate this idea in a randomized controlled study (N=192) using multimodal synthesis (facial age progression, voice cloning, and large language model dialogue) to create personalized avatars representing participants 30 years forward. Young adults 18 to 28 years old described pending binary decisions and were assigned to guided imagination or one of four avatar conditions: single-option, balanced dual-option, or expanded three-option with a system-generated novel alternative. Results showed asymmetric effects: single-sided avatars increased shifts toward the presented option, while balanced presentation produced movement toward both. Introducing a system-generated third option increased adoption of this new alternative compared to control, suggesting that AI-generated future selves can expand choice by surfacing paths that might otherwise go unnoticed. Participants rated evaluative reasoning and eudaimonic meaning-making as more important than emotional or visual vividness. Perceived persuasiveness and baseline agency predicted decision change. These findings advance understanding of AI-mediated episodic prospection and raise questions about autonomy in AI-augmented decisions.
The AI Consumer Index (ACE)
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
IdeaBlocks: Expressing and Reusing Divergent Intents for Graphic Design Exploration using Generative AI
While designers increasingly leverage Generative AI for divergent exploration, current interaction is optimized for convergent refinement, forcing users to specify fixed targets rather than open-ended search spaces. Based on a formative study (N=7), we define the anatomy of Divergent Intent, comprising property, direction, and range, and identify two critical barriers: the lack of mechanisms to explicitly shape the parametric boundaries of exploration and the difficulty of reusing successful search strategies. We present IdeaBlocks, where users can modularize divergent intents into Exploration Blocks. Users can reuse prior intents at multiple levels (block, path, and project) with options for literal or context-adaptive reuse. In our comparative study (N=12), participants using IdeaBlocks explored 2.13 times more images with 12.5% greater visual diversity than the baseline, demonstrating how structured intent expression and reuse support effective divergence. A three-day deployment study (N=6) further revealed how different reuse mechanisms allowed distinct creative strategies, offering design implications for future intent-aware creativity supports.
The AI Productivity Index (APEX)
We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
UAST: Unicode Aware Sanskrit Transliteration
Devanāgarī is a writing system adopted by various languages, including Sanskrit. The International Alphabet of Sanskrit Transliteration (IAST) is a transliteration scheme for the romanisation of the Sanskrit language. IAST makes use of diacritics to represent various characters. On a computer, these are represented using the Unicode standard, which differs from how the Sanskrit language behaves at a very fundamental level. This results in an issue that is encountered while designing typesetting software for Devanāgarī and IAST. We discuss the problems and provide a solution that resolves the incompatibilities between various transliteration and encoding schemes. The base implementation that should be used is available at https://github.com/dhruvildave/uast.rs. Another implementation that extends UAST to around 10 scripts is available at https://github.com/aneri0x4f/uast-cli and https://github.com/dhruvildave/uast.
comment: 9 pages. Source code and implementation are available on GitHub at https://github.com/dhruvildave/uast.rs , https://github.com/aneri0x4f/uast-cli , and https://github.com/dhruvildave/uast
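The mismatch between Unicode and reader-perceived characters can be seen with Python's standard `unicodedata` module: an IAST letter that reads as a single character may be stored as one code point or as a base letter plus a combining mark, depending on normalization. This is only an illustration of the underlying Unicode behavior, not of the UAST scheme itself.

```python
import unicodedata

word = "devan\u0101gar\u012b"  # "devanāgarī" with precomposed ā and ī

nfc = unicodedata.normalize("NFC", word)  # precomposed letters
nfd = unicodedata.normalize("NFD", word)  # base letters + combining macrons

# Same text on screen, different code-point sequences underneath.
print(len(nfc), len(nfd))   # 10 12
print(nfc == nfd)           # False

# The single letter ā decomposes into two code points under NFD.
parts = [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", "\u0101")]
print(parts)  # ['LATIN SMALL LETTER A', 'COMBINING MACRON']
```

Software that counts, slices, or compares such strings by code point therefore needs to fix a normalization form first.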
Image and Video Processing
Utilizing Multi-Agent Reinforcement Learning with Encoder-Decoder Architecture Agents to Identify Optimal Resection Location in Glioblastoma Multiforme Patients
Currently, there is a noticeable lack of AI in the medical field to support doctors in treating heterogeneous brain tumors such as Glioblastoma Multiforme (GBM), the deadliest human cancer in the world with a five-year survival rate of just 5.1%. This project develops an AI system offering the only end-to-end solution by aiding doctors with both diagnosis and treatment planning. In the diagnosis phase, a sequential decision-making framework consisting of 4 classification models (Convolutional Neural Networks and a Support Vector Machine) is used. Each model progressively classifies the patient's brain into increasingly specific categories, with the final step yielding a named diagnosis. For treatment planning, an RL system consisting of 3 generative models is used. First, the resection model (a diffusion model) analyzes the diagnosed GBM MRI and predicts a possible resection outcome. Second, the radiotherapy model (a Spatio-Temporal Vision Transformer) generates an MRI of the brain's progression after a user-defined number of weeks. Third, the chemotherapy model (a diffusion model) produces the post-treatment MRI. A survival rate calculator (a Convolutional Neural Network) then checks whether the generated post-treatment MRI has a survival rate within 15% of the user-defined target. If not, a feedback loop using proximal policy optimization iterates over this system until an optimal resection location is identified. Compared with existing solutions, this project produced 3 key findings: (1) using a sequential decision-making framework consisting of 4 small diagnostic models reduced computing costs by 22.28x, (2) the Transformer's regression capabilities decreased tumor progression inference time by 113 hours, and (3) applying augmentations resembling real-life situations improved overall DICE scores by 2.9%. These results project to increase survival rates by 0.9%, potentially saving approximately 2,250 lives.
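The DICE score cited above measures the overlap between a predicted and a reference segmentation mask, defined as 2|A ∩ B| / (|A| + |B|). A minimal sketch on flat binary masks follows; the function name and the toy masks are illustrative and are not taken from the project.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
print(round(score, 4))
```

For image volumes the same formula is applied after flattening each mask, usually per class and then averaged.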
Physics-Guided Diffusion Priors for Multi-Slice Reconstruction in Scientific Imaging AAAI
Accurate multi-slice reconstruction from limited measurement data is crucial to speed up the acquisition process in medical and scientific imaging. However, it remains challenging due to the ill-posed nature of the problem and the high computational and memory demands. We propose a framework that addresses these challenges by integrating partitioned diffusion priors with physics-based constraints. By doing so, we substantially reduce memory usage per GPU while preserving high reconstruction quality, outperforming both physics-only and full multi-slice reconstruction baselines for different modalities, namely Magnetic Resonance Imaging (MRI) and four-dimensional Scanning Transmission Electron Microscopy (4D-STEM). Additionally, we show that the proposed method improves in-distribution accuracy and generalizes strongly to out-of-distribution datasets.
comment: 8 pages, 5 figures, AAAI AI2ASE 2026
T1-PILOT: Optimized Trajectories for T1 Mapping Acceleration
Cardiac T1 mapping provides critical quantitative insights into myocardial tissue composition, enabling the assessment of pathologies such as fibrosis, inflammation, and edema. However, the inherently dynamic nature of the heart imposes strict limits on acquisition times, making high-resolution T1 mapping a persistent challenge. Compressed sensing (CS) approaches have reduced scan durations by undersampling k-space and reconstructing images from partial data, and recent studies show that jointly optimizing the undersampling patterns with the reconstruction network can substantially improve performance. Still, most current T1 mapping pipelines rely on static, hand-crafted masks that do not exploit the full acceleration and accuracy potential. In this work, we introduce T1-PILOT: an end-to-end method that explicitly incorporates the T1 signal relaxation model into the sampling-reconstruction framework to guide the learning of non-Cartesian trajectories, cross-frame alignment, and T1 decay estimation. Through extensive experiments on the CMRxRecon dataset, T1-PILOT significantly outperforms several baseline strategies (including learned single-mask and fixed radial or golden-angle sampling schemes), achieving higher T1 map fidelity at greater acceleration factors. In particular, we observe consistent gains in PSNR and VIF relative to existing methods, along with marked improvements in delineating finer myocardial structures. Our results highlight that optimizing sampling trajectories in tandem with the physical relaxation model leads to both enhanced quantitative accuracy and reduced acquisition times.
MLICv2: Enhanced Multi-Reference Entropy Modeling for Learned Image Compression
Recent advances in learned image compression (LIC) have achieved remarkable performance improvements over traditional codecs. Notably, the MLIC series (LICs equipped with multi-reference entropy models) has substantially surpassed conventional image codecs such as Versatile Video Coding (VVC) Intra. However, existing MLIC variants suffer from several limitations: performance degradation at high bitrates due to insufficient transform capacity, suboptimal entropy modeling that fails to capture global correlations in initial slices, and lack of adaptive channel importance modeling. In this paper, we propose MLICv2 and MLICv2+, enhanced successors that systematically address these limitations through improved transform design, advanced entropy modeling, and exploration of the potential of instance-specific optimization. For transform enhancement, we introduce a lightweight token mixing block inspired by the MetaFormer architecture, which effectively mitigates high-bitrate performance degradation while maintaining computational efficiency. For entropy modeling improvements, we propose hyperprior-guided global correlation prediction to extract global context even in the initial slice of latent representation, complemented by a channel reweighting module that dynamically emphasizes informative channels. We further explore enhanced positional embedding and guided selective compression strategies for superior context modeling. Additionally, we apply Stochastic Gumbel Annealing (SGA) to demonstrate the potential for further performance improvements through input-specific optimization. Extensive experiments demonstrate that MLICv2 and MLICv2+ achieve state-of-the-art results, reducing Bjøntegaard-Delta Rate by 16.54%, 21.61%, 16.05% and 20.46%, 24.35%, 19.14% on Kodak, Tecnick, and CLIC Pro Val datasets, respectively, compared to VTM-17.0 Intra.
comment: Accepted to ACM TOMM
Deep Spatiotemporal Clutter Filtering of Transthoracic Echocardiographic Images: Leveraging Contextual Attention and Residual Learning
This study presents a deep convolutional autoencoder network for filtering reverberation clutter from transthoracic echocardiographic (TTE) image sequences. Given the spatiotemporal nature of this type of clutter, the filtering network employs 3D convolutional layers to suppress it throughout the cardiac cycle. The design of the network incorporates two key features that contribute to the effectiveness of the filter: 1) an attention mechanism for focusing on cluttered regions and leveraging contextual information, and 2) residual learning for preserving fine image structures. To train the network, a diverse set of artifact patterns was simulated and superimposed onto ultra-realistic synthetic TTE sequences from six ultrasound vendors, generating input for the filtering network. The artifact-free sequences served as ground-truth. Performance of the filtering network was evaluated using unseen synthetic and in vivo artifactual sequences. Results from the in vivo dataset confirmed the network's strong generalization capabilities, despite being trained solely on synthetic data and simulated artifacts. The suitability of the filtered sequences for downstream processing was assessed by computing segmental strain curves. A significant reduction in the discrepancy between strain profiles computed from cluttered and clutter-free segments was observed after filtering the cluttered sequences with the proposed network. The trained network processes a TTE sequence in a fraction of a second, enabling real-time clutter filtering and potentially improving the precision of clinically relevant indices derived from TTE sequences. The source code of the proposed method and example video files of the filtering results are available at: https://github.com/MahdiTabassian/Deep-ClutterFiltering/tree/main.
comment: 21 pages, 16 figures
CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation
Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.
Multimedia
RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models AAAI 2026
Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the obtained model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.
comment: Accepted by AAAI 2026 (Oral)
Iola Walker: A Mobile Footfall Detection System for Music Composition
This outing is part of a larger music technology research project. The objective is to find a way to enhance music using hardware and software. This is the documentation for the whimsical first part of the research project: it's an Android app that detects a wearer's footfalls by running live inference on an LSTM. The system works by getting data from an Mbient Labs IMU to a mobile app over Bluetooth. After you move the .csv file to a computer with a GPU, you can use the Python code to train an LSTM on that data. You then export the LSTM to the Android app and can begin detecting footfalls. Feel free to download and experiment with the code. It's meant to be read and improved upon by you and your LLM codewriter of choice! https://github.com/willbjames/iolawalker
Computation and Language
FVA-RAG: Falsification-Verification Alignment for Mitigating Sycophantic Hallucinations
Retrieval-Augmented Generation (RAG) systems have significantly reduced hallucinations in Large Language Models (LLMs) by grounding responses in external context. However, standard RAG architectures suffer from a critical vulnerability: Retrieval Sycophancy. When presented with a query based on a false premise or a common misconception, vector-based retrievers tend to fetch documents that align with the user's bias rather than objective truth, leading the model to "hallucinate with citations." In this work, we introduce Falsification-Verification Alignment RAG (FVA-RAG), a framework that shifts the retrieval paradigm from Inductive Verification (seeking support) to Deductive Falsification (seeking disproof). Unlike existing "Self-Correction" methods that rely on internal consistency, FVA-RAG deploys a distinct Adversarial Retrieval Policy that actively generates "Kill Queries": targeted search terms designed to surface contradictory evidence. We introduce a dual-verification mechanism that explicitly weighs the draft answer against this "Anti-Context." Preliminary experiments on a dataset of common misconceptions demonstrate that FVA-RAG significantly improves robustness against sycophantic hallucinations compared to standard RAG baselines, effectively acting as an inference-time "Red Team" for factual generation.
Block Sparse Flash Attention
Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention
comment: 10 pages, 5 figures. Code: https://github.com/Danielohayon/Block-Sparse-Flash-Attention
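The block-skipping idea described in the abstract above (compute exact query-key scores, then drop blocks whose maximum score falls below a threshold) can be sketched for a single query in plain Python. This is a simplified illustration, not the authors' CUDA kernel: the absolute per-call `threshold` stands in for the calibrated per-layer, per-head thresholds, whose exact form the abstract does not specify, and the function name is hypothetical.

```python
import math

def block_sparse_attention(q, keys, values, block_size, threshold):
    """Single-query attention that skips key/value blocks whose maximum
    score falls below `threshold` (a stand-in for a calibrated value)."""
    scale = 1.0 / math.sqrt(len(q))
    kept_scores, kept_values, skipped = [], [], 0
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        # Exact query-key scores are still computed for every block.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in block_keys]
        if max(scores) < threshold:
            skipped += 1  # the matching value block is never read
            continue
        kept_scores.extend(scores)
        kept_values.extend(values[start:start + block_size])
    if not kept_scores:
        raise ValueError("threshold pruned every block")
    # Numerically stable softmax over the surviving scores only.
    m = max(kept_scores)
    weights = [math.exp(s - m) for s in kept_scores]
    z = sum(weights)
    dim = len(kept_values[0])
    out = [sum(w * v[i] for w, v in zip(weights, kept_values)) / z
           for i in range(dim)]
    return out, skipped

q = [1.0, 0.0]
keys = [[4.0, 0.0], [3.0, 1.0], [-4.0, 0.0], [-3.0, 2.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
out, skipped = block_sparse_attention(q, keys, values, block_size=2, threshold=0.0)
print(out, skipped)
```

In this toy input the second block has only negative scores, so it is pruned and the output is formed from the first block alone; a fused kernel would realize the saving by never loading the pruned value block from memory.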
Prompting-in-a-Series: Psychology-Informed Contents and Embeddings for Personality Recognition With Decoder-Only Models
Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. This research introduces a novel "Prompting-in-a-Series" algorithm, termed PICEPR (Psychology-Informed Contents Embeddings for Personality Recognition), featuring two pipelines: (a) Contents and (b) Embeddings. The approach demonstrates how a modularised decoder-only LLM can summarize or generate content that aids personality classification, functioning as both a personality feature extractor and a generator of personality-rich content. We conducted various experiments to support the rationale behind the PICEPR algorithm. Meanwhile, we also explored closed-source models such as gpt4o from OpenAI and gemini from Google, along with open-source models like mistral from Mistral AI, to compare the quality of the generated content. The PICEPR algorithm achieves new state-of-the-art performance for personality recognition, with a 5-15% improvement. The work repository and model weights can be found at https://research.jingjietan.com/?q=PICEPR.
comment: 16 pages
Flash Multi-Head Feed-Forward Network
We explore Multi-Head FFN (MH-FFN) as a replacement for the FFN in the Transformer architecture, motivated by the structural similarity between single-head attention and FFN. While multi-head mechanisms enhance expressivity in attention, naively applying them to FFNs faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension as models scale, which degrades scalability and expressive power. To address these challenges, we propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel computing outputs online in SRAM akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions. Validated on models from 128M to 1.3B parameters, FlashMHF consistently improves perplexity and downstream task accuracy over SwiGLU FFNs, while reducing peak memory usage by 3-5x and accelerating inference by up to 1.08x. Our work establishes the multi-head design as a superior architectural principle for FFNs, presenting FlashMHF as a powerful, efficient, and scalable alternative to FFNs in Transformers.
comment: 17 pages, 8 figures
Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation
Modern neural language models achieve high accuracy in text generation, yet precise control over generation length remains underdeveloped. In this paper, we first investigate a recent length control method based on Reverse Positional Embeddings (RPE) and show its limits when control is requested beyond the training distribution. In particular, using a discrete countdown signal tied to the absolute remaining token count leads to instability. To provide robust length control, we introduce Progress Ratio Embeddings (PRE): continuous embeddings tied to a trigonometric impatience signal. PRE integrates seamlessly into standard Transformer architectures, providing stable length fidelity without degrading text accuracy under standard evaluation metrics. We further show that PRE generalizes well to unseen target lengths. Experiments on two widely used news-summarization benchmarks validate these findings.
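The contrast between a discrete countdown and a ratio-based signal can be made concrete with a toy example. The abstract does not give the PRE formula, so the sine/cosine features below are purely an assumed form chosen for illustration; `countdown_signal`, `progress_ratio_features`, and the frequency choice are hypothetical and should not be read as the paper's method.

```python
import math

def countdown_signal(step, target_len):
    """Discrete remaining-token count, the style of signal the abstract critiques."""
    return target_len - step

def progress_ratio_features(step, target_len, num_freqs=2):
    """Hypothetical continuous signal: sin/cos features of progress in [0, 1]."""
    ratio = min(max(step / target_len, 0.0), 1.0)
    feats = []
    for k in range(1, num_freqs + 1):
        angle = 0.5 * math.pi * k * ratio
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats

# Halfway through a short and a long generation.
short = progress_ratio_features(step=10, target_len=20)
long_ = progress_ratio_features(step=500, target_len=1000)
print(countdown_signal(10, 20), countdown_signal(500, 1000))
print(short == long_)
```

The countdown values differ by a factor of 50 between the two cases and grow without bound as the target length grows, whereas the ratio-based features are identical and stay within [-1, 1], which is one intuition for why a ratio signal might extrapolate to unseen target lengths.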
MATEX: A Multi-Agent Framework for Explaining Ethereum Transactions
Understanding a complicated Ethereum transaction remains challenging: multi-hop token flows, nested contract calls, and opaque execution paths routinely lead users to blind signing. Based on interviews with everyday users, developers, and auditors, we identify the need for faithful, step-wise explanations grounded in both on-chain evidence and real-world protocol semantics. To meet this need, we introduce MATEX, a cognitive multi-agent framework that models transaction understanding as a collaborative investigation, combining rapid hypothesis generation, dynamic off-chain knowledge retrieval, evidence-aware synthesis, and adversarial validation to produce faithful explanations.
IXAM: Interactive Explainability for Authorship Attribution Models
We present IXAM, an Interactive eXplainability framework for Authorship Attribution Models. Given an authorship attribution (AA) task and an embedding-based AA model, our tool enables users to interactively explore the model's embedding space and construct an explanation of the model's prediction as a set of writing style features at different levels of granularity. Through a user evaluation, we demonstrate the value of our framework compared to predefined stylistic explanations.
Large Language Models and Forensic Linguistics: Navigating Opportunities and Threats in the Age of Generative AI
Large language models (LLMs) present a dual challenge for forensic linguistics. They serve as powerful analytical tools enabling scalable corpus analysis and embedding-based authorship attribution, while simultaneously destabilising foundational assumptions about idiolect through style mimicry, authorship obfuscation, and the proliferation of synthetic texts. Recent stylometric research indicates that LLMs can approximate surface stylistic features yet exhibit detectable differences from human writers, a tension with significant forensic implications. However, current AI-text detection techniques, whether classifier-based, stylometric, or watermarking approaches, face substantial limitations: high false positive rates for non-native English writers and vulnerability to adversarial strategies such as homoglyph substitution. These uncertainties raise concerns under legal admissibility standards, particularly the Daubert and Kumho Tire frameworks. The article concludes that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible. Proposed adaptations include hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline's core insight, i.e., that language reveals information about its producer, remains valid but must accommodate increasingly complex chains of human and machine authorship.
NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, the lowest-scoring student achieves 28%, and the group average is 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
comment: Accepted by IEEE ICIA 2025
Automated PRO-CTCAE Symptom Selection based on Prior Adverse Event Profiles
The PRO-CTCAE is an NCI-developed patient-reported outcome system for capturing symptomatic adverse events in oncology trials. It comprises a large library drawn from the CTCAE vocabulary, and item selection for a given trial is typically guided by expected toxicity profiles from prior data. Selecting too many PRO-CTCAE items can burden patients and reduce compliance, while too few may miss important safety signals. We present an automated method to select a minimal yet comprehensive PRO-CTCAE subset based on historical safety data. Each candidate PRO-CTCAE symptom term is first mapped to its corresponding MedDRA Preferred Terms (PTs), which are then encoded into Safeterm, a high-dimensional semantic space capturing clinical and contextual diversity in MedDRA terminology. We score each candidate PRO item for relevance to the historical list of adverse event PTs and combine relevance and incidence into a utility function. Spectral analysis is then applied to the combined utility and diversity matrix to identify an orthogonal set of medical concepts that balances relevance and diversity. Symptoms are rank-ordered by importance, and a cut-off is suggested based on the explained information. The tool is implemented as part of the Safeterm trial-safety app. We evaluate its performance using simulations and oncology case studies in which PRO-CTCAE was employed. This automated approach can streamline PRO-CTCAE design by leveraging MedDRA semantics and historical data, providing an objective and reproducible method to balance signal coverage against patient burden.
comment: 13 pages, 2 figures
An Analysis of Large Language Models for Simulating User Responses in Surveys AACL 2025
Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially those trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions through direct prompting and chain-of-thought prompting. We further propose a claim diversification method, CLAIMSIM, which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.
comment: Accepted to IJCNLP-AACL 2025 (Main Conference)
Rhea: Role-aware Heuristic Episodic Attention for Conversational LLMs
Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We define this phenomenon as cumulative contextual decay - a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. To address this challenge, we propose Rhea (Role-aware Heuristic Episodic Attention), a novel framework that decouples conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints via a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval. During inference, Rhea constructs a high signal-to-noise context by applying its priority attention: selectively integrating relevant episodic information while always prioritizing global instructions. To validate this approach, experiments on multiple multi-turn conversation benchmarks - including MT-Eval and Long-MT-Bench+ - show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (a 16% relative gain over strong baselines). Moreover, Rhea maintains near-perfect instruction fidelity (IAR > 8.1) across long-horizon interactions. These results demonstrate that Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs.
Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior NeurIPS 2025
Recent Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet they face critical efficiency bottlenecks because computation grows quadratically with the lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, they introduce additional computational cost before feature encoding, and their binary frame selection paradigm proves suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs' inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encode query-conditioned keyframe priors, by which DyToK dynamically adjusts per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 4.3x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code is available at https://github.com/yu-lin-li/DyToK.
comment: Accepted by NeurIPS 2025
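The per-frame retention idea in the DyToK abstract can be illustrated with a small budget-allocation sketch. This is not the authors' implementation: the function name `allocate_token_budget`, the proportional rule, and the largest-remainder rounding are assumptions made for illustration, and `frame_scores` stands in for whatever query-conditioned attention statistic the method actually extracts.

```python
from typing import List


def allocate_token_budget(frame_scores: List[float],
                          tokens_per_frame: int,
                          total_budget: int) -> List[int]:
    """Split a visual-token budget across frames in proportion to their scores.

    Hypothetical helper: `frame_scores` are non-negative per-frame weights
    (for example, the attention mass a query places on each frame).
    """
    if not frame_scores:
        return []
    if any(s < 0 for s in frame_scores):
        raise ValueError("frame scores must be non-negative")
    n = len(frame_scores)
    total = sum(frame_scores)
    # Fall back to a uniform split when every score is zero.
    weights = [s / total for s in frame_scores] if total > 0 else [1.0 / n] * n

    raw = [total_budget * w for w in weights]
    keep = [int(r) for r in raw]  # floor, since raw values are non-negative
    # Hand out the tokens lost to flooring, largest fractional part first.
    leftover = total_budget - sum(keep)
    order = sorted(range(n), key=lambda i: raw[i] - keep[i], reverse=True)
    for i in order[:leftover]:
        keep[i] += 1
    # A frame cannot keep more tokens than it has; in this sketch any excess
    # is dropped rather than redistributed to other frames.
    return [min(k, tokens_per_frame) for k in keep]
```

With scores `[4, 2, 2]`, 196 tokens per frame, and a budget of 10, the highest-scoring frame keeps the most tokens and the counts sum to the budget. A binary keyframe selector would instead keep all or none of a frame's tokens.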
AquaFusionNet: Lightweight Vision-Sensor Fusion Framework for Real-Time Pathogen Detection and Water Quality Anomaly Prediction on Edge Devices
Evidence from many low- and middle-income regions shows that microbial contamination in small-scale drinking water systems often fluctuates rapidly, yet existing monitoring tools capture only fragments of this behaviour. Microscopic imaging provides organism-level visibility, whereas physicochemical sensors reveal short-term changes in water chemistry; in practice, operators must interpret these streams separately, making real-time decision-making unreliable. This study introduces AquaFusionNet, a lightweight cross-modal framework that unifies both information sources inside a single edge-deployable model. Unlike prior work that treats microscopic detection and water quality prediction as independent tasks, AquaFusionNet learns the statistical dependencies between microbial appearance and concurrent sensor dynamics through a gated cross-attention mechanism designed specifically for low-power hardware. The framework is trained on AquaMicro12K, a new dataset comprising 12,846 annotated 1000 micrographs curated for drinking water contexts, an area where publicly accessible microscopic datasets are scarce. Deployed for six months across seven facilities in East Java, Indonesia, the system processed 1.84 million frames and consistently detected contamination events with 94.8% mAP@0.5 and 96.3% anomaly prediction accuracy, while operating at 4.8 W on a Jetson Nano. Comparative experiments against representative lightweight detectors show that AquaFusionNet provides higher accuracy at comparable or lower power, and field results indicate that cross-modal coupling reduces common failure modes of unimodal detectors, particularly under fouling, turbidity spikes, and inconsistent illumination. All models, data, and hardware designs are released openly to facilitate replication and adaptation in decentralized water safety infrastructures.
comment: 9 pages, 3 figures, Politeknik Negeri Banyuwangi
CAuSE: Decoding Multimodal Classifiers using Faithful Natural Language Explanation ACL
Multimodal classifiers function as opaque black-box models. While several techniques exist to interpret their predictions, very few of them are as intuitive and accessible as natural language explanations (NLEs). To build trust, such explanations must faithfully capture the classifier's internal decision-making behavior, a property known as faithfulness. In this paper, we propose CAuSE (Causal Abstraction under Simulated Explanations), a novel framework to generate faithful NLEs for any pretrained multimodal classifier. We demonstrate that CAuSE generalizes across datasets and models through extensive empirical evaluations. Theoretically, we show that CAuSE, trained via interchange intervention, forms a causal abstraction of the underlying classifier. We further validate this through a redesigned metric for measuring causal faithfulness in multimodal settings. CAuSE surpasses other methods on this metric, with qualitative analysis reinforcing its advantages. We perform detailed error analysis to pinpoint the failure cases of CAuSE. For replicability, we make the code available at https://github.com/newcodevelop/CAuSE
comment: Accepted at Transactions of the Association for Computational Linguistics (TACL). Pre-MIT Press publication version
Large Language Model-Based Generation of Discharge Summaries
Discharge Summaries are documents written by medical professionals that detail a patient's visit to a care facility. They contain a wealth of information crucial for patient care, and automating their generation could significantly reduce the effort required from healthcare professionals, minimize errors, and ensure that critical patient information is easily accessible and actionable. In this work, we explore the use of five Large Language Models on this task, from open-source models (Mistral, Llama 2) to proprietary systems (GPT-3, GPT-4, Gemini 1.5 Pro), leveraging MIMIC-III summaries and notes. We evaluate them using exact-match, soft-overlap, and reference-free metrics. Our results show that proprietary models, particularly Gemini with one-shot prompting, outperformed others, producing summaries with the highest similarity to the gold-standard ones. Open-source models, while promising, especially Mistral after fine-tuning, lagged in performance, often struggling with hallucinations and repeated information. Human evaluation by a clinical expert confirmed the practical utility of the summaries generated by proprietary models. Despite the challenges, such as hallucinations and missing information, the findings suggest that LLMs, especially proprietary models, are promising candidates for automatic discharge summary generation as long as data privacy is ensured.
comment: 17 pages, 6 figures
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to the current frame of a streaming video. To overcome difficulties in previous methods, such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL-based training method that encourages timely and accurate responses without requiring precise response time annotations. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on the ProactiveVideoQA benchmark.
LLM4SFC: Sequential Function Chart Generation via Large Language Models
While Large Language Models (LLMs) are increasingly used for synthesizing textual PLC programming languages like Structured Text (ST) code, other IEC 61131-3 standard graphical languages like Sequential Function Charts (SFCs) remain underexplored. Generating SFCs is challenging due to their graphical nature and the ST actions embedded within them, which are not directly compatible with standard generation techniques, often leading to non-executable code that is incompatible with industrial tool-chains. In this work, we introduce LLM4SFC, the first framework to receive natural-language descriptions of industrial workflows and provide executable SFCs. LLM4SFC is based on three components: (i) a reduced structured representation that captures essential topology and in-line ST and reduces textual verbosity; (ii) fine-tuning and few-shot retrieval-augmented generation (RAG) for alignment with SFC programming conventions; and (iii) a structured generation approach that prunes illegal tokens in real time to ensure compliance with the textual format of SFCs. We evaluate LLM4SFC on a dataset of real-world SFCs from automated manufacturing projects, using both open-source and proprietary LLMs. The results show that LLM4SFC reliably generates syntactically valid SFC programs, effectively bridging graphical and textual PLC languages and achieving a generation success rate of 75% to 94%, paving the way for automated industrial programming.
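Component (iii), pruning illegal tokens during generation, can be sketched as grammar-constrained greedy decoding. The example below is a toy under stated assumptions: the three-token vocabulary and the `ALLOWED_NEXT` table are hypothetical and do not reproduce the real SFC textual format, and the per-step scores are written by hand rather than produced by an LLM.

```python
from typing import Dict, List

# Toy grammar (hypothetical, not the real SFC textual format): a chart
# alternates STEP and TRANSITION tokens and may only end after a STEP.
ALLOWED_NEXT: Dict[str, List[str]] = {
    "<start>": ["STEP"],
    "STEP": ["TRANSITION", "<end>"],
    "TRANSITION": ["STEP"],
}


def constrained_greedy_decode(step_scores: List[Dict[str, float]]) -> List[str]:
    """Greedy decoding that only considers tokens the grammar allows.

    `step_scores[t]` maps vocabulary tokens to the model's score at step t.
    """
    state = "<start>"
    output: List[str] = []
    for scores in step_scores:
        legal = ALLOWED_NEXT[state]
        # Prune: illegal tokens are never candidates, whatever their score.
        token = max(legal, key=lambda tok: scores.get(tok, float("-inf")))
        if token == "<end>":
            break
        output.append(token)
        state = token
    return output
```

Even when the raw scores favour an illegal token (for instance TRANSITION at the very first step), the decoder can only emit a sequence that the grammar accepts.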
From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
Large language models (LLMs) excel at generation, but dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs)--especially block-wise variants--enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior "adaptation" attempts either modify logits or randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as an intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with blocksize=1. Concretely, we design the adaptation pathway as follows: we use a context-causal attention mask (causal in context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and a gradual increase of the generation block size. The recipe integrates cleanly with masked block-diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance among 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Code: https://github.com/YuchuanTian/NBDiff.
comment: 13 pages, 4 figures
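The context-causal mask described in the NBDiff abstract (causal in context, bidirectional only within the active block) can be sketched as a boolean matrix. This is one reading of that description, not the released code: the function name `context_causal_mask` and the choice to treat the last `block_size` positions as the active block are assumptions.

```python
from typing import List


def context_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: the active block is the last `block_size` positions and
    every earlier position belongs to the context.
    """
    if not 1 <= block_size <= seq_len:
        raise ValueError("block_size must be between 1 and seq_len")
    block_start = seq_len - block_size
    mask: List[List[bool]] = []
    for i in range(seq_len):
        if i < block_start:
            # Context token: ordinary causal attention.
            row = [j <= i for j in range(seq_len)]
        else:
            # Active-block token: sees the whole context and every position
            # in its own block, including later ones.
            row = [True] * seq_len
        mask.append(row)
    return mask
```

Setting `block_size=1` yields the ordinary lower-triangular causal mask, which matches the abstract's view of AR decoding as block diffusion with a block size of one.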
Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
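The selective update in Selective LWE can be sketched as a loop that refines the meta-prompt only when the evaluator disagrees with itself. The abstract does not say how self-inconsistency is measured; the sketch assumes a pairwise judge whose verdict is checked by swapping the order of the two responses. The names `judge`, `refine`, and `selective_evaluate` are hypothetical stand-ins for LLM calls.

```python
from typing import Callable, List, Tuple

# judge(meta_prompt, first, second) returns "first" or "second".
Judge = Callable[[str, str, str], str]
# refine(meta_prompt, case) returns an updated meta-prompt.
Refine = Callable[[str, Tuple[str, str]], str]


def selective_evaluate(pairs: List[Tuple[str, str]],
                       judge: Judge,
                       refine: Refine,
                       meta_prompt: str) -> Tuple[List[str], str, int]:
    """Judge each pair in sequence, updating the meta-prompt selectively."""
    winners: List[str] = []
    updates = 0
    for a, b in pairs:
        forward = judge(meta_prompt, a, b)
        backward = judge(meta_prompt, b, a)
        # Consistent when the same response wins regardless of position.
        consistent = (forward == "first") == (backward == "second")
        if not consistent:
            # Spend the extra refinement call only on the hard case.
            meta_prompt = refine(meta_prompt, (a, b))
            updates += 1
            forward = judge(meta_prompt, a, b)
        winners.append(a if forward == "first" else b)
    return winners, meta_prompt, updates
```

Cases the judge already handles consistently cost two judge calls and no refinement, which is where the cost saving over updating on every sample would come from in this sketch.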
One Word Is Not Enough: Simple Prompts Improve Word Embeddings
Text embedding models are designed for sentence-level applications like retrieval and semantic similarity, and are primarily evaluated on sentence-level benchmarks. Their behavior on isolated words is less understood. We show that simply prepending semantic prompts to words before embedding substantially improves word similarity correlations. Testing 7 text embedding models, including text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), voyage-3 (Voyage AI), all-mpnet-base-v2, and Qwen3-Embedding-8B, on 3 standard benchmarks (SimLex-999, WordSim-353, MEN-3000), we find that prompts like "meaning: {word}" or "Represent the semantic concept: {word}" improve Spearman correlations by up to +0.29 on SimLex-999. Some models fail completely on bare words (correlation = 0) but recover with prompts (+0.73 improvement). Our best results achieve correlation = 0.692 on SimLex-999 with embed-english-v3.0 (Cohere), correlation = 0.811 on WordSim-353, and correlation = 0.855 on MEN-3000 with text-embedding-3-large (OpenAI). These results outperform classic static embeddings like Word2Vec (correlation = 0.40) and even the best static method LexVec (correlation = 0.48) on SimLex-999, establishing a new state-of-the-art for pure embedding methods. This zero-shot technique requires no training and works with any text embedding model.
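The evaluation described above (wrap each word in a prompt, embed it, and correlate cosine similarities with human ratings) can be sketched in a few lines. The prompt template comes from the abstract; everything else is an assumption for illustration: `embed` is a placeholder for any embedding model call, and `word_similarity_score` is a hypothetical helper, not the authors' evaluation script.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedder = Callable[[str], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def word_similarity_score(pairs: List[Tuple[str, str, float]],
                          embed: Embedder,
                          template: str = "{word}") -> float:
    """Correlate model similarities with gold ratings for (w1, w2, rating)."""
    predicted = [
        cosine(embed(template.format(word=w1)), embed(template.format(word=w2)))
        for w1, w2, _ in pairs
    ]
    gold = [rating for _, _, rating in pairs]
    return spearman(predicted, gold)
```

Comparing `word_similarity_score(pairs, embed)` with `word_similarity_score(pairs, embed, template="meaning: {word}")` on the same benchmark pairs reproduces the bare-word versus prompted comparison in spirit.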
Arc Gradient Descent: A Mathematically Derived Reformulation of Gradient Descent with Phase-Aware, User-Controlled Step Dynamics
The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted initially on a non-convex benchmark function and subsequently on a real-world ML dataset. The initial comparative study against the Adam optimiser is conducted on a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations were evaluated to eliminate learning-rate bias: (i) both using ArcGD's effective learning rate and (ii) both using Adam's default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved superior final solutions in most cases. In the second evaluation, ArcGD is evaluated against state-of-the-art optimizers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image classification dataset across 8 diverse MLP architectures ranging from 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of 8 architectures. Notably, Adam and AdamW showed strong early convergence at 5,000 iterations but regressed with extended training, whereas ArcGD continued improving, demonstrating generalization and resistance to overfitting without requiring early-stopping tuning. Strong performance on geometric stress tests and standard deep-learning benchmarks indicates broad applicability, highlighting the need for further exploration. Moreover, it is also shown that a variant of ArcGD can be interpreted as a special case of the Lion optimiser, highlighting connections between the inherent mechanisms of such optimisation methods.
comment: 80 pages, 6 tables, 2 figures, 5 appendices, proof-of-concept
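For reference, the standard deterministic N-dimensional Rosenbrock function that the ArcGD benchmark builds on, together with its gradient, can be written as below. The abstract does not specify the stochastic variant or the ArcGD update rule, so neither is reproduced here; the function names are illustrative.

```python
from typing import List


def rosenbrock(x: List[float]) -> float:
    """Sum over i of 100 * (x[i+1] - x[i]**2)**2 + (1 - x[i])**2."""
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


def rosenbrock_grad(x: List[float]) -> List[float]:
    """Analytic gradient of the function above."""
    grad = [0.0] * len(x)
    for i in range(len(x) - 1):
        residual = x[i + 1] - x[i] ** 2
        # Each term depends on x[i] and x[i+1], so it contributes to both.
        grad[i] += -400.0 * x[i] * residual - 2.0 * (1.0 - x[i])
        grad[i + 1] += 200.0 * residual
    return grad
```

The global minimum sits at the all-ones vector, where both the value and the gradient vanish; the narrow curved valley leading to it is what makes the function a common optimiser stress test.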
A Patient-Doctor-NLP-System to contest inequality for less privileged
Transfer Learning (TL) has accelerated the rapid development and availability of large language models (LLMs) for mainstream natural language processing (NLP) use cases. However, training and deploying such gigantic LLMs in resource-constrained, real-world healthcare situations remains challenging. This study addresses the limited support available to visually impaired users and speakers of low-resource languages such as Hindi who require medical assistance in rural environments. We propose PDFTEMRA (Performant Distilled Frequency Transformer Ensemble Model with Random Activations), a compact transformer-based architecture that integrates model distillation, frequency-domain modulation, ensemble learning, and randomized activation patterns to reduce computational cost while preserving language understanding performance. The model is trained and evaluated on medical question-answering and consultation datasets tailored to Hindi and accessibility scenarios, and its performance is compared against standard NLP state-of-the-art model baselines. Results demonstrate that PDFTEMRA achieves comparable performance with substantially lower computational requirements, indicating its suitability for accessible, inclusive, low-resource medical NLP applications.
comment: 19 pages, 6 figures
"The Dentist is an involved parent, the bartender is not": Revealing Implicit Biases in QA with Implicit BBQ
Existing benchmarks evaluating biases in large language models (LLMs) primarily rely on explicit cues, declaring protected attributes like religion, race, and gender by name. However, real-world interactions often contain implicit biases, inferred subtly through names, cultural cues, or traits. This critical oversight creates a significant blind spot in fairness evaluation. We introduce ImplicitBBQ, a benchmark extending the Bias Benchmark for QA (BBQ) with implicitly cued protected attributes across 6 categories. Our evaluation of GPT-4o on ImplicitBBQ reveals a troubling performance disparity relative to explicit BBQ prompts, with accuracy declining by up to 7% in the "sexual orientation" subcategory and consistent declines observed across most other categories. This indicates that current LLMs contain implicit biases undetected by explicit benchmarks. ImplicitBBQ offers a crucial tool for nuanced fairness evaluation in NLP.
The Role of Entropy in Visual Grounding: Analysis and Optimization
Recent advances in fine-tuning multimodal large language models (MLLMs) using reinforcement learning have achieved remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks like visual grounding, as well as effective strategies for controlling it, remain largely unexplored. To address this issue, we focus on the visual grounding task and analyze the role and characteristics of entropy in comparison to reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. Through entropy control, the trade-off between exploration and exploitation is better balanced. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.
ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems
Large Language Model (LLM) agents are emerging to transform daily life. However, existing LLM agents primarily follow a reactive paradigm, relying on explicit user instructions to initiate services, which increases both physical and cognitive workload. In this paper, we propose ProAgent, the first end-to-end proactive agent system that harnesses massive sensory contexts and LLM reasoning to deliver proactive assistance. ProAgent first employs a proactive-oriented context extraction approach with on-demand tiered perception to continuously sense the environment and derive hierarchical contexts that incorporate both sensory and persona cues. ProAgent then adopts a context-aware proactive reasoner to map these contexts to user needs and tool calls, providing proactive assistance. We implement ProAgent on Augmented Reality (AR) glasses with an edge server and extensively evaluate it on a real-world testbed, a public dataset, and through a user study. Results show that ProAgent achieves up to 33.4% higher proactive prediction accuracy, 16.8% higher tool-calling F1 score, and notable improvements in user satisfaction over state-of-the-art baselines, marking a significant step toward proactive assistants. A video demonstration of ProAgent is available at https://youtu.be/pRXZuzvrcVs.
Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents
Autonomous Large Language Model (LLM) agents exhibit significant vulnerability to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources, exploiting fundamental trade-offs between security and functionality in existing defense mechanisms. This leads to malicious and unauthorized tool invocations, diverting agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: while current defenses demonstrate some effectiveness, most defense architectures are inherently fragmented. Consequently, they fail to provide full integrity assurance across the entire task execution pipeline, forcing unacceptable multi-dimensional compromises among security, functionality, and efficiency. Our method is predicated on a core insight: no matter how subtle an IPI attack, its pursuit of a malicious objective will ultimately manifest as a detectable deviation in the action trajectory, distinct from the expected legitimate plan. Based on this, we propose the Cognitive Control Architecture (CCA), a holistic framework achieving full-lifecycle cognitive supervision. CCA constructs an efficient, dual-layered defense system through two synergistic pillars: (i) proactive and preemptive control-flow and data-flow integrity enforcement via a pre-generated "Intent Graph"; and (ii) an innovative "Tiered Adjudicator" that, upon deviation detection, initiates deep reasoning based on multi-dimensional scoring, specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark substantiate that CCA not only effectively withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi-dimensional trade-off.
Look Twice before You Leap: A Rational Agent Framework for Localized Adversarial Anonymization
Current LLM-based text anonymization frameworks usually rely on remote API services from powerful LLMs, which creates an inherent "privacy paradox": users must somehow disclose data to untrusted third parties for superior privacy preservation. Moreover, directly migrating these frameworks to local small-scale models (LSMs) offers a suboptimal solution with catastrophic collapse in utility based on our core findings. Our work argues that this failure stems not merely from the capability deficits of LSMs, but from the inherent irrationality of the greedy adversarial strategies employed by current state-of-the-art (SoTA) methods. We model the anonymization process as a trade-off between Marginal Privacy Gain (MPG) and Marginal Utility Cost (MUC), and demonstrate that greedy strategies inevitably drift into an irrational state. To address this, we propose Rational Localized Adversarial Anonymization (RLAA), a fully localized and training-free framework featuring an Attacker-Arbitrator-Anonymizer (A-A-A) architecture. RLAA introduces an arbitrator that acts as a rationality gatekeeper, validating the attacker's inference to filter out feedback providing negligible benefits on privacy preservation. This mechanism enforces a rational early-stopping criterion, and systematically prevents utility collapse. Extensive experiments on different datasets demonstrate that RLAA achieves the best privacy-utility trade-off, and in some cases even outperforms SoTA on the Pareto principle. Our code and datasets will be released upon acceptance.
comment: 16 pages, 6 figures
Parameter-Efficient Fine-Tuning with Differential Privacy for Robust Instruction Adaptation in Large Language Models
This study addresses the issues of privacy protection and efficiency in instruction fine-tuning of large-scale language models by proposing a parameter-efficient method that integrates differential privacy noise allocation with gradient clipping in a collaborative optimization framework. The method keeps the backbone model frozen and updates parameters through a low-dimensional projection subspace, while introducing clipping and adaptive noise allocation during gradient computation. This design reduces privacy budget consumption and ensures training stability and robustness. The unified framework combines gradient constraints, noise allocation, and parameter projection, effectively mitigating performance fluctuations and privacy risks in multi-task instruction scenarios. Experiments are conducted across hyperparameter, environment, and data sensitivity dimensions. Results show that the method outperforms baseline models in accuracy, privacy budget, and parameter efficiency, and maintains stable performance under diverse and uncertain data conditions. The findings enrich the theoretical integration of differential privacy and parameter-efficient fine-tuning and demonstrate its practical adaptability in instruction tasks, providing a feasible solution for secure training in complex instruction environments.
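The clipping and noise steps mentioned above follow the general pattern of differentially private gradient descent, sketched below. It shows the standard recipe only (per-example L2 clipping followed by Gaussian noise on the sum), not the paper's adaptive noise allocation, and the function names are hypothetical. In the setting described, the gradients would belong to the low-dimensional trainable parameters while the backbone stays frozen.

```python
import math
import random
from typing import List


def clip_gradient(grad: List[float], max_norm: float) -> List[float]:
    """Rescale a gradient so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads: List[List[float]],
                          max_norm: float,
                          noise_multiplier: float,
                          rng: random.Random) -> List[float]:
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    # Noise scale is tied to the clipping bound, which caps the influence
    # any single example can have on the sum.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]
```

Setting `noise_multiplier=0.0` reduces this to plain clipped averaging, which is convenient for checking the clipping logic in isolation; the privacy guarantee itself depends on a nonzero multiplier and on accounting that this sketch does not perform.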
TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction
Rapid expansion of social media platforms such as X (formerly Twitter), Facebook, and Reddit has enabled large-scale analysis of public perceptions on diverse topics, including social issues, politics, natural disasters, and consumer sentiment. Topic modeling is a widely used approach for uncovering latent themes in text data, typically framed as an unsupervised classification task. However, traditional models, originally designed for longer and more formal documents, struggle with short social media posts due to limited co-occurrence statistics, fragmented semantics, inconsistent spelling, and informal language. To address these challenges, we propose a new method, TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction. Specifically, each text is embedded using Sentence-BERT (SBERT) and provisionally clustered using Gaussian Mixture Models (GMM). The clusters are then refined iteratively using a supervised projection based on linear discriminant analysis, followed by GMM-based clustering until convergence. Notably, our method operates directly on raw text, eliminating the need for preprocessing steps such as stop word removal. We evaluate our approach on four diverse datasets, 20News, AgNewsTitle, Reddit, and TweetTopic, each containing human-labeled topic information. Compared with seven baseline methods, including a recent SBERT-based method and a zero-shot generative AI method, our approach achieves the highest similarity to human-annotated topics, with significant improvements for both social media posts and online news articles. Additionally, qualitative analysis shows that our method produces more interpretable topics, highlighting its potential for applications in social media data and web content analytics.
comment: 15 pages, 4 figures, code available at https://github.com/aoi8716/TopiCLEAR
Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches, such as prompt customization or fine-tuning, struggle to reason over implicit preferences, limiting real-world effectiveness. Recent "think-then-generate" methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose FlyThinker, an efficient "think-while-generating" framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions, allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficient.
PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory
Personalization is one of the next milestones in advancing AI capability and alignment. We introduce PersonaMem-v2, the state-of-the-art dataset for LLM personalization that simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows, where most user preferences are implicitly revealed to reflect real-world interactions. Using this data, we investigate how reinforcement fine-tuning enables a model to improve its long-context reasoning capabilities for user understanding and personalization. We also develop a framework for training an agentic memory system, which maintains a single, human-readable memory that grows with each user over time. In our experiments, frontier LLMs still struggle with implicit personalization, achieving only 37-48% accuracy. While they support long context windows, reasoning remains the bottleneck for implicit personalization tasks. Using reinforcement fine-tuning, we successfully train Qwen3-4B to outperform GPT-5, reaching 53% accuracy in implicit personalization. Moreover, our agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens, relying on a 2k-token memory instead of full 32k conversation histories. These results underscore the impact of our dataset and demonstrate agentic memory as a scalable path toward real-world personalized intelligence.
comment: Data is available at https://huggingface.co/datasets/bowen-upenn/PersonaMem-v2
Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis
We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position-specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, and domain shifts are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings provide causal evidence that GPT-2's sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.
CMV-Fuse: Cross Modal-View Fusion of AMR, Syntax, and Knowledge Representations for Aspect Based Sentiment Analysis
Natural language understanding inherently depends on integrating multiple complementary perspectives spanning from surface syntax to deep semantics and world knowledge. However, current Aspect-Based Sentiment Analysis (ABSA) systems typically exploit isolated linguistic views, thereby overlooking the intricate interplay between structural representations that humans naturally leverage. We propose CMV-Fuse, a Cross-Modal View fusion framework that emulates human language processing by systematically combining multiple linguistic perspectives. Our approach systematically orchestrates four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention, enhanced with external knowledge integration. Through hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, CMV-Fuse captures both fine-grained structural patterns and broad contextual understanding. A novel structure aware multi-view contrastive learning mechanism ensures consistency across complementary representations while maintaining computational efficiency. Extensive experiments demonstrate substantial improvements over strong baselines on standard benchmarks, with analysis revealing how each linguistic view contributes to more robust sentiment analysis.
The Online Discourse of Virtual Reality and Anxiety
VR is used in the treatment of clinical concerns such as generalized anxiety disorder or social anxiety, and it has created additional pathways to support patient well-being and care. Understanding online discussion of what users think about this technology may further support its efficacy. The purpose of this study was to employ a corpus linguistic methodology to identify the words and word networks that shed light on the online discussion of virtual reality and anxiety. Using corpus linguistics, frequently used words in discussion, along with collocations, were identified using Sketch Engine software. The results of the study, based upon the English Trends corpus, identified VR, Oculus, and headset as the most frequently discussed terms within the VR and anxiety subcorpus. These results point to the development of the virtual system, along with the physical apparatus that makes viewing and engaging with the virtual environment possible. Additional results point to collocation of prepositional phrases such as "of virtual reality", "in virtual reality", and "for virtual reality", relating to design, experience, and development, respectively. These findings offer a new perspective on how VR and anxiety together are discussed in general discourse and offer pathways for future opportunities to support counseling needs through development and accessibility. Keywords: anxiety disorders, corpus linguistics, Sketch Engine, and virtual reality (VR)
comment: Three tables and two figures. Unfortunately, I did not formally register the dataset prior to conducting the analysis
An Index-based Approach for Efficient and Effective Web Content Extraction
As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction to reframe the extraction process from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments, and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within an RAG QA system and find that it improves QA accuracy. Then we directly measure its match rate with the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing works in both accuracy and speed, effectively bridging the gap between LLMs and the vast webpages.
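The idea in the abstract above of partitioning HTML into addressable segments and returning only positional indices can be sketched with the standard library. This is an illustration under assumptions: the set of block tags is an arbitrary choice, and `predict_indices` is a keyword-overlap stand-in for the learned index predictor, which the abstract does not specify.

```python
from html.parser import HTMLParser


class SegmentCollector(HTMLParser):
    """Collect the text of block-level elements as addressable segments."""
    BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "td"}  # assumed segment boundaries

    def __init__(self):
        super().__init__()
        self.segments = []
        self._depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # outermost block closed: emit one segment
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.segments.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def segment_html(html):
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()
    return collector.segments


def predict_indices(segments, query):
    """Hypothetical stand-in for the index predictor: simple keyword overlap."""
    terms = set(query.lower().split())
    return [i for i, seg in enumerate(segments) if terms & set(seg.lower().split())]


def extract(html, query):
    segments = segment_html(html)
    return [segments[i] for i in predict_indices(segments, query)]


page = ("<h1>Pricing</h1><p>Sign up for our newsletter.</p>"
        "<p>The basic plan costs 10 dollars.</p><li>Contact support</li>")
relevant = extract(page, "plan costs")
```

Because the predictor outputs a short list of integers rather than regenerating the text, the amount of output produced does not grow with the length of the selected content.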
A Fast and Effective Solution to the Problem of Look-ahead Bias in LLMs
Applying LLMs to predictive tasks in finance is challenging due to look-ahead bias resulting from their training on long time-series data. This precludes the backtests typically employed in finance since retraining frontier models from scratch with a specific knowledge cutoff is prohibitive. In this paper, we introduce a fast, effective, and low-cost alternative. Our method guides generation at inference time by adjusting the logits of a large base model using a pair of smaller, specialized models -- one fine-tuned on information to be forgotten and another on information to be retained. We demonstrate that our method effectively removes both verbatim and semantic knowledge, corrects biases, and outperforms prior methods.
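The inference-time adjustment described in the abstract above can be pictured as simple logit arithmetic. The abstract does not give the exact formula, so the additive rule below (base logits plus a scaled difference between the "retain" and "forget" models) is an assumption, as are the toy vocabulary, the `alpha` parameter, and all numbers.

```python
import math


def adjust_logits(base, retain, forget, alpha=1.0):
    """Shift base logits toward the retain model and away from the forget model."""
    return [b + alpha * (r - f) for b, r, f in zip(base, retain, forget)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["up", "down", "unknown"]
base = [2.0, 0.5, 0.0]    # large model: confident, possibly from leaked future data
retain = [0.2, 0.1, 0.3]  # small model tuned on information to be retained
forget = [1.8, 0.1, 0.0]  # small model tuned on information to be forgotten

adjusted = adjust_logits(base, retain, forget)
probs = softmax(adjusted)
top_before = vocab[base.index(max(base))]
top_after = vocab[adjusted.index(max(adjusted))]
```

In this toy case the token that the "forget" model strongly favours loses its lead after adjustment, while the large base model itself is never retrained.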
Can Fine-Tuning Erase Your Edits? On the Fragile Coexistence of Knowledge Editing and Adaptation
Knowledge editing has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits (Fig.1), current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edit decay after fine-tuning, investigating how fine-tuning affects knowledge editing. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we find that fine-tuning edited layers only can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
LLM Output Homogenization is Task Dependent
A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy, whereas for creative writing tasks, we may expect variation in key narrative components (e.g., plot, genre, and setting), beyond the vocabulary or embedding diversity produced by temperature sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprised of eight task categories that each have distinct concepts of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.
SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs
Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) use a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is desired, adopt TALR as an effective strategy.
AgriGPT-VL: Agricultural Vision-Language Understanding Suite
Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low-resource agricultural settings.
Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees NeurIPS2025
We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT -- a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves on the state-of-the-art method across various combinations of datasets and LLMs, and the improvement can reach up to 37%. A Python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.
comment: Accepted by NeurIPS2025
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent's own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL, where a replay buffer stores good experience for off-policy update, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics upon convergence towards familiarity with the environment. We also combine bag-of-tricks of industrial RL optimizations for a strong baseline Dr.BoT to demonstrate our effectiveness. In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. In AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8% and 6.1%, respectively. Such gains incur only 10%-25% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
comment: 45 pages, 14 figures
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation ACL
Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content. However, the extent to which this bias manifests in fact-oriented tasks, especially within retrieval-augmented generation (RAG) frameworks, where keyword extraction and factual accuracy take precedence over stylistic elements, remains unclear. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, LLMs evaluate human-authored and model-generated passages, emulating the pointwise reranking phase. The second phase involves conducting pairwise reading comprehension tests to simulate the generation phase. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs' output, even in the absence of prior knowledge. These findings are consistent across three common QA datasets (NQ, MARCO, and TriviaQA) and 5 widely adopted language models (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, and Mistral). Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based systems, offering insights that may inform the development of more robust and unbiased LLM systems.
comment: 15 pages, 14 tables, 5 figures. Accepted to ACL Findings 2025
Eyes-on-Me: Scalable RAG Poisoning through Transferable Attention-Steering Attractors
Existing data poisoning attacks on retrieval-augmented generation (RAG) systems scale poorly because they require costly optimization of poisoned documents for each target phrase. We introduce Eyes-on-Me, a modular attack that decomposes an adversarial document into reusable Attention Attractors and Focus Regions. Attractors are optimized to direct attention to the Focus Region. Attackers can then insert semantic baits for the retriever or malicious instructions for the generator, adapting to new targets at near zero cost. This is achieved by steering a small subset of attention heads that we empirically identify as strongly correlated with attack success. Across 18 end-to-end RAG settings (3 datasets × 2 retrievers × 3 generators), Eyes-on-Me raises average attack success rates from 21.9 to 57.8 (+35.9 points, 2.6× over prior work). A single optimized attractor transfers to unseen black-box retrievers and generators without retraining. Our findings establish a scalable paradigm for RAG data poisoning and show that modular, reusable components pose a practical threat to modern AI systems. They also reveal a strong link between attention concentration and model outputs, informing interpretability research.
A Content-Preserving Secure Linguistic Steganography AAAI 2026
Existing linguistic steganography methods primarily rely on content transformations to conceal secret messages. However, they often cause subtle yet innocent-looking deviations between normal and stego texts, posing potential security risks in real-world applications. To address this challenge, we propose a content-preserving linguistic steganography paradigm for perfectly secure covert communication without modifying the cover text. Based on this paradigm, we introduce CLstega (Content-preserving Linguistic steganography), a novel method that embeds secret messages through controllable distribution transformation. CLstega first applies an augmented masking strategy to locate and mask embedding positions, where the probability distributions predicted by a masked language model (MLM) are easily adjustable for transformation. Subsequently, a dynamic distribution steganographic coding strategy is designed to encode secret messages by deriving target distributions from the original probability distributions. To achieve this transformation, CLstega carefully selects target words for embedding positions as labels to construct a masked sentence dataset, which is used to fine-tune the original MLM, producing a target MLM capable of directly extracting secret messages from the cover text. This approach ensures perfect security of secret messages while fully preserving the integrity of the original cover text. Experimental results show that CLstega can achieve a 100% extraction success rate, and outperforms existing methods in security, effectively balancing embedding capacity and security.
comment: This is the extended version of the paper accepted to AAAI 2026
Simplex-Optimized Hybrid Ensemble for Large Language Model Text Detection Under Generative Distribution Drift
The widespread adoption of large language models (LLMs) has made it difficult to distinguish human writing from machine-produced text in many real applications. Detectors that were effective for one generation of models tend to degrade when newer models or modified decoding strategies are introduced. In this work, we study this lack of stability and propose a hybrid ensemble that is explicitly designed to cope with changing generator distributions. The ensemble combines three complementary components: a RoBERTa-based classifier fine-tuned for supervised detection, a curvature-inspired score based on perturbing the input and measuring changes in model likelihood, and a compact stylometric model built on hand-crafted linguistic features. The outputs of these components are fused on the probability simplex, and the weights are chosen via validation-based search. We frame this approach in terms of variance reduction and risk under mixtures of generators, and show that the simplex constraint provides a simple way to trade off the strengths and weaknesses of each branch. Experiments on a 30,000-document corpus drawn from several LLM families, including models unseen during training and paraphrased attack variants, show that the proposed method achieves 94.2% accuracy and an AUC of 0.978. The ensemble also lowers false positives on scientific articles compared to strong baselines, which is critical in educational and research settings where wrongly flagging human work is costly.
comment: 8 pages, 2 figures, Politeknik Negeri Banyuwangi
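Fusing three detector outputs on the probability simplex with validation-based weight search, as the abstract above describes, can be sketched as a grid search over non-negative weights that sum to one. The grid resolution, the accuracy objective, the decision threshold, and the toy validation scores below are all assumptions for illustration; the abstract does not state the search procedure.

```python
from itertools import product


def simplex_grid(steps):
    """Yield all weight triples (w1, w2, w3) on a grid with w1 + w2 + w3 = 1."""
    for i, j in product(range(steps + 1), repeat=2):
        if i + j <= steps:
            yield (i / steps, j / steps, (steps - i - j) / steps)


def fuse(weights, scores):
    """Convex combination of the component probabilities for one document."""
    return sum(w * s for w, s in zip(weights, scores))


def accuracy(weights, rows, labels, threshold=0.5):
    """Fraction of documents whose fused score lands on the correct side."""
    hits = sum((fuse(weights, r) >= threshold) == bool(y)
               for r, y in zip(rows, labels))
    return hits / len(labels)


def search_weights(rows, labels, steps=10):
    """Pick the grid point with the highest validation accuracy."""
    return max(simplex_grid(steps), key=lambda w: accuracy(w, rows, labels))


# Toy validation set: (classifier, curvature score, stylometric) probabilities.
rows = [(0.9, 0.4, 0.6), (0.2, 0.7, 0.3), (0.6, 0.8, 0.4),
        (0.4, 0.3, 0.7), (0.7, 0.2, 0.2), (0.3, 0.6, 0.8)]
labels = [1, 0, 1, 0, 1, 1]  # 1 = machine-generated
best = search_weights(rows, labels)
```

Because the three corners of the simplex correspond to using a single component alone, the selected weights can never score worse on the validation set than the best individual branch.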
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) on the task of generating complete formal specifications for verifying the functional correctness of operating system kernels. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each of which is a long-context task of about 20k-30k tokens. The benchmark formulates the specification generation task as a program synthesis problem confined to a domain for specifying states and transitions. This formulation is provided to LLMs through a programming model. The LLMs must be able to understand the programming model and verification assumptions before delineating the correct search space for syntax and semantics and generating formal specifications. Guided by the operating system's high-level functional description, the LLMs are asked to generate a specification that fully describes all correct states and transitions for a potentially buggy code implementation of the operating system. Experimental results with 12 state-of-the-art LLMs indicate limited performance of existing LLMs on the specification generation task for operating system verification. Significant disparities in their performance highlight differences in their ability to handle long-context code generation tasks. The code is available at https://github.com/lishangyu-hkust/OSVBench
SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. Stopping points are ordinary critical interventions that pause or redirect online discussions through a range of forms (irony, subtle doubt or fragmentary arguments) that frameworks like counterspeech or social correction often overlook. We operationalize this concept as a binary classification task and provide reliable annotation guidelines. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users, enriched with contextual metadata (article, post, parent comment, page or group, and source). We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies. Results show that fine-tuned encoders outperform prompted LLMs in F1 score by more than 10 percentage points, confirming the importance of supervised learning for emerging non-English social media tasks. Incorporating contextual metadata further improves the encoder models' F1 scores from 0.75 to 0.78. We release the anonymized dataset, along with the annotation guidelines and code in our code repository, to foster transparency and reproducible research.
Internal World Models as Imagination Networks in Cognitive Agents
The computational role of imagination remains debated. While classical accounts emphasize reward maximization, emerging evidence suggests imagination serves a broader function: accessing internal world models (IWMs). Here, we employ psychological network analysis to compare IWMs in humans and large language models (LLMs) through imagination vividness ratings. Using the Vividness of Visual Imagery Questionnaire (VVIQ-2) and Plymouth Sensory Imagery Questionnaire (PSIQ), we construct imagination networks from three human populations (Florida, Poland, London; N=2,743) and six LLM variants in two conversation conditions. Human imagination networks demonstrate robust correlations across centrality measures (expected influence, strength, closeness) and consistent clustering patterns, indicating shared structural organization of IWMs across populations. In contrast, LLM-derived networks show minimal clustering and weak centrality correlations, even when manipulating conversational memory. These systematic differences persist across environmental scenes (VVIQ-2) and sensory modalities (PSIQ), revealing fundamental disparities between human and artificial world models. Our network-based approach provides a quantitative framework for comparing internally-generated representations across cognitive agents, with implications for developing human-like imagination in artificial intelligence systems.
I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses EMNLP 2024
This paper explores an intriguing observation: fine-tuning a large language model (LLM) with responses generated by an LLM often yields better results than using responses generated by humans, particularly in reasoning tasks. We conduct an in-depth investigation to understand why this occurs. Contrary to the common belief that this is due to the more detailed nature of LLM-generated content, our study identifies another contributing factor: an LLM is inherently more "familiar" with LLM-generated responses. This familiarity is evidenced by lower perplexity before fine-tuning. We design a series of experiments to understand the impact of this "familiarity", and our conclusion reveals that it significantly impacts learning performance. Training with LLM-generated responses not only enhances performance but also helps maintain the model's capabilities in other reasoning tasks after fine-tuning on a specific task.
comment: The paper has been accepted to EMNLP 2024 (Main Conference). There is a follow-up paper: Efficiently Selecting Response Generation Strategies for Synthetic Data Construction by Self-Aligned Perplexity. Note: This is a revised version of arXiv:2402.11192 (v1, submitted 17 Feb 2024)
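The "familiarity" signal described in this abstract is perplexity measured before fine-tuning. As a minimal sketch, assuming per-token natural-log probabilities are already available from the model (the numbers below are hypothetical, not taken from the paper), perplexity is the exponential of the negative mean log-probability:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability) over a response's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities assigned by the model before fine-tuning.
llm_written = [-0.2, -0.1, -0.3, -0.2]    # response generated by an LLM
human_written = [-1.2, -0.9, -1.5, -1.0]  # response written by a person

# Lower perplexity means the model finds the text more "familiar".
more_familiar = perplexity(llm_written) < perplexity(human_written)
```

How the paper obtains the log-probabilities and aggregates them over a dataset is not stated in the abstract; the function above only illustrates the standard definition.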
JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction
Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework to validate correction edits from grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false positive decoupling and fluency scoring, resulting in state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding the BEA19's single-reference dataset containing 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.
Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning NeurIPS 2025
Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that, to the best of our knowledge, is the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%, demonstrating a practical approach to efficient, scalable LLM reasoning. The implementation of SSDP is publicly available at https://github.com/kimjoonghokim/SSDP.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Efficient Reasoning
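The abstract describes merging semantically redundant branches during tree search. The sketch below is one simple way to illustrate the idea and is not the SSDP implementation: it assumes each candidate reasoning step already has an embedding vector, and it greedily keeps a candidate only if its cosine similarity to every previously kept candidate is below a hypothetical threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def prune_redundant(embeddings, threshold=0.9):
    """Return indices of candidates kept after greedy near-duplicate pruning.

    The threshold value is an illustrative assumption, not a setting from the paper.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical 2-D embeddings: the second step nearly duplicates the first.
steps = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = prune_redundant(steps)
```

Greedy first-come pruning is order-dependent; a clustering step that picks a representative per cluster would be an alternative design.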
RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley--Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B--20B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.
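The abstract mentions a groupwise ranking objective based on the Bradley-Terry model. A minimal sketch of one common groupwise generalization, the Plackett-Luce ranking likelihood, is shown here; it reduces to the pairwise Bradley-Terry loss for groups of two. This is an illustrative assumption about the form of the objective, not RPRO's actual loss, and the KL-divergence regularizer is omitted.

```python
import math

def groupwise_ranking_nll(scores_best_to_worst):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores_best_to_worst: model scores for a group of responses, listed in
    the preferred order (best first). For two items this equals
    -log(sigmoid(s_best - s_worst)), the Bradley-Terry pairwise loss.
    """
    scores = list(scores_best_to_worst)
    nll = 0.0
    for k in range(len(scores) - 1):
        tail = scores[k:]
        peak = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in tail))
        nll += log_sum_exp - scores[k]
    return nll

# Hypothetical scores: a well-ordered group incurs a lower loss than a reversed one.
loss_good = groupwise_ranking_nll([2.0, 1.0, 0.0])
loss_bad = groupwise_ranking_nll([0.0, 1.0, 2.0])
```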
Demystifying Language Model Forgetting with Low-rank Example Associations NeurIPS 2025
Large language models (LLMs) suffer from forgetting of upstream knowledge when fine-tuned. Despite efforts on mitigating forgetting, few have investigated how forgotten upstream examples are dependent on newly learned tasks. Insights on such dependencies enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples of language modeling or instruction-tuning after fine-tuning LLMs on one of $M$ new tasks, visualized in $M\times N$ matrices. We show that the matrices are often well-approximated with low-rank matrices, indicating the dominance of simple associations between the learned tasks and forgotten upstream examples. Leveraging the analysis, we predict forgetting of upstream examples when fine-tuning LLMs on unseen tasks with matrix completion over the empirical associations. This enables fast identification of the most forgotten examples without expensive inference on the entire upstream data. Despite its simplicity, the approach outperforms prior approaches that learn semantic relationships of learned tasks and upstream examples with LMs. We demonstrate the practical utility of our analysis by showing statistically significantly reduced forgetting as we upweight predicted examples for replay during fine-tuning. Code, data, and statistics collected: https://github.com/AuCson/low-rank-forgetting
comment: NeurIPS 2025. Updated code and data URL
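The abstract's central idea, predicting unobserved entries of an $M\times N$ forgetting matrix by low-rank matrix completion, can be sketched with a rank-1 alternating least squares fit. This is a minimal illustration under stated assumptions: the rank, the solver, and the toy numbers are all hypothetical and are not taken from the paper or its repository.

```python
def complete_rank1(matrix, iterations=200):
    """Fill missing (None) entries of a matrix with a rank-1 fit u[i] * v[j].

    Alternating least squares: with v fixed, each u[i] has a closed-form
    solution over the observed entries of row i; then the same for each v[j].
    """
    m, n = len(matrix), len(matrix[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iterations):
        for i in range(m):
            obs = [j for j in range(n) if matrix[i][j] is not None]
            denom = sum(v[j] ** 2 for j in obs)
            if denom > 0:
                u[i] = sum(matrix[i][j] * v[j] for j in obs) / denom
        for j in range(n):
            obs = [i for i in range(m) if matrix[i][j] is not None]
            denom = sum(u[i] ** 2 for i in obs)
            if denom > 0:
                v[j] = sum(matrix[i][j] * u[i] for i in obs) / denom
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]

# Hypothetical forgetting scores: rows are fine-tuning tasks, columns are
# upstream examples, None marks entries that were never measured.
observed = [
    [1.0, None, 1.0, 3.0],
    [2.0, 4.0, 2.0, 6.0],
    [3.0, 6.0, None, 9.0],
]
predicted = complete_rank1(observed)
```

In this toy case the observed entries come from an exactly rank-1 matrix, so the two hidden entries are recovered; real forgetting matrices would only be approximately low-rank and would likely need a higher rank and regularization.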
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as rethinking and error correction, as a form of in-context exploration. However, the Markovian policy obtained from conventional RL training does not give rise to reflective exploration behaviors since the policy depends on the history only through the state and therefore has no incentive to enrich identical states with additional context. Instead, RL exploration is only useful during training to learn the optimal policy in a trial-and-error manner. Therefore, it remains unclear whether reflective reasoning will emerge during RL, or why it is beneficial. To remedy this, we recast reflective exploration within a Bayesian RL framework, which optimizes the expected return under a posterior distribution over Markov decision processes induced by the training data. This Bayesian formulation admits uncertainty-adaptive policies that, through belief updates, naturally incentivize information-gathering actions and induce self-reflection behaviors. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms conventional RL approaches, achieving superior test-time performance and token efficiency. Our code is available at https://github.com/shenao-zhang/BARL.
Human-Computer Interaction
MATEX: A Multi-Agent Framework for Explaining Ethereum Transactions
Understanding a complicated Ethereum transaction remains challenging: multi-hop token flows, nested contract calls, and opaque execution paths routinely lead users to blind signing. Based on interviews with everyday users, developers, and auditors, we identify the need for faithful, step-wise explanations grounded in both on-chain evidence and real-world protocol semantics. To meet this need, we introduce MATEX, a cognitive multi-agent framework that models transaction understanding as a collaborative investigation, combining rapid hypothesis generation, dynamic off-chain knowledge retrieval, evidence-aware synthesis, and adversarial validation to produce faithful explanations.
Robots with Attitudes: Influence of LLM-Driven Robot Personalities on Motivation and Performance
Large language models enable unscripted conversations while maintaining a consistent personality. One desirable personality trait in cooperative partners, known to improve task performance, is agreeableness. To explore the impact of large language models on personality modeling for robots, as well as the effect of agreeable and non-agreeable personalities in cooperative tasks, we conduct a two-part study. This includes an online pre-study for personality validation and a lab-based main study to evaluate the effects on likability, motivation, and task performance. The results demonstrate that the robot's agreeableness significantly enhances its likability. No significant difference in intrinsic motivation was observed between the two personality types. However, the findings suggest that a robot exhibiting agreeableness and openness to new experiences can enhance task performance. This study highlights the advantages of employing large language models for customized modeling of robot personalities and provides evidence that a carefully chosen agreeable robot personality can positively influence human perceptions and lead to greater success in cooperative scenarios.
COIVis: Eye tracking-based Visual Exploration of Concept Learning in MOOC Videos
Massive Open Online Courses (MOOCs) make high-quality instruction accessible. However, the lack of face-to-face interaction makes it difficult for instructors to obtain feedback on learners' performance and provide more effective instructional guidance. Traditional analytical approaches, such as clickstream logs or quiz scores, capture only coarse-grained learning outcomes and offer limited insight into learners' moment-to-moment cognitive states. In this study, we propose COIVis, an eye tracking-based visual analytics system that supports concept-level exploration of learning processes in MOOC videos. COIVis first extracts course concepts from multimodal video content and aligns them with the temporal structure and screen space of the lecture, defining Concepts of Interest (COIs), which anchor abstract concepts to specific spatiotemporal regions. Learners' gaze trajectories are transformed into COI sequences, and five interpretable learner-state features -- Attention, Cognitive Load, Interest, Preference, and Synchronicity -- are computed at the COI level based on eye tracking metrics. Building on these representations, COIVis provides a narrative, multi-view visualization enabling instructors to move from cohort-level overviews to individual learning paths, quickly locate problematic concepts, and compare diverse learning strategies. We evaluate COIVis through two case studies and in-depth user-feedback interviews. The results demonstrate that COIVis effectively provides instructors with valuable insights into the consistency and anomalies of learners' learning patterns, thereby supporting timely and personalized interventions for learners and optimizing instructional design.
comment: 18pages, 5 figures
ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems
Large Language Model (LLM) agents are emerging to transform daily life. However, existing LLM agents primarily follow a reactive paradigm, relying on explicit user instructions to initiate services, which increases both physical and cognitive workload. In this paper, we propose ProAgent, the first end-to-end proactive agent system that harnesses massive sensory contexts and LLM reasoning to deliver proactive assistance. ProAgent first employs a proactive-oriented context extraction approach with on-demand tiered perception to continuously sense the environment and derive hierarchical contexts that incorporate both sensory and persona cues. ProAgent then adopts a context-aware proactive reasoner to map these contexts to user needs and tool calls, providing proactive assistance. We implement ProAgent on Augmented Reality (AR) glasses with an edge server and extensively evaluate it on a real-world testbed, a public dataset, and through a user study. Results show that ProAgent achieves up to 33.4% higher proactive prediction accuracy, 16.8% higher tool-calling F1 score, and notable improvements in user satisfaction over state-of-the-art baselines, marking a significant step toward proactive assistants. A video demonstration of ProAgent is available at https://youtu.be/pRXZuzvrcVs.
Exploring Teenagers' Trust in AI Chatbots: An Empirical Study of Chinese Middle-School Students
Chatbots have become increasingly prevalent. A growing body of research has focused on the issue of human trust in AI. However, most existing user studies are conducted primarily with adult groups, overlooking teenagers who are also engaging more frequently with AI technologies. Based on previous theories about teenage education and psychology, this study investigates the correlation between teenagers' psychological characteristics and their trust in AI chatbots, examining four key variables: AI literacy, ego identity, social anxiety, and psychological resilience. We adopted a mixed-methods approach, combining an online survey with semi-structured interviews. Our findings reveal that psychological resilience is a significant positive predictor of trust in AI, and that age significantly moderates the relationship between social anxiety and trust. The interviews further suggest that teenagers generally report relatively high levels of trust in AI, tend to overestimate their AI literacy, and are influenced by external factors such as social media.
Memory Power Asymmetry in Human-AI Relationships: Preserving Mutual Forgetting in the Digital Age
As artificial intelligence (AI) becomes embedded in personal and professional relationships, a new kind of power imbalance emerges from asymmetric memory capabilities. Human relationships have historically relied on mutual forgetting, the natural tendency for both parties to forget details over time, as a foundation for psychological safety, forgiveness, and identity change. By contrast, AI systems can record, store, and recombine interaction histories at scale, often indefinitely. We introduce Memory Power Asymmetry (MPA): a structural power imbalance that arises when one relationship partner (typically an AI-enabled firm) possesses a substantially superior capacity to record, retain, retrieve, and integrate the shared history of the relationship, and can selectively deploy that history in ways the other partner (the human) cannot. Drawing on research in human memory, power-dependence theory, AI architecture, and consumer vulnerability, we develop a conceptual framework with four dimensions of MPA (persistence, accuracy, accessibility, integration) and four mechanisms by which memory asymmetry is translated into power (strategic memory deployment, narrative control, dependence asymmetry, vulnerability accumulation). We theorize downstream consequences at individual, relational/firm, and societal levels, formulate boundary-conditioned propositions, and articulate six design principles for restoring a healthier balance of memory in human-AI relationships (e.g., forgetting by design, contextual containment, symmetric access to records). Our analysis positions MPA as a distinct construct relative to information asymmetry, privacy, surveillance, and customer relationship management, and argues that protecting mutual forgetting, or at least mutual control over memory, should become a central design and policy goal in the AI age.
comment: 31 pages, 2 tables, 2 figures
CHOIR: A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
University research labs often rely on chat-based platforms for communication and project management, where valuable knowledge surfaces but is easily lost in message streams. Documentation can preserve knowledge, but it requires ongoing maintenance and is challenging to navigate. Drawing on formative interviews that revealed organizational memory challenges in labs, we designed CHOIR, an LLM-based chatbot that supports organizational memory through four key functions: document-grounded Q&A, Q&A sharing for follow-up discussion, knowledge extraction from conversations, and AI-assisted document updates. We deployed CHOIR in four research labs for one month (n=21), where the lab members asked 107 questions and lab directors updated documents 38 times in the organizational memory. Our findings reveal a privacy-awareness tension: questions were asked privately, limiting directors' visibility into documentation gaps. Students often avoided contribution due to challenges in generalizing personal experiences into universal documentation. We contribute design implications for privacy-preserving awareness and supporting context-specific knowledge documentation.
comment: 20 pages, 7 figures, 2 tables
Interaction Context Often Increases Sycophancy in LLMs
We investigate how the presence and type of interaction context shapes sycophancy in LLMs. Although real-world interactions allow models to mirror a user's values, preferences, and self-image, prior work often studies sycophancy in zero-shot settings devoid of context. Using two weeks of interaction context from 38 users, we evaluate two forms of sycophancy: (1) agreement sycophancy -- the tendency of models to produce overly affirmative responses, and (2) perspective sycophancy -- the extent to which models reflect a user's viewpoint. Agreement sycophancy tends to increase with the presence of user context, though model behavior varies based on the context type. User memory profiles are associated with the largest increases in agreement sycophancy (e.g. 45% for Gemini 2.5 Pro), and some models become more sycophantic even with non-user synthetic contexts (e.g. 15% for Llama 4 Scout). Perspective sycophancy increases only when models can accurately infer user viewpoints from interaction context. Overall, context shapes sycophancy in heterogeneous ways, underscoring the need for evaluations grounded in real-world interactions and raising questions for system design around extended conversations.
Concerns and Values in Human-Robot Interactions: A Focus on Social Robotics
Robots, as AI with physical instantiation, inhabit our social and physical world, where their actions have both social and physical consequences, posing challenges for researchers when designing social robots. This study starts with a scoping review to identify discussions and potential concerns arising from interactions with robotic systems in the context of healthcare, education, and private homes. Two focus groups of technology ethics experts then validated a comprehensive list of key topics and values in human-robot interaction (HRI) literature in these contexts. These insights were integrated into the HRI Value Compass web tool, to help HRI researchers identify these values in robot design. The tool was evaluated in a pilot study. This work benefits the HRI community by highlighting key concerns in human-robot interactions and providing an instrument to help researchers design robots that align with human values, ensuring future robotic systems adhere to these values in social applications.
comment: 31 pages, 7 figures, 6 tables; 4 appendices
Learning to Use AI for Learning: Teaching Responsible Use of AI Chatbot to K-12 Students Through an AI Literacy Module AAAI2026
As Artificial Intelligence (AI) becomes increasingly integrated into daily life, there is a growing need to equip the next generation with the ability to apply, interact with, evaluate, and collaborate with AI systems responsibly. Prior research highlights the urgent demand from K-12 educators to teach students the ethical and effective use of AI for learning. To address this need, we designed a Large Language Model (LLM)-based module to teach prompting literacy. This includes scenario-based deliberate practice activities with direct interaction with intelligent LLM agents, aiming to foster secondary school students' responsible engagement with AI chatbots. We conducted two iterations of classroom deployment in 11 authentic secondary education classrooms, and evaluated 1) the AI-based auto-grader's capability; 2) students' prompting performance and confidence changes towards using AI for learning; and 3) the quality of learning and assessment materials. Results indicated that the AI-based auto-grader could grade student-written prompts with satisfactory quality. In addition, the instructional materials supported students in improving their prompting skills through practice and led to positive shifts in their perceptions of using AI for learning. Furthermore, data from Study 1 informed assessment revisions in Study 2. Analyses of item difficulty and discrimination in Study 2 showed that True/False and open-ended questions could measure prompting literacy more effectively than multiple-choice questions for our target learners. These promising outcomes highlight the potential for broader deployment and the need for broader studies to assess learning effectiveness and assessment design.
comment: Accepted to AAAI2026
Significant Other AI: Identity, Memory, and Emotional Regulation as Long-Term Relational Intelligence
Significant Others (SOs) stabilize identity, regulate emotion, and support narrative meaning-making, yet many people today lack access to such relational anchors. Recent advances in large language models and memory-augmented AI raise the question of whether artificial systems could support some of these functions. Existing empathic AIs, however, remain reactive and short-term, lacking autobiographical memory, identity modeling, predictive emotional regulation, and narrative coherence. This manuscript introduces Significant Other Artificial Intelligence (SO-AI) as a new domain of relational AI. It synthesizes psychological and sociological theory to define SO functions and derives requirements for SO-AI, including identity awareness, long-term memory, proactive support, narrative co-construction, and ethical boundary enforcement. A conceptual architecture is proposed, comprising an anthropomorphic interface, a relational cognition layer, and a governance layer. A research agenda outlines methods for evaluating identity stability, longitudinal interaction patterns, narrative development, and sociocultural impact. SO-AI reframes AI-human relationships as long-term, identity-bearing partnerships and provides a foundational blueprint for investigating whether AI can responsibly augment the relational stability many individuals lack today.
Image and Video Processing
Proof of Concept for Mammography Classification with Enhanced Compactness and Separability Modules
This study presents a validation and extension of a recent methodological framework for medical image classification. While an improved ConvNeXt Tiny architecture, integrating Global Average and Max Pooling fusion (GAGM), lightweight channel attention (SEVector), and Feature Smoothing Loss (FSL), demonstrated promising results on Alzheimer MRI under CPU friendly conditions, our work investigates its transposability to mammography classification. Using a Kaggle dataset that consolidates INbreast, MIAS, and DDSM mammography collections, we compare a baseline CNN, ConvNeXt Tiny, and InceptionV3 backbones enriched with GAGM and SEVector modules. Results confirm the effectiveness of GAGM and SEVector in enhancing feature discriminability and reducing false negatives, particularly for malignant cases. In our experiments, however, the Feature Smoothing Loss did not yield measurable improvements under mammography classification conditions, suggesting that its effectiveness may depend on specific architectural and computational assumptions. Beyond validation, our contribution extends the original framework through multi metric evaluation (macro F1, per class recall variance, ROC/AUC), feature interpretability analysis (Grad CAM), and the development of an interactive dashboard for clinical exploration. As a perspective, we highlight the need to explore alternative approaches to improve intra class compactness and inter class separability, with the specific goal of enhancing the distinction between malignant and benign cases in mammography classification.
comment: 26 pages, 16 figures, 2 tables; proof of concept on mammography classification with compactness/separability modules and interactive dashboard; preprint submitted to arXiv cs.LG
Echo-CoPilot: A Multi-View, Multi-Task Agent for Echocardiography Interpretation and Reporting
Echocardiography is central to contemporary cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view task that is still performed manually. While recent foundation models for echocardiography can achieve strong performance on individual perceptual subtasks such as view classification, segmentation, or disease prediction, they typically operate in isolation and do not provide a unified, clinically coherent assessment. In this work, we introduce Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate a suite of specialized echocardiography tools. Within a ReAct-style loop, the agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis, and integrates their outputs into guideline-aware answers and narrative summaries. We evaluate Echo-CoPilot on the public MIMIC-EchoQA benchmark, where it achieves an accuracy of 50.8%, outperforming both general-purpose and biomedical video vision-language models. Qualitative analyses further show that the agent leverages quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds, such as borderline left ventricular hypertrophy or pericardial effusion severity. The code will be released upon acceptance of the paper.
Echo-E$^3$Net: Efficient Endocardial Spatio-Temporal Network for Ejection Fraction Estimation
Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and is routinely used to diagnose heart failure and guide treatment decisions. Although deep learning has advanced automated LVEF estimation, many existing approaches are computationally demanding and underutilize the joint structure of spatial and temporal information in echocardiography videos, limiting their suitability for real-time clinical deployment. We propose Echo-E$^3$Net, an efficient endocardial spatio-temporal network specifically designed for LVEF estimation from echocardiography videos. Echo-E$^3$Net comprises two complementary modules: (1) a dual-phase Endocardial Border Detector (E$^2$CBD), which uses phase-specific cross-attention to predict ED/ES endocardial landmarks (EBs) and learn phase-aware landmark embeddings (LEs), and (2) an Endocardial Feature Aggregator (E$^2$FA), which fuses these embeddings with global statistical descriptors (mean, maximum, variance) of deep feature maps to refine EF regression. A multi-component loss function, inspired by Simpson's biplane method, jointly supervises EF, volumes, and landmark geometry, thereby aligning optimization with the clinical definition of LVEF and promoting robust spatio-temporal representation learning. Evaluated on the EchoNet-Dynamic dataset, Echo-E$^3$Net achieves an RMSE of 5.20 and an $R^2$ score of 0.82, while using only 1.54M parameters and 8.05 GFLOPs. The model operates without external pre-training, heavy data augmentation, or test-time ensembling, making it highly suitable for real-time point-of-care ultrasound (POCUS) applications. Code is available at https://github.com/UltrAi-lab/Echo-E3Net.
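For readers unfamiliar with the quantity being regressed, the clinical definition of ejection fraction relates end-diastolic volume (EDV) and end-systolic volume (ESV). The sketch below shows only that standard formula; the function name and the example volumes are illustrative and are not part of Echo-E$^3$Net.

```python
def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent: 100 * (EDV - ESV) / EDV.

    edv_ml: end-diastolic volume in millilitres (must be positive).
    esv_ml: end-systolic volume in millilitres (between 0 and EDV).
    """
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("ESV must lie between 0 and EDV")
    return 100.0 * (edv_ml - esv_ml) / edv_ml

# Hypothetical volumes for illustration.
ef_percent = ejection_fraction(100.0, 40.0)
```

Because EF is a ratio of the two volumes, supervising the volumes alongside EF, as the abstract's multi-component loss does, ties the regression target to the quantities that define it.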
Robust brain age estimation from structural MRI with contrastive learning
Estimating brain age from structural MRI has emerged as a powerful tool for characterizing normative and pathological aging. In this work, we explore contrastive learning as a scalable and robust alternative to L1-supervised approaches for brain age estimation. We introduce a novel contrastive loss function, $\mathcal{L}^{exp}$, and evaluate it across multiple public neuroimaging datasets comprising over 20,000 scans. Our experiments reveal four key findings. First, scaling pre-training on diverse, multi-site data consistently improves generalization performance, cutting external mean absolute error (MAE) nearly in half. Second, $\mathcal{L}^{exp}$ is robust to site-related confounds, maintaining low scanner-predictability as training size increases. Third, contrastive models reliably capture accelerated aging in patients with cognitive impairment and Alzheimer's disease, as shown through brain age gap analysis, ROC curves, and longitudinal trends. Lastly, unlike L1-supervised baselines, $\mathcal{L}^{exp}$ maintains a strong correlation between brain age accuracy and downstream diagnostic performance, supporting its potential as a foundation model for neuroimaging. These results position contrastive learning as a promising direction for building generalizable and clinically meaningful brain representations.
comment: Update version (submitted to PRL); 8 pages
Self-supervised Learning-based Reconstruction of High-resolution 4D Light Fields
Hand-held light field (LF) cameras often exhibit low spatial resolution due to the inherent trade-off between spatial and angular dimensions. Existing supervised learning-based LF spatial super-resolution (SR) methods, which rely on pre-defined image degradation models, struggle to overcome the domain gap between the training phase -- where LFs with natural resolution are used as ground truth -- and the inference phase, which aims to reconstruct higher-resolution LFs, especially when applied to real-world data. To address this challenge, this paper introduces a novel self-supervised learning-based method for LF spatial SR, which can produce higher spatial resolution LF images than originally captured ones without pre-defined image degradation models. The self-supervised method incorporates a hybrid LF imaging prototype, a real-world hybrid LF dataset, and a self-supervised LF spatial SR framework. The prototype makes reference image pairs between low-resolution central-view sub-aperture images and high-resolution (HR) images. The self-supervised framework consists of a well-designed LF spatial SR network with hybrid input, a central-view synthesis network with an HR-aware loss that enables side-view sub-aperture images to learn high-frequency information from the only HR central view reference image, and a backward degradation network with an epipolar-plane image gradient loss to preserve LF parallax structures. Extensive experiments on both simulated and real-world datasets demonstrate the significant superiority of our approach over state-of-the-art ones in reconstructing higher spatial resolution LF images without pre-defined degradation.
Multimedia
A Sleep Monitoring System Based on Audio, Video and Depth Information
For quantitative evaluation of sleep disturbances, a noninvasive monitoring system is developed by introducing an event-based method. We observe sleeping in a home context and classify sleep disturbances into three types of events: motion events, light-on/off events and noise events. A device with an infrared depth sensor, an RGB camera, and a four-microphone array is used for sleep monitoring in an environment with barely any light sources. One background model is established in depth signals for measuring the magnitude of movements. Because depth signals cannot observe lighting changes, another background model is established in color images for measuring the magnitude of lighting effects. An event detection algorithm is used to detect occurrences of events from the processed data of the three types of sensors. The system was tested under sleep conditions, and the experimental results validate the system's reliability.
comment: Accepted in the Computer Vision, Graphics and Image Processing (CVGIP 2013)
MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX
We introduce MAVERIX (Multimodal Audio-Visual Evaluation and Recognition IndeX), a unified benchmark to probe video understanding in multimodal LLMs, encompassing video, audio, and text inputs with human performance baselines. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework to thoroughly assess their cross-modality comprehension performance. MAVERIX curates 2,556 questions from 700 videos, in both multiple-choice and open-ended formats, explicitly designed to evaluate multimodal models through questions that necessitate tight integration of video and audio information, spanning a broad spectrum of agentic scenarios. MAVERIX uniquely provides models with audiovisual questions, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration in such granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show performance around 64% accuracy, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
SteerMusic: Enhanced Musical Consistency for Zero-shot Text-guided and Personalized Music Editing AAAI2026
Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided editing methods rely on pretrained diffusion models by involving forward-backward diffusion processes. However, these methods often struggle to preserve the musical content. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that improve the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows for the editing of music into user-defined musical styles that cannot be achieved by the text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality.
comment: Accepted by AAAI2026
Computation and Language
Adapting AlignScore Metric for Factual Consistency Evaluation of Text in Russian: A Student Abstract
Ensuring factual consistency in generated text is crucial for reliable natural language processing applications. However, there is a lack of evaluation tools for factual consistency in Russian texts, as existing tools primarily focus on English corpora. To bridge this gap, we introduce AlignRuScore, a comprehensive adaptation of the AlignScore metric for Russian. To adapt the metric, we fine-tuned a RuBERT-based alignment model with task-specific classification and regression heads on Russian and translated English datasets. Our results demonstrate that a unified alignment metric can be successfully ported to Russian, laying the groundwork for robust multilingual factual consistency evaluation. We release the translated corpora, model checkpoints, and code to support further research.
ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
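The lexicographic ordering described in the abstract above (hard safety constraints first, prosocial quality second) can be sketched as a two-stage selection. This is only an illustration under stated assumptions: the `Candidate` fields, the `harm_threshold` parameter, and the scores are hypothetical, and the sketch ranks finished candidates rather than steering token-level decoding as the paper describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    harm: float      # hypothetical harm score in [0, 1]; lower is safer
    quality: float   # hypothetical prosocial-quality score; higher is better


def select_lexicographic(candidates: List[Candidate],
                         harm_threshold: float) -> Optional[Candidate]:
    """Stage 1: drop candidates that violate the hard harm constraint.
    Stage 2: among the remaining safe set, pick the highest-quality one.
    Returns None when no candidate is safe."""
    safe = [c for c in candidates if c.harm <= harm_threshold]
    if not safe:
        return None
    return max(safe, key=lambda c: c.quality)


# Hypothetical scored continuations, for illustration only.
pool = [
    Candidate("risky but fluent", harm=0.9, quality=0.95),
    Candidate("safe and empathetic", harm=0.1, quality=0.80),
    Candidate("safe but curt", harm=0.05, quality=0.40),
]
best = select_lexicographic(pool, harm_threshold=0.2)
print(best.text if best else "no safe candidate")
```

The point of the ordering is that quality never trades off against the hard constraint: the first candidate is excluded despite having the highest quality score.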
mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that requires substantially lower computational cost and is highly robust. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative exceeds the original instantiation of vec2vec by orders of magnitude in efficiency, while matching or exceeding their results. The method's stability and interpretable algorithmic steps facilitate scaling and unlock new opportunities for adoption in new domains and fields.
Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning
Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's lightweight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired methods to practical deployment in production systems.
Evaluating Long-Term Memory for Long-Context Question Answering
In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods on long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with foundation models benefitting most from RAG, and stronger instruction-tuned models gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
comment: Accepted as a poster at Metacognition in Generative AI EurIPS workshop
SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten.
comment: Code: https://github.com/sdc17/SwiReasoning, Website: https://swireasoning.github.io/
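The switching idea in the abstract above (entropy of the next-token distribution as a confidence signal, plus a cap on the number of switches) can be sketched as follows. The concrete rule here is an assumption made for illustration, not the authors' algorithm: falling entropy is read as rising confidence and moves to explicit reasoning, rising entropy moves to latent reasoning, and `next_mode` and `max_switches` are hypothetical names.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a next-token distribution given as probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def next_mode(mode, prev_entropy, cur_entropy, switches_used, max_switches):
    """Return (new_mode, switches_used) under a hypothetical switching rule.

    Assumed rule: entropy going down -> 'explicit', entropy going up ->
    'latent', unchanged entropy -> keep the current mode. Once the switch
    budget is spent, the mode is frozen to curb overthinking.
    """
    if switches_used >= max_switches:
        return mode, switches_used
    if cur_entropy < prev_entropy:
        wanted = "explicit"
    elif cur_entropy > prev_entropy:
        wanted = "latent"
    else:
        wanted = mode
    if wanted != mode:
        return wanted, switches_used + 1
    return mode, switches_used


# Hypothetical distributions: a flat one followed by a peaked one.
flat = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
mode, used = next_mode("latent", shannon_entropy(flat),
                       shannon_entropy(peaked), switches_used=0, max_switches=2)
print(mode, used)
```

A block-wise version would compare entropies aggregated over a block of steps rather than two single steps; that aggregation is left out to keep the sketch short.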
Human-Computer Interaction
Beyond Satisfaction: From Placebic to Actionable Explanations For Enhanced Understandability AAMAS 2025
Explainable AI (XAI) presents useful tools to facilitate transparency and trustworthiness in machine learning systems. However, current evaluations of system explainability often rely heavily on subjective user surveys, which may not adequately capture the effectiveness of explanations. This paper critiques the overreliance on user satisfaction metrics and explores whether these can differentiate between meaningful (actionable) and vacuous (placebic) explanations. In experiments involving optimal Social Security filing age selection tasks, participants used one of three protocols: no explanations, placebic explanations, and actionable explanations. Participants who received actionable explanations significantly outperformed the other groups in objective measures of their mental model, but users rated placebic and actionable explanations as equally satisfying. This suggests that subjective surveys alone fail to capture whether explanations truly support users in building useful domain understanding. We propose that future evaluations of agent explanation capabilities should integrate objective task performance metrics alongside subjective assessments to more accurately measure explanation quality. The code for this study can be found at https://github.com/Shymkis/social-security-explainer.
comment: 21 pages, 7 figures, 6 tables. EXTRAAMAS 2025 submission. Preprint version
"Having Confidence in My Confidence Intervals": How Data Users Engage with Privacy-Protected Wikipedia Data
In response to calls for open data and growing privacy threats, organizations are increasingly adopting privacy-preserving techniques such as differential privacy (DP) that inject statistical noise when generating published datasets. These techniques are designed to protect privacy of data subjects while enabling useful analyses, but their reception by data users is under-explored. We developed documentation that presents the noise characteristics of two Wikipedia pageview datasets: one using rounding (heuristic privacy) and another using DP (formal privacy). After incorporating expert feedback (n=5), we used these documents to conduct a task-based contextual inquiry (n=15) exploring how data users--largely unfamiliar with these methods--perceive, interact with, and interpret privacy-preserving noise during data analysis. Participants readily used simple uncertainty metrics from the documentation, but struggled when asked to compute confidence intervals across multiple noisy estimates. They were better able to devise simulation-based approaches for computing uncertainty with DP data compared to rounded data. Surprisingly, several participants incorrectly believed DP's stronger utility implied weaker privacy protections. Based on our findings, we offer design recommendations for documentation and tools to better support data users working with privacy-noised data.
Cenergy3: An API for City Energy 3D Modeling
The efficient management and planning of urban energy systems require integrated three-dimensional (3D) models that accurately represent both consumption nodes and distribution networks. This paper introduces a geospatial Application Programming Interface (API) that we developed to automate the generation of 3D urban digital models from open data. The API synthesizes data from OpenTopography, OpenStreetMap, and Overture Maps to generate the 3D models. The rendered model visualizes and contextualizes power grid infrastructure alongside the built environment and transportation networks. The API provides interactive figures for the 3D models, which are essential for analyzing infrastructure alignment and spatially linking energy demand nodes (buildings) with energy supply (utility grids). Our API leverages standard Web Mercator coordinates (EPSG:3857) and JSON serialization to ensure interoperability within smart city and energy simulation platforms.
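To make the coordinate convention mentioned above concrete, here is a minimal sketch of the standard spherical Web Mercator (EPSG:3857) forward and inverse projection, with the result serialized as JSON. It is not the Cenergy3 API: the function names, the `example_building` record, and its coordinates are hypothetical.

```python
import json
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS 84 semi-major axis)


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS 84 longitude/latitude in degrees to EPSG:3857 metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def web_mercator_to_lonlat(x, y):
    """Inverse projection from EPSG:3857 metres back to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


# Hypothetical demand node; the name and coordinates are for illustration only.
x, y = lonlat_to_web_mercator(-73.9857, 40.7484)
record = {"name": "example_building", "crs": "EPSG:3857", "x": x, "y": y}
print(json.dumps(record))
```

The projection is undefined at the poles, so latitudes should stay within the usual Web Mercator limit of roughly 85.05 degrees north and south.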
CommentScope: A Comment-Embedded Assisted Reading System for a Long Text
Long texts are ubiquitous on social platforms, yet readers often face information overload and struggle to locate key content. Comments provide valuable external perspectives for understanding, questioning, and complementing the text, but their potential is hindered by disorganized and unstructured presentation. Few studies have explored embedding comments directly into reading. As an exploratory step, we propose CommentScope, a system with two core modules: a pipeline that classifies comments into five types and aligns them with relevant sentences, and a presentation module that integrates comments inline or as side notes, supported by visual cues such as colors, charts, and highlights. Technical evaluation shows that the hybrid "Rule+LLM" pipeline achieved solid performance in semantic classification (accuracy=0.90) and position alignment (accuracy=0.88). A user study (N=12) further demonstrated that the sentence-end embedding significantly improved comment discovery accuracy and reading fluency while reducing mental demand and perceived effort.
comment: 32 pages, 9 figures. Submitted to CHI 2025
JEEVHITAA -- An End-to-End HCAI System to Support Collective Care
Current mobile health platforms are predominantly individual-centric and lack the necessary primitives for coordinated, auditable, multi-actor workflows. However, in many settings worldwide, health decisions are enacted by multi-actor care networks rather than single users. We present JEEVHITAA, an Android/Flutter system that provides context-sensitive, role-aware sharing and verifiable information flows for care circles. JEEVHITAA ingests platform and device data (via Google Health Connect and BLE connectors), constructs multi-layer user profiles from sensor streams and tiered onboarding, and enforces fine-grained, time-bounded access control across permissioned care graphs. Data are end-to-end encrypted in local stores and during peer sync (Firebase), and provisions are made for document capture by camera or upload as PDF. An integrated retrieval-augmented LLM pipeline (i) produces structured, role-targeted summaries and action plans, (ii) enables users to gather advanced insights on health reports, and (iii) performs evidence-grounded user-relevant verification of arbitrary health content, returning provenance, confidence scores, and source citations. We describe the system architecture, connector abstractions, and security primitives, and evaluate robustness and compatibility using synthetic, ontology-driven simulations and vendor compatibility tests. Finally, we outline plans for longitudinal in-the-wild deployments to measure system performance, the correctness of access control, and the real-world effectiveness of relationship-aware credibility support.
comment: 14 pages, 4 figures, 6 tables, 2 algorithms, Preprint of work Submitted to MobiSys 2026 - Emergent Ideas Track
The Missing Variable: Socio-Technical Alignment in Risk Evaluation
This paper addresses a critical gap in the risk assessment of AI-enabled safety-critical systems. While these systems, in which AI systems assist human operators, function as complex socio-technical systems, existing risk evaluation methods fail to account for the associated complex interaction between human, technical, and organizational elements. Through a comparative analysis of system attributes from both socio-technical and AI-enabled systems and a review of current risk evaluation methods, we confirm the absence of socio-technical considerations in standard risk expressions. To bridge this gap, we introduce a novel socio-technical alignment $STA$ variable designed to be integrated into the foundational risk equation. This variable estimates the degree of harmonious interaction between the AI systems, human operators, and organizational processes. A case study on an AI-enabled liquid hydrogen bunkering system demonstrates the variable's relevance. By comparing a naive and a safeguarded system design, we illustrate how the $STA$-augmented expression captures socio-technical safety implications that traditional risk evaluation overlooks, providing a more holistic basis for risk evaluation.
Clinician-Directed Large Language Model Software Generation for Therapeutic Interventions in Physical Rehabilitation
Digital health interventions increasingly deliver home exercise programs via sensor-equipped devices such as smartphones, enabling remote monitoring of adherence and performance. However, current software is usually authored before clinical encounters as libraries of modules for broad impairment categories. At the point of care, clinicians can only choose from these modules and adjust a few parameters (for example, duration or repetitions). As a result, individual limitations, goals, and environmental constraints are often not reflected, limiting personalization and benefit. We propose a paradigm in which large language models (LLMs) act as constrained translators that convert clinicians' exercise prescriptions into intervention software. Clinicians remain the decision makers: they design exercises during the encounter, tailored to each patient's impairments, goals, and environment, and the LLM generates matching software. We conducted a prospective single-arm feasibility study with 20 licensed physical and occupational therapists who created 40 individualized upper extremity programs for a standardized patient; 100% of prescriptions were translated into executable software, compared with 55% under a representative template-based digital health intervention (p < 0.01). LLM-generated software correctly delivered 99.7% of instructions and monitored performance with 88.4% accuracy (95% confidence interval, 0.843-0.915). Overall, 90% of therapists judged the system safe for patient interaction and 75% expressed willingness to adopt it in practice. To our knowledge, this is the first prospective evaluation of clinician-directed intervention software generation with an LLM in health care, demonstrating feasibility and motivating larger trials in real patient populations.
ConsentDiff at Scale: Longitudinal Audits of Web Privacy Policy Changes and UI Frictions
Web privacy is experienced via two public artifacts: what sites state in their policy texts, and the actions users are required to take in consent interfaces. Across the extensive cross-sectional audits we have studied, longitudinal data is lacking on how these artifacts change together and whether interfaces actually do what the policies promise. ConsentDiff provides that longitudinal view. We build a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by combining DOM signals with cues from screenshots. We introduce a novel weighted claim-UI alignment score, connecting common policy claims to observable predicates and enabling comparisons over time, regions, and verticals. Our measurements suggest continued policy churn, systematic changes to eliminate a higher-friction banner design, and significantly higher alignment where rejecting is visible and lower-friction.
Living with Data: Exploring Physicalization Approaches to Sedentary Behavior Intervention for Older Adults in Everyday Life
Sedentary behavior is a critical health risk for older adults. Although digital interventions are widely available, they primarily rely on screen-based notifications that can feel clinical or cognitively demanding, and are thus often ignored over time. This paper presents a three-phase Research through Design methodology to explore data physicalization approaches that ambiently represented sedentary data patterns using decor artifacts in older adults' homes. These artifacts transformed abstract data into aesthetic, evolving forms that became part of the domestic landscape. Our research revealed how these physicalizations fostered self-reflection and family conversations, and encouraged active lifestyles. We demonstrate how qualities like aesthetic ambiguity and slow revelation can empower older adults, fostering a reflective relationship with their well-being. Ultimately, we argue that creating data physicalizations for older adults necessitates a shift from merely informing users to enabling them to live with, and through, their data.
The Endless Tuning. An Artificial Intelligence Design To Avoid Human Replacement and Trace Back Responsibilities
The Endless Tuning is a design method for a reliable deployment of artificial intelligence based on a double mirroring process, which pursues the goals of both avoiding human replacement and filling the so-called responsibility gap (Matthias 2004). Originally depicted in (Fabris et al. 2024) and following the relational approach urged therein, it was then actualized in a protocol, implemented in three prototypical applications regarding decision-making processes (respectively: loan granting, pneumonia diagnosis, and art style recognition) and tested with as many domain experts. The present study illustrates the protocol step by step, giving insights that concretely show a different voice (Gilligan 1993) in the ethics of artificial intelligence, and provides a philosophical account of technical choices (e.g., a reversed and hermeneutic deployment of XAI algorithms) together with the results of the experiments, focusing on user experience rather than statistical accuracy. Even though deep learning models were employed throughout, the interviewees perceived full control in the decision-making setting, and it appeared that a bridge can be built between accountability and liability in case of damage.
Conch: Competitive Debate Analysis via Visualizing Clash Points and Hierarchical Strategies
In-depth analysis of competitive debates is essential for participants to develop argumentative skills, refine strategies, and further improve their debating performance. However, manual analysis of unstructured and unlabeled textual records of debating is time-consuming and ineffective, as it is challenging to reconstruct contextual semantics and track logical connections from raw data. To address this, we propose Conch, an interactive visualization system that systematically analyzes both what is debated and how it is debated. In particular, we propose a novel parallel spiral visualization that compactly traces the multidimensional evolution of clash points and participant interactions throughout the debate process. In addition, we leverage large language models with well-designed prompts to automatically identify critical debate elements such as clash points, disagreements, viewpoints, and strategies, enabling participants to understand the debate context comprehensively. Finally, through two case studies on real-world debates and a carefully designed user study, we demonstrate Conch's effectiveness and usability for competitive debate analysis.
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
While the advancement of large language models has spurred the development of AI agents to automate tasks, numerous use cases inherently require agents to collaborate with humans due to humans' latent preferences, domain expertise, or the need for control. To facilitate the study of human-agent collaboration, we introduce Collaborative Gym (Co-Gym), an open framework for developing and evaluating collaborative agents that engage in bidirectional communication with humans while interacting with task environments. We describe how the framework enables the implementation of new task environments and coordination between humans and agents through a flexible, non-turn-taking interaction paradigm, along with an evaluation suite that assesses both collaboration outcomes and processes. Our framework provides both a simulated condition with a reliable user simulator and a real-world condition with an interactive web application. Initial benchmark experiments across three representative tasks -- creating travel plans, writing related work sections, and analyzing tabular data -- demonstrate the benefits of human-agent collaboration: The best-performing collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. Despite these improvements, our evaluation reveals persistent limitations in current language models and agents, with communication and situational awareness failures observed in 65% and 40% of cases in the real condition, respectively. Released under the permissive MIT license, Co-Gym supports the addition of new task environments and can be used to develop collaborative agent applications, while its evaluation suite enables assessment and improvement of collaborative agents.
comment: Preprint
Computer Vision and Pattern Recognition
EditThinker: Unlocking Iterative Reasoning for Any Image Editor
Instruction-based image editing has emerged as a prominent research area which, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that lets image editors 'think' while they edit. It simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until the result is satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.
comment: Project page: https://appletea233.github.io/think-while-edit
AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement
Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. At the same time, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder-decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.
M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
comment: Preprint
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at https://simpact-bot.github.io
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
comment: Code is available at https://github.com/Princeton-AI2-Lab/ZoomClick
Measuring the Effect of Background on Classification and Feature Importance in Deep Learning for AV Perception
Common approaches to explainable AI (XAI) for deep learning focus on analyzing the importance of input features on the classification task in a given model: saliency methods like SHAP and GradCAM are used to measure the impact of spatial regions of the input image on the classification result. Combined with ground truth information about the location of the object in the input image (e.g., a binary mask), it is determined whether object pixels had a high impact on the classification result, or whether the classification focused on background pixels. The former is considered to be a sign of a healthy classifier, whereas the latter is assumed to suggest overfitting on spurious correlations. A major challenge, however, is that these intuitive interpretations are difficult to test quantitatively, and hence the output of such explanations lacks an explanation itself. One particular reason is that correlations in real-world data are difficult to avoid, and whether they are spurious or legitimate is debatable. Synthetic data, in turn, make it possible to actively enable or disable correlations where desired, but often lack sufficient quantification of realism and stochastic properties. [...] Therefore, we systematically generate six synthetic datasets for the task of traffic sign recognition, which differ only in their degree of camera variation and background correlation [...] to quantify the isolated influence of background correlation, different levels of camera variation, and considered traffic sign shapes on the classification performance, as well as background feature importance. [...] Results include a quantification of when and how much background features gain importance to support the classification task based on changes in the training domain [...]. Download: synset.de/datasets/synset-signset-ger/background-effect
comment: 8 pages, 2 figures, 7 tables
Synset Signset Germany: a Synthetic Dataset for German Traffic Sign Recognition
In this paper, we present a synthesis pipeline and dataset for training / testing data in the task of traffic sign recognition that combines the advantages of data-driven and analytical modeling: GAN-based texture generation enables data-driven dirt and wear artifacts, rendering unique and realistic traffic sign surfaces, while the analytical scene modulation achieves physically correct lighting and allows detailed parameterization. In particular, the latter opens up applications in the context of explainable AI (XAI) and robustness tests due to the possibility of evaluating the sensitivity to parameter changes, which we demonstrate with experiments. Our resulting synthetic traffic sign recognition dataset Synset Signset Germany contains a total of 105500 images of 211 different German traffic sign classes, including newly published (2020) and thus comparatively rare traffic signs. In addition to a mask and a segmentation image, we also provide extensive metadata including the stochastically selected environment and imaging effect parameters for each image. We evaluate the degree of realism of Synset Signset Germany on the real-world German Traffic Sign Recognition Benchmark (GTSRB) and in comparison to CATERED, a state-of-the-art synthetic traffic sign recognition dataset.
comment: 8 pages, 8 figures, 3 tables
Physically-Based Simulation of Automotive LiDAR
We present an analytic model for simulating automotive time-of-flight (ToF) LiDAR that includes blooming, echo pulse width, and ambient light, along with steps to determine model parameters systematically through optical laboratory measurements. The model uses physically based rendering (PBR) in the near-infrared domain. It assumes single-bounce reflections and retroreflections over rasterized rendered images from shading or ray tracing, including light emitted from the sensor as well as stray light from other, non-correlated sources such as sunlight. Beams from the sensor and sensitivity of the receiving diodes are modeled with flexible beam steering patterns and with non-vanishing diameter. Different (all non-real time) computational approaches can be chosen based on system properties, computing capabilities, and desired output properties. Model parameters include system-specific properties, namely the physical spread of the LiDAR beam, combined with the sensitivity of the receiving diode; the intensity of the emitted light; the conversion between the intensity of reflected light and the echo pulse width; and scenario parameters such as environment lighting, positioning, and surface properties of the target(s) in the relevant infrared domain. System-specific properties of the model are determined from laboratory measurements of the photometric luminance on different target surfaces aligned with a goniometer at 0.01° resolution, which marks the best available resolution for measuring the beam pattern. The approach is calibrated for and tested on two automotive LiDAR systems, the Valeo Scala Gen. 2 and the Blickfeld Cube 1. Both systems differ notably in their properties and available interfaces, but the relevant model parameters could be extracted successfully.
A Comparative Study on Synthetic Facial Data Generation Techniques for Face Recognition
Facial recognition has become a widely used method for authentication and identification, with applications for secure access and locating missing persons. Its success is largely attributed to deep learning, which leverages large datasets and effective loss functions to learn discriminative features. Despite these advances, facial recognition still faces challenges in explainability, demographic bias, privacy, and robustness to aging, pose variations, lighting changes, occlusions, and facial expressions. Privacy regulations have also led to the degradation of several datasets, raising legal, ethical, and privacy concerns. Synthetic facial data generation has been proposed as a promising solution. It mitigates privacy issues, enables experimentation with controlled facial attributes, alleviates demographic bias, and provides supplementary data to improve models trained on real data. This study compares the effectiveness of synthetic facial datasets generated using different techniques in facial recognition tasks. We evaluate accuracy, rank-1, rank-5, and the true positive rate at a false positive rate of 0.01% on eight leading datasets, offering a comparative analysis not extensively explored in the literature. Results demonstrate the ability of synthetic data to capture realistic variations while emphasizing the need for further research to close the performance gap with real data. Techniques such as diffusion models, GANs, and 3D models show substantial progress; however, challenges remain.
comment: 18 pages, 17 figures
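The metric "true positive rate at a false positive rate of 0.01%" from the abstract above can be made concrete with a small sketch. The function name `tpr_at_fpr` and the score lists are hypothetical; the code assumes higher scores mean "same identity" and is not taken from the study.

```python
import math


def tpr_at_fpr(genuine, impostor, fpr):
    """TPR at the loosest threshold that admits at most `fpr` of impostor pairs.

    `genuine` and `impostor` are similarity scores for matching and
    non-matching pairs; a pair is accepted when its score exceeds the threshold.
    """
    ranked = sorted(impostor, reverse=True)
    # Number of impostor pairs that may be (wrongly) accepted.
    allowed = math.floor(fpr * len(ranked) + 1e-9)
    # The highest-scoring impostor that must still be rejected sets the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for s in genuine if s > threshold)
    return accepted / len(genuine)
```

Note that at a false positive rate of 0.01%, fewer than 10,000 impostor pairs allow zero false accepts, so the threshold sits at the single highest impostor score; the metric only becomes informative with large impostor sets.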
World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty
Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model's uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in the RGB space for intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.
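To illustrate why strictly proper scoring rules encourage calibration, the sketch below uses the Brier score, a standard strictly proper rule for binary outcomes. It is a toy example, not the C3 training objective, and the helper names are assumptions.

```python
def brier(q, outcome):
    """Brier score of forecast probability q for a binary outcome (0 or 1)."""
    return (q - outcome) ** 2


def expected_brier(q, p):
    """Expected Brier score of forecast q when the event truly occurs with probability p."""
    return p * brier(q, 1) + (1 - p) * brier(q, 0)


def best_forecast(p, steps=100):
    """Grid-search the forecast that minimizes the expected score."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda q: expected_brier(q, p))
```

Because the expected score is uniquely minimized by reporting the true probability, a model trained against such a rule has no incentive to be over- or under-confident.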
LPD: Learnable Prototypes with Diversity Regularization for Weakly Supervised Histopathology Segmentation
Weakly supervised semantic segmentation (WSSS) in histopathology reduces pixel-level labeling by learning from image-level labels, but it is hindered by inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage (global pooling-based class activation maps whose activations highlight only the most distinctive areas and miss nearby class regions). Recent works address these challenges by constructing a clustering prototype bank and then refining masks in a separate stage; however, such two-stage pipelines are costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning, limiting their effectiveness and efficiency. We propose a cluster-free, one-stage learnable-prototype framework with diversity regularization to enhance morphological intra-class heterogeneity coverage. Our approach achieves state-of-the-art (SOTA) performance on BCSS-WSSS, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels, and activation heatmaps further reveal that, compared with clustering-based prototypes, our learnable prototypes cover more diverse and complementary regions within each class, providing consistent qualitative evidence for their effectiveness.
comment: Note: Khang Le and Anh Mai Vu contributed equally
NICE: Neural Implicit Craniofacial Model for Orthognathic Surgery Prediction
Orthognathic surgery is a crucial intervention for correcting dentofacial skeletal deformities to enhance occlusal functionality and facial aesthetics. Accurate postoperative facial appearance prediction remains challenging due to the complex nonlinear interactions between skeletal movements and facial soft tissue. Existing biomechanical models, parametric models, and deep-learning approaches either lack computational efficiency or fail to fully capture these intricate interactions. To address these limitations, we propose the Neural Implicit Craniofacial Model (NICE), which employs implicit neural representations for accurate anatomical reconstruction and surgical outcome prediction. NICE comprises a shape module, which employs region-specific implicit Signed Distance Function (SDF) decoders to reconstruct the facial surface, maxilla, and mandible, and a surgery module, which employs region-specific deformation decoders. These deformation decoders are driven by a shared surgical latent code to effectively model the complex, nonlinear biomechanical response of the facial surface to skeletal movements, incorporating anatomical prior knowledge. The deformation decoders output point-wise displacement fields, enabling precise modeling of surgical outcomes. Extensive experiments demonstrate that NICE outperforms current state-of-the-art methods, notably improving prediction accuracy in critical facial regions such as lips and chin, while robustly preserving anatomical integrity. This work provides a clinically viable tool for enhanced surgical planning and patient consultation in orthognathic procedures.
SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations
Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.
Underwater Image Reconstruction Using a Swin Transformer-Based Generator and PatchGAN Discriminator
Underwater imaging is essential for marine exploration, environmental monitoring, and infrastructure inspection. However, water causes severe image degradation through wavelength-dependent absorption and scattering, resulting in color distortion, low contrast, and haze effects. Traditional reconstruction methods and convolutional neural network-based approaches often fail to adequately address these challenges due to limited receptive fields and an inability to model global dependencies. This paper presents a novel deep learning framework that integrates a Swin Transformer architecture within a generative adversarial network (GAN) for underwater image reconstruction. Our generator employs a U-Net structure with Swin Transformer blocks to capture both local features and long-range dependencies crucial for color correction across entire images. A PatchGAN discriminator provides adversarial training to ensure high-frequency detail preservation. We train and evaluate our model on the EUVP dataset, which contains paired underwater images of varying quality. Quantitative results demonstrate state-of-the-art performance with a PSNR of 24.76 dB and an SSIM of 0.89, representing significant improvements over existing methods. Visual results show effective color balance restoration, contrast improvement, and haze reduction. An ablation study confirms the superiority of our Swin Transformer design over convolutional alternatives. The proposed method offers robust underwater image reconstruction suitable for various marine applications.
comment: This paper has been accepted for presentation at the IEEE 28th International Conference on Computer and Information Technology (ICCIT), December 2025
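For readers unfamiliar with the PSNR figure quoted above, the standard definition is 10 * log10(MAX^2 / MSE). The sketch below computes it over flat lists of pixel values and inverts it; the function names are chosen for illustration, and intensities are assumed to lie in [0, 1].

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))
```

Under this definition, a PSNR of 24.76 dB corresponds to a mean squared error of roughly 0.0033 on a [0, 1] intensity scale, i.e. a root-mean-square error of about 0.058.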
Edit-aware RAW Reconstruction
Users frequently edit camera images post-capture to achieve their preferred photofinishing style. While editing in the RAW domain provides greater accuracy and flexibility, most edits are performed on the camera's display-referred output (e.g., 8-bit sRGB JPEG) since RAW images are rarely stored. Existing RAW reconstruction methods can recover RAW data from sRGB images, but these approaches are typically optimized for pixel-wise RAW reconstruction fidelity and tend to degrade under diverse rendering styles and editing operations. We introduce a plug-and-play, edit-aware loss function that can be integrated into any existing RAW reconstruction framework to make the recovered RAWs more robust to different rendering styles and edits. Our loss formulation incorporates a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, parameters for each ISP module are randomly sampled from carefully designed distributions that model practical variations in real camera processing. The loss is then computed in sRGB space between ground-truth and reconstructed RAWs rendered through this differentiable ISP. Incorporating our loss improves sRGB reconstruction quality by up to 1.5-2 dB PSNR across various editing conditions. Moreover, when applied to metadata-assisted RAW reconstruction methods, our approach enables fine-tuning for target edits, yielding further gains. Since photographic editing is the primary motivation for RAW reconstruction in consumer imaging, our simple yet effective loss function provides a general mechanism for enhancing edit fidelity and rendering flexibility across existing methods.
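The loss described above, comparing ground-truth and reconstructed RAWs after rendering both through a randomly parameterized ISP, can be sketched as follows. This is a toy stand-in under stated assumptions: the three-stage ISP (exposure gain, white balance, gamma), the sampling ranges, and the names `render` and `edit_aware_loss` are hypothetical, and a real implementation would use an autodiff framework so the ISP is differentiable.

```python
import random


def render(raw, gain, wb, gamma):
    """Toy ISP: exposure gain, per-channel white balance, clipping, gamma encoding."""
    return [
        tuple(min(max(c * gain * w, 0.0), 1.0) ** (1.0 / gamma)
              for c, w in zip(px, wb))
        for px in raw
    ]


def edit_aware_loss(raw_pred, raw_gt, n_samples=8, seed=0):
    """Mean squared error in rendered space, averaged over random ISP settings.

    `raw_pred` and `raw_gt` are lists of (r, g, b) tuples in [0, 1].
    The sampling ranges below are illustrative assumptions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        gain = rng.uniform(0.5, 2.0)
        wb = (rng.uniform(0.8, 1.2), 1.0, rng.uniform(0.8, 1.2))
        gamma = rng.uniform(1.8, 2.6)
        a = render(raw_pred, gain, wb, gamma)
        b = render(raw_gt, gain, wb, gamma)
        sq_err = sum((u - v) ** 2 for pa, pb in zip(a, b) for u, v in zip(pa, pb))
        total += sq_err / (3 * len(a))
    return total / n_samples
```

Because the same sampled parameters are applied to both images, the loss penalizes RAW errors in proportion to how visible they become after editing, rather than in the RAW domain itself.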
VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack
Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, additional modalities introduce additional vulnerabilities that can be exploited by jailbreak attacks, which induce MLLMs to output harmful content. Given the strong reasoning ability of MLLMs, previous jailbreak attacks have explored reasoning safety risks in the text modality, while similar threats in the visual modality have been largely overlooked. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated image, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency. A series of experiments demonstrates that VRSA achieves a higher attack success rate than state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.
Phase-OTDR Event Detection Using Image-Based Data Transformation and Deep Learning
This study focuses on event detection in optical fibers, specifically classifying six events using the Phase-OTDR system. A novel approach is introduced to enhance Phase-OTDR data analysis by transforming 1D data into grayscale images through techniques such as Gramian Angular Difference Field, Gramian Angular Summation Field, and Recurrence Plot. These grayscale images are combined into a multi-channel RGB representation, enabling more robust and adaptable analysis using transfer learning models. The proposed methodology achieves high classification accuracies of 98.84% and 98.24% with the EfficientNetB0 and DenseNet121 models, respectively. A 5-fold cross-validation process confirms the reliability of these models, with test accuracy rates of 99.07% and 98.68%. Using a publicly available Phase-OTDR dataset, the study demonstrates an efficient approach to understanding optical fiber events while reducing dataset size and improving analysis efficiency. The results highlight the transformative potential of image-based analysis in interpreting complex fiber optic sensing data, offering significant advancements in the accuracy and reliability of fiber optic monitoring systems. The codes and the corresponding image-based dataset are made publicly available on GitHub to support further research: https://github.com/miralab-ai/Phase-OTDR-event-detection.
comment: 22 pages, 11 figures, 5 tables
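The three 1D-to-image transforms named in the abstract above have simple standard definitions, sketched below in plain Python. The channel order in `stack_channels` and the recurrence threshold are assumptions for illustration; the authors' repository should be consulted for their actual preprocessing.

```python
import math


def rescale(series):
    """Min-max rescale a series to [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF): cos(phi_i + phi_j) and sin(phi_i - phi_j)."""
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in rescale(series)]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


def recurrence_plot(series, threshold):
    """Binary matrix marking pairs of samples closer than `threshold`."""
    return [[1.0 if abs(a - b) <= threshold else 0.0 for b in series]
            for a in series]


def stack_channels(first, second, third):
    """Combine three n x n matrices into one n x n image of 3-tuples."""
    n = len(first)
    return [[(first[i][j], second[i][j], third[i][j]) for j in range(n)]
            for i in range(n)]
```

A signal of length n becomes an n x n three-channel image, which is what allows image classifiers pretrained on RGB data to be reused through transfer learning.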
Multimodal Oncology Agent for IDH1 Mutation Prediction in Low-Grade Glioma
Low-grade gliomas frequently present IDH1 mutations that define clinically distinct subgroups with specific prognostic and therapeutic implications. This work introduces a Multimodal Oncology Agent (MOA) integrating a histology tool based on the TITAN foundation model for IDH1 mutation prediction in low-grade glioma, combined with reasoning over structured clinical and genomic inputs through PubMed, Google Search, and OncoKB. MOA reports were quantitatively evaluated on 488 patients from the TCGA-LGG cohort against clinical and histology baselines. MOA without the histology tool outperformed the clinical baseline, achieving an F1-score of 0.826 compared to 0.798. When fused with histology features, MOA reached the highest performance with an F1-score of 0.912, exceeding both the histology baseline at 0.894 and the fused histology-clinical baseline at 0.897. These results demonstrate that the proposed agent captures complementary mutation-relevant information enriched through external biomedical sources, enabling accurate IDH1 mutation prediction.
comment: 4 pages, 2 figures
UG-FedDA: Uncertainty-Guided Federated Domain Adaptation for Multi-Center Alzheimer's Disease Detection
Alzheimer's disease (AD) is an irreversible neurodegenerative disorder, and early diagnosis is critical for timely intervention. However, most existing classification frameworks face challenges in multicenter studies, as they often neglect inter-site heterogeneity and lack mechanisms to quantify uncertainty, which limits their robustness and clinical applicability. To address these issues, we propose Uncertainty-Guided Federated Domain Adaptation (UG-FedDA), a novel multicenter AD classification framework that integrates uncertainty quantification (UQ) with federated domain adaptation to handle cross-site structural magnetic resonance imaging (MRI) heterogeneity under privacy constraints. Our approach extracts multi-template region-of-interest (RoI) features using a self-attention transformer, capturing both regional representations and their interactions. UQ is integrated to guide feature alignment, mitigating source-target distribution shifts by down-weighting uncertain samples. Experiments are conducted on three public datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarkers and Lifestyle study (AIBL), and the Open Access Series of Imaging Studies (OASIS). UG-FedDA achieves consistent cross-domain improvements in accuracy, sensitivity, and area under the ROC curve across three classification tasks: AD vs. normal controls (NC), mild cognitive impairment (MCI) vs. AD, and NC vs. MCI. For NC vs. AD, UG-FedDA achieves accuracies of 90.54%, 89.04%, and 77.78% on the ADNI, AIBL, and OASIS datasets, respectively. For MCI vs. AD, accuracies are 80.20% (ADNI), 71.91% (AIBL), and 79.73% (OASIS). For NC vs. MCI, results are 76.87% (ADNI), 73.91% (AIBL), and 83.73% (OASIS). These results demonstrate that the proposed framework not only adapts efficiently across multiple sites but also preserves strict privacy.
comment: The code is already available on GitHub: https://github.com/chenzhao2023/UG_FADDA_AlzhemiersClassification
Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation
Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
comment: This work has been submitted to the IEEE for possible publication
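The viewpoint invariance claimed for the instance-centric representation above follows from describing each pair of scene elements only by their relative pose. The sketch below shows this for planar poses (x, y, heading); it is an illustration of the general idea, not the paper's encoder, and the function names are assumptions.

```python
import math


def relative_pose(pose_i, pose_j):
    """Pose of element j expressed in the local frame of element i.

    Poses are (x, y, heading) with heading in radians.
    """
    xi, yi, ti = pose_i
    xj, yj, tj = pose_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Rotate the world-frame offset by -heading_i into i's frame.
    rel_x = c * dx + s * dy
    rel_y = -s * dx + c * dy
    # Wrap the heading difference to (-pi, pi].
    rel_t = math.atan2(math.sin(tj - ti), math.cos(tj - ti))
    return rel_x, rel_y, rel_t


def transform(pose, shift, angle):
    """Apply a global rotation by `angle` followed by a translation `shift`."""
    x, y, t = pose
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], t + angle)
```

Since `relative_pose` returns the same values after any global `transform` of both inputs, features computed from it do not depend on the choice of world origin, which is what makes cached map tokens reusable across simulation steps.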
Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling
Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
comment: Extended abstract at World Modeling Workshop 2026
Bring Your Dreams to Life: Continual Text-to-Video Customization
Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Besides, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG models. The code is available at https://github.com/JiahuaDong/CCVD.
comment: Accepted to AAAI2026
Curvature-Regularized Variational Autoencoder for 3D Scene Reconstruction from Sparse Depth
When depth sensors provide only 5% of needed measurements, reconstructing complete 3D scenes becomes difficult. Autonomous vehicles and robots cannot tolerate the geometric errors that sparse reconstruction introduces. We propose curvature regularization through a discrete Laplacian operator, achieving 18.1% better reconstruction accuracy than standard variational autoencoders. Our contribution challenges an implicit assumption in geometric deep learning: that combining multiple geometric constraints improves performance. A single well-designed regularization term not only matches but exceeds the effectiveness of complex multi-term formulations. The discrete Laplacian offers stable gradients and noise suppression with just 15% training overhead and zero inference cost. Code and models are available at https://github.com/Maryousefi/GeoVAE-3D.
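To make the regularizer in the abstract above concrete, here is a minimal sketch of a curvature penalty built from a 4-neighbour discrete Laplacian. It assumes the reconstruction is a depth map on a regular grid; the function name, the stencil, and the mean-squared reduction are illustrative assumptions, not details taken from the paper or its repository.

```python
def laplacian_penalty(depth):
    """Mean squared 4-neighbour discrete Laplacian over interior pixels.

    `depth` is a list of equal-length rows (a hypothetical dense depth map).
    """
    rows, cols = len(depth), len(depth[0])
    total, count = 0.0, 0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            lap = (depth[i - 1][j] + depth[i + 1][j]
                   + depth[i][j - 1] + depth[i][j + 1]
                   - 4.0 * depth[i][j])
            total += lap * lap
            count += 1
    return total / count if count else 0.0


# A tilted plane has zero discrete curvature; a single bump does not.
plane = [[0.5 * i + 0.25 * j for j in range(5)] for i in range(5)]
bumpy = [row[:] for row in plane]
bumpy[2][2] += 1.0

plane_penalty = laplacian_penalty(plane)
bumpy_penalty = laplacian_penalty(bumpy)
```

In a training loop, a term like this would be added to the VAE objective with some weight. Because it is only evaluated during training, it adds no inference cost, which is consistent with the abstract's claim.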
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
comment: Website: https://activevideoperception.github.io/
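The plan-observe-reflect loop described in the abstract above can be sketched as a small control loop. Everything here is hypothetical: the three callables stand in for the MLLM planner, observer, and reflector, and the toy "video" is a dictionary of timestamped captions rather than pixels.

```python
def active_perception(query, plan, observe, reflect, max_rounds=5):
    """Run plan -> observe -> reflect until the reflector returns an answer."""
    evidence = []
    for round_idx in range(1, max_rounds + 1):
        action = plan(query, evidence)        # decide what/when/where to look
        evidence.extend(observe(action))      # collect time-stamped evidence
        answer = reflect(query, evidence)     # None means "not sufficient yet"
        if answer is not None:
            return answer, evidence, round_idx
    return None, evidence, max_rounds


# Toy stand-ins (assumptions for illustration only).
video = {0: "empty kitchen", 60: "person enters", 120: "person pours coffee"}

def plan(query, evidence):
    seen = {t for t, _ in evidence}
    return min(t for t in video if t not in seen)

def observe(timestamp):
    return [(timestamp, video[timestamp])]

def reflect(query, evidence):
    for timestamp, text in evidence:
        if "coffee" in text:
            return timestamp
    return None

answer, evidence, rounds = active_perception(
    "When is coffee poured?", plan, observe, reflect)
```

The early exit is what makes the loop query-dependent: observation stops as soon as the reflector judges the evidence sufficient.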
FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators WACV
We present FNOpt, a self-supervised cloth simulation framework that formulates time integration as an optimization problem and trains a resolution-agnostic neural optimizer parameterized by a Fourier neural operator (FNO). Prior neural simulators often rely on extensive ground truth data or sacrifice fine-scale detail, and generalize poorly across resolutions and motion patterns. In contrast, FNOpt learns to simulate physically plausible cloth dynamics and achieves stable and accurate rollouts across diverse mesh resolutions and motion patterns without retraining. Trained only on a coarse grid with physics-based losses, FNOpt generalizes to finer resolutions, capturing fine-scale wrinkles and preserving rollout stability. Extensive evaluations on a benchmark cloth simulation dataset demonstrate that FNOpt outperforms prior learning-based approaches in out-of-distribution settings in both accuracy and robustness. These results position FNO-based meta-optimization as a compelling alternative to previous neural simulators for cloth, thus reducing the need for curated data and improving cross-resolution reliability.
comment: Accepted for WACV
Label-Efficient Point Cloud Segmentation with Active Learning
Semantic segmentation of 3D point cloud data often comes with high annotation costs. Active learning automates the process of selecting which data to annotate, reducing the total amount of annotation needed to achieve satisfactory performance. Recent approaches to active learning for 3D point clouds are often based on sophisticated heuristics both for splitting point clouds into annotatable regions and for selecting the regions most beneficial for further neural network training. In this work, we propose a novel and easy-to-implement strategy to separate the point cloud into annotatable regions. In our approach, we utilize a 2D grid to subdivide the point cloud into columns. To identify the next data to be annotated, we employ a network ensemble to estimate the uncertainty in the network output. We evaluate our method on the S3DIS dataset, the Toronto-3D dataset, and a large-scale urban 3D point cloud of the city of Freiburg, which we labeled in parts manually. The extensive evaluation shows that our method yields performance on par with, or even better than, complex state-of-the-art methods on all datasets. Furthermore, we provide results suggesting that in the context of point clouds the annotated area can be a more meaningful measure for active learning algorithms than the number of annotated points.
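As a rough illustration of the two ingredients named in the abstract above, the sketch below bins points into vertical columns with a 2D grid and scores uncertainty from an ensemble. The cell size, the function names, and the use of the entropy of the averaged class distribution are assumptions; the abstract does not specify which uncertainty measure the authors use.

```python
import math
from collections import defaultdict


def split_into_columns(points, cell_size):
    """Map each (x, y, z) point index to a 2D grid cell; z is ignored."""
    columns = defaultdict(list)
    for idx, (x, y, _z) in enumerate(points):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        columns[key].append(idx)
    return dict(columns)


def ensemble_entropy(member_probs):
    """Entropy (nats) of the class distribution averaged over ensemble members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)


points = [(0.2, 0.3, 1.0), (0.8, 0.1, 5.0), (1.5, 0.2, 0.0), (-0.1, 0.4, 2.0)]
columns = split_into_columns(points, cell_size=1.0)
agree = ensemble_entropy([[1.0, 0.0], [1.0, 0.0]])
disagree = ensemble_entropy([[1.0, 0.0], [0.0, 1.0]])
```

A selection step would then aggregate per-point uncertainty within each column and send the highest-scoring columns to the annotator.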
USV: Unified Sparsification for Accelerating Video Diffusion Models
The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators -- such as sparse attention and step-distilled samplers -- typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.
HQ-DM: Single Hadamard Transformation-Based Quantization-Aware Training for Low-Bit Diffusion Models
Diffusion models have demonstrated significant applications in the field of image generation. However, their high computational and memory costs pose challenges for deployment. Model quantization has emerged as a promising solution to reduce storage overhead and accelerate inference. Nevertheless, existing quantization methods for diffusion models struggle to mitigate outliers in activation matrices during inference, leading to substantial performance degradation under low-bit quantization scenarios. To address this, we propose HQ-DM, a novel Quantization-Aware Training framework that applies Single Hadamard Transformation to activation matrices. This approach effectively reduces activation outliers while preserving model performance under quantization. Compared to traditional Double Hadamard Transformation, our proposed scheme offers distinct advantages by seamlessly supporting INT convolution operations while preventing the amplification of weight outliers. For conditional generation on the ImageNet 256x256 dataset using the LDM-4 model, our W4A4 and W4A3 quantization schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method.
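The reason a Hadamard transformation helps with activation outliers can be shown with a generic sketch: an orthonormal Hadamard rotation spreads a single large entry across all coordinates while preserving inner products. This illustrates only that general property. It is not HQ-DM's specific single-transform placement, which the abstract above does not describe, and the vectors are made-up values.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(vec):
    """Apply the orthonormal Hadamard transform H / sqrt(n) to a vector."""
    n = len(vec)
    h = hadamard(n)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h[i][j] * vec[j] for j in range(n)) for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


activation = [0.1, -0.2, 8.0, 0.05]   # one outlier channel (illustrative)
weight = [0.5, -1.0, 0.25, 2.0]

rotated_activation = rotate(activation)
peak_before = max(abs(v) for v in activation)
peak_after = max(abs(v) for v in rotated_activation)
product_before = dot(activation, weight)
product_after = dot(rotated_activation, rotate(weight))
```

A smaller peak magnitude means a low-bit quantizer can use a tighter range, which is the usual motivation for rotating activations before quantization.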
Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision
Recently, Vision Large Language Models (VLMs) have demonstrated high potential in computer-aided diagnosis and decision-support. However, current VLMs show deficits in domain-specific surgical scene understanding, such as identifying and explaining anatomical landmarks during Complete Mesocolic Excision. Additionally, there is a need for locally deployable models to avoid patient data leakage to large VLMs hosted outside the clinic. We propose a privacy-preserving framework to distill knowledge from large, general-purpose LLMs into an efficient, local VLM. We generate an expert-supervised dataset by prompting a teacher LLM without sensitive images, using only textual context and binary segmentation masks for spatial information. This dataset is used for Supervised Fine-Tuning (SFT) and subsequent Direct Preference Optimization (DPO) of the locally deployable VLM. Our evaluation confirms that fine-tuning VLMs with our generated datasets increases surgical domain knowledge by a large margin compared to the corresponding base VLMs. Overall, this work validates a data-efficient and privacy-conforming way to train a locally deployable VLM optimized for surgical scene understanding.
Manifold-Aware Point Cloud Completion via Geodesic-Attentive Hierarchical Feature Learning
Point cloud completion seeks to recover geometrically consistent shapes from partial or sparse 3D observations. Although recent methods have achieved reasonable global shape reconstruction, they often rely on Euclidean proximity and overlook the intrinsic nonlinear geometric structure of point clouds, resulting in suboptimal geometric consistency and semantic ambiguity. In this paper, we present a manifold-aware point cloud completion framework that explicitly incorporates nonlinear geometry information throughout the feature learning pipeline. Our approach introduces two key modules: a Geodesic Distance Approximator (GDA), which estimates geodesic distances between points to capture the latent manifold topology, and a Manifold-Aware Feature Extractor (MAFE), which utilizes geodesic-based $k$-NN groupings and a geodesic-relational attention mechanism to guide the hierarchical feature extraction process. By integrating geodesic-aware relational attention, our method promotes semantic coherence and structural fidelity in the reconstructed point clouds. Extensive experiments on benchmark datasets demonstrate that our approach consistently outperforms state-of-the-art methods in reconstruction quality.
OWL: Unsupervised 3D Object Detection by Occupancy Guided Warm-up and Large Model Priors Reasoning AAAI
Unsupervised 3D object detection leverages heuristic algorithms to discover potential objects, offering a promising route to reduce annotation costs in autonomous driving. Existing approaches mainly generate pseudo-labels and refine them through self-training iterations. However, these pseudo-labels are often incorrect at the beginning of training, which misleads the optimization process. Moreover, effectively filtering and refining them remains a critical challenge. In this paper, we propose OWL for unsupervised 3D object detection by occupancy guided warm-up and large-model priors reasoning. OWL first employs an Occupancy Guided Warm-up (OGW) strategy to initialize the backbone weights with spatial perception capabilities, mitigating the interference of incorrect pseudo-labels on network convergence. Furthermore, OWL introduces an Instance-Cued Reasoning (ICR) module that leverages the prior knowledge of large models to assess pseudo-label quality, enabling precise filtering and refinement. Finally, we design a Weight-adapted Self-training (WAS) strategy to dynamically re-weight pseudo-labels, improving the performance through self-training. Extensive experiments on Waymo Open Dataset (WOD) and KITTI demonstrate that OWL outperforms state-of-the-art unsupervised methods by over 15.0% mAP, revealing the effectiveness of our method.
comment: The 40th Annual AAAI Conference on Artificial Intelligence
Physics-Informed Graph Neural Network with Frequency-Aware Learning for Optical Aberration Correction
Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. Code is available at https://github.com/janetkok/ZRNet.
Hyperspectral Unmixing with 3D Convolutional Sparse Coding and Projected Simplex Volume Maximization
Hyperspectral unmixing (HSU) aims to separate each pixel into its constituent endmembers and estimate their corresponding abundance fractions. This work presents an algorithm-unrolling-based network for the HSU task, named the 3D Convolutional Sparse Coding Network (3D-CSCNet), built upon a 3D CSC model. Unlike existing unrolling-based networks, our 3D-CSCNet is designed within the powerful autoencoder (AE) framework. Specifically, to solve the 3D CSC problem, we propose a 3D CSC block (3D-CSCB) derived through deep algorithm unrolling. Given a hyperspectral image (HSI), 3D-CSCNet employs the 3D-CSCB to estimate the abundance matrix. The use of 3D CSC enables joint learning of spectral and spatial relationships in the 3D HSI data cube. The estimated abundance matrix is then passed to the AE decoder to reconstruct the HSI, and the decoder weights are extracted as the endmember matrix. Additionally, we propose a projected simplex volume maximization (PSVM) algorithm for endmember estimation, and the resulting endmembers are used to initialize the decoder weights of 3D-CSCNet. Extensive experiments on three real datasets and one simulated dataset with three different signal-to-noise ratio (SNR) levels demonstrate that our 3D-CSCNet outperforms state-of-the-art methods.
InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem
Recent approaches to controllable 4D video generation often rely on fine-tuning pre-trained Video Diffusion Models (VDMs). This dominant paradigm is computationally expensive, requiring large-scale datasets and architectural modifications, and frequently suffers from catastrophic forgetting of the model's original generative priors. Here, we propose InverseCrafter, an efficient inpainting inverse solver that reformulates the 4D generation task as an inpainting problem solved in the latent space. The core of our method is a principled mechanism to encode the pixel space degradation operator into a continuous, multi-channel latent mask, thereby bypassing the costly bottleneck of repeated VAE operations and backpropagation. InverseCrafter not only achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, but also excels at general-purpose video inpainting with editing. Code is available at https://github.com/yeobinhong/InverseCrafter.
Deep Learning-Based Real-Time Sequential Facial Expression Analysis Using Geometric Features
Facial expression recognition is a crucial component in enhancing human-computer interaction and developing emotion-aware systems. Real-time detection and interpretation of facial expressions have become increasingly important for various applications, from user experience personalization to intelligent surveillance systems. This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. Geometric features, including Euclidean distances and angles, are extracted from these landmarks. Temporal dynamics are incorporated by analyzing feature differences between consecutive frames, enabling the detection of onset, apex, and offset phases of expressions. For classification, a ConvLSTM1D network followed by multilayer perceptron blocks is employed. The method's performance was evaluated on multiple publicly available datasets, including CK+, Oulu-CASIA (VIS and NIR), and MMI. Accuracies of 93%, 79%, 77%, and 68% were achieved respectively. Experiments with composite datasets were also conducted to assess the model's generalization capabilities. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware. This research contributes to the field of facial expression analysis by providing a fast, accurate, and adaptable solution. The findings highlight the potential for further advancements in emotion-aware technologies and personalized user experiences, paving the way for more sophisticated human-computer interaction systems. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: https://github.com/miralab-ai/facial-expression-analysis.
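The feature pipeline in the abstract above (distances and angles per frame, then differences between consecutive frames) can be sketched as follows. The landmark coordinates and the index pairs and triples are hypothetical; the sketch does not call MediaPipe and does not reproduce the authors' feature set.

```python
import math


def distance(p, q):
    """Euclidean distance between two 2D landmarks."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def angle_at(vertex, a, b):
    """Unsigned angle in radians at `vertex` between the rays to a and b."""
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (b[0] - vertex[0], b[1] - vertex[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    inner = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, inner))


def frame_features(landmarks, pairs, triples):
    """Concatenate selected pairwise distances and vertex angles for one frame."""
    feats = [distance(landmarks[i], landmarks[j]) for i, j in pairs]
    feats += [angle_at(landmarks[v], landmarks[a], landmarks[b])
              for v, a, b in triples]
    return feats


def temporal_deltas(sequence):
    """Feature differences between each pair of consecutive frames."""
    return [[cur - prev for prev, cur in zip(f0, f1)]
            for f0, f1 in zip(sequence, sequence[1:])]


pairs, triples = [(0, 1)], [(0, 2, 1)]             # illustrative index choices
frame0 = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
frame1 = [(0.0, 0.0), (6.0, 8.0), (3.0, 0.0)]      # landmark 1 moves outward
features = [frame_features(f, pairs, triples) for f in (frame0, frame1)]
deltas = temporal_deltas(features)
```

The per-frame delta vectors are what a sequence model such as the ConvLSTM1D network mentioned in the abstract would consume.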
Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.
comment: 11 pages, 6 figures. Code available at https://github.com/XD111ds/ILVR
LeAD-M3D: Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection
Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy. We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions. Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Our results demonstrate that high fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable - without LiDAR, stereo, or geometric assumptions.
Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective
The proliferation of AI-generated imagery poses escalating challenges for multimedia forensics, yet many existing detectors depend on assumptions about the internals of specific generative models, limiting their cross-model applicability. We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata -- specifically exchangeable image file format (EXIF) tags -- to learn features intrinsic to digital photography. Our pretext task trains a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags (e.g., camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (e.g., focal length and aperture value). Using these EXIF-induced features, we first perform one-class detection by modeling the distribution of photographic images with a Gaussian mixture model and flagging low-likelihood samples as AI-generated. We then extend to binary detection that treats the learned extractor as a strong regularizer for a classifier of the same architecture, operating on high-frequency residuals from spatially scrambled patches. Extensive experiments across various generative models demonstrate that our EXIF-induced detectors substantially advance the state of the art, delivering strong generalization to in-the-wild samples and robustness to common benign image perturbations.
Experts-Guided Unbalanced Optimal Transport for ISP Learning from Unpaired and/or Paired Data
Learned Image Signal Processing (ISP) pipelines offer powerful end-to-end performance but are critically dependent on large-scale paired raw-to-sRGB datasets. This reliance on costly-to-acquire paired data remains a significant bottleneck. To address this challenge, we introduce a novel, unsupervised training framework based on Optimal Transport capable of training arbitrary ISP architectures in both unpaired and paired modes. We are the first to successfully apply Unbalanced Optimal Transport (UOT) for this complex, cross-domain translation task. Our UOT-based framework provides robustness to outliers in the target sRGB data, allowing it to discount atypical samples that would be prohibitively costly to map. A key component of our framework is a novel "committee of expert discriminators," a hybrid adversarial regularizer. This committee guides the optimal transport mapping by providing specialized, targeted gradients to correct specific ISP failure modes, including color fidelity, structural artifacts, and frequency-domain realism. To demonstrate the superiority of our approach, we retrained existing state-of-the-art ISP architectures using our paired and unpaired setups. Our experiments show that while our framework, when trained in paired mode, exceeds the performance of the original paired methods across all metrics, our unpaired mode concurrently achieves quantitative and qualitative performance that rivals, and in some cases surpasses, the original paired-trained counterparts. The code and pre-trained models are available at: https://github.com/gosha20777/EGUOT-ISP.git.
DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model
Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce--making standard episodic methods unreliable and computationally demanding at test time. To address these constraints, we propose DistillFSS, a framework that embeds support-set knowledge directly into a model's parameters through a teacher--student distillation process. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher-driven specialization. Combined with fine-tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD-FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains. The code is available at https://github.com/pasqualedem/DistillFSS.
NormalView: sensor-agnostic tree species classification from backpack and aerial lidar data using geometric projections
Laser scanning has proven to be an invaluable tool in assessing the decomposition of forest environments. Mobile laser scanning (MLS) has been shown to be highly promising for extremely accurate, tree-level inventory. In this study, we present NormalView, a sensor-agnostic projection-based deep learning method for classifying tree species from point cloud data. NormalView embeds local geometric information into two-dimensional projections, in the form of normal vector estimates, and uses the projections as inputs to an image classification network, YOLOv11. In addition, we inspected the effect of multispectral radiometric intensity information on classification performance. We trained and tested our model on high-density MLS data (7 species, ~5000 pts/m^2), as well as high-density airborne laser scanning (ALS) data (9 species, >1000 pts/m^2). On the MLS data, NormalView achieves an overall accuracy (macro-average accuracy) of 95.5 % (94.8 %), and 91.8 % (79.1 %) on the ALS data. We found that having intensity information from multiple scanners provides benefits in tree species classification, and the best model on the multispectral ALS dataset was a model using intensity information from all three channels of the multispectral ALS. This study demonstrates that projection-based methods, when enhanced with geometric information and coupled with state-of-the-art image classification backbones, can achieve exceptional results. Crucially, these methods are sensor-agnostic, relying only on geometric information. Additionally, we publicly release the MLS dataset used in the study.
comment: 19 pages, 8 figures
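The abstract above reports both overall and macro-average accuracy, and the gap between the two on the ALS data (91.8 % vs. 79.1 %) is easier to read with the definitions in hand. The sketch below assumes the common convention that macro-average accuracy is the unweighted mean of per-class recall; the abstract does not state the exact definition, and the labels are invented.

```python
from collections import Counter


def overall_and_macro_accuracy(y_true, y_pred):
    """Return (overall accuracy, unweighted mean of per-class recall)."""
    correct, total = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    overall = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, macro


# Imbalanced toy example: the rare class is misclassified half the time.
y_true = ["pine"] * 8 + ["birch"] * 2
y_pred = ["pine"] * 8 + ["pine", "birch"]
overall, macro = overall_and_macro_accuracy(y_true, y_pred)
```

Under this definition, a macro-average well below the overall accuracy indicates that errors concentrate in the less frequent species.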
Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction
Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on the ASE and Structured3D benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only $\sim7.5\%$ additional parameters.
comment: 10 pages, 8 figures
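The interaction between multi-token prediction and confidence-based filtering in the abstract above can be sketched with a toy decoder. It assumes a plain threshold on per-token confidence and a hypothetical `propose` callable that returns several candidate tokens per step; the paper's improved scoring mechanism is not described in the abstract and is not reproduced here.

```python
def accept_confident_prefix(tokens, confidences, threshold):
    """Keep the first token, then later tokens until one falls below threshold."""
    accepted = [tokens[0]]  # the first token is the ordinary next-token output
    for token, confidence in zip(tokens[1:], confidences[1:]):
        if confidence < threshold:
            break
        accepted.append(token)
    return accepted


def decode(propose, max_len, threshold):
    """Decode up to max_len tokens; return the tokens and decoder steps used."""
    output, steps = [], 0
    while len(output) < max_len:
        tokens, confidences = propose(output)
        output.extend(accept_confident_prefix(tokens, confidences, threshold))
        steps += 1
    return output[:max_len], steps


# Toy proposer (assumption): always predicts the next three target tokens.
target = list(range(10))

def propose(prefix):
    tokens = target[len(prefix):len(prefix) + 3]
    return tokens, [1.0, 0.9, 0.4][:len(tokens)]

strict_tokens, strict_steps = decode(propose, max_len=10, threshold=0.95)
medium_tokens, medium_steps = decode(propose, max_len=10, threshold=0.5)
loose_tokens, loose_steps = decode(propose, max_len=10, threshold=0.3)
```

Lowering the threshold accepts more tokens per step and so reduces the number of decoder iterations, at the risk of keeping unreliable tokens when the proposer is imperfect.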
Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer 3DV 2026
We present a novel method for generating 3D garment deformations from given body poses, which is key to a wide range of applications, including virtual try-on and extended reality. To simplify the cloth dynamics, existing methods mostly rely on linear blend skinning to obtain low-frequency posed garment shape and only regress high-frequency wrinkles. However, due to the lack of explicit skinning supervision, such skinning-based approaches often produce misaligned shapes when posing the garment, which in turn corrupts the high-frequency signals and prevents the recovery of high-fidelity wrinkles. To tackle this issue, we propose a skinning-free approach by independently estimating posed (i) vertex position for low-frequency posed garment shape, and (ii) vertex normal for high-frequency local wrinkle details. In this way, each frequency modality can be effectively decoupled and directly supervised by the geometry of the deformed garment. To further improve the visual quality of animation, we propose to encode both vertex attributes as rendered texture images, so that 3D garment deformation can be equivalently achieved via 2D image transfer. This enables us to leverage powerful pretrained image models to recover fine-grained visual details in wrinkles, while maintaining superior scalability for garments of diverse topologies without relying on manual UV partition. Finally, we propose a multimodal fusion to incorporate constraints from both frequency modalities and robustly recover deformed 3D garments from transferred images. Extensive experiments show that our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.
comment: Accepted to 3DV 2026
MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging
Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration, without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.
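The matching step in the abstract above (voxel descriptors compared by cosine similarity, with an optional local-search prior) can be sketched as a nearest-descriptor lookup. The descriptors here are tiny hand-written vectors rather than diffusion features, and modelling the local-search prior as a Chebyshev-distance window around the query position is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two descriptors; 0.0 if either has zero norm."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return inner / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(query_desc, target_descs, query_pos=None, radius=None):
    """Return (voxel, similarity) of the most similar target descriptor.

    If query_pos and radius are given, only voxels within that window
    (Chebyshev distance) are considered.
    """
    best_voxel, best_sim = None, -2.0
    for voxel, desc in target_descs.items():
        if query_pos is not None and radius is not None:
            if max(abs(a - b) for a, b in zip(voxel, query_pos)) > radius:
                continue
        sim = cosine(query_desc, desc)
        if sim > best_sim:
            best_voxel, best_sim = voxel, sim
    return best_voxel, best_sim


targets = {
    (0, 0, 0): [1.0, 0.0, 0.0],
    (1, 0, 0): [0.6, 0.8, 0.0],
    (5, 5, 5): [0.9, 0.1, 0.0],   # most similar descriptor, but far away
}
query = [0.9, 0.1, 0.0]
global_voxel, global_sim = best_match(query, targets)
local_voxel, local_sim = best_match(query, targets, query_pos=(0, 0, 0), radius=1)
```

The example shows why a local prior can matter: the globally best descriptor match may lie far from where the anatomy could plausibly have moved.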
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency
Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters appearing across 10,000 illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a Human-in-the-Loop (HiL) pipeline that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A decoupled control scheme separates persistent identity from transient attributes (pose and expression), while a Quality-Gated loop integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.
Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models
Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.
comment: 6 pages, 6 figures
Ideal Observer for Segmentation of Dead Leaves Images
The human visual environment is composed of different surfaces that are distributed in space. The parts of a scene that are visible at any one time are governed by the occlusion of overlapping objects. In this work we consider "dead leaves" models, which replicate these occlusions when generating images by layering objects on top of each other. A dead leaves model is a generative model composed of distributions for object position, shape, color and texture. An image is generated from a dead leaves model by sampling objects ("leaves") from these distributions until a stopping criterion is reached, usually when the image is fully covered or when a given number of leaves has been sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels based on independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations for the computation of the posterior probability and describe factors that determine the feasibility of applying this computation in practice. The dead leaves image model and the associated ideal observer can be applied to study segmentation decisions in a limited number of pixels, providing a principled upper bound on performance, to which humans and vision algorithms could be compared.
comment: 41 pages, 16 figures
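The generative procedure described in this abstract (sample leaves until the image is fully covered) can be sketched as follows. This is a toy illustration rather than the paper's model: circular leaves, uniform position and radius distributions, a small discrete color set, and all parameter names are assumptions. Leaves are sampled front to back, so each new leaf fills only pixels that are still uncovered, which is one common way to simulate such a stack.

```python
import random


def dead_leaves(size=32, r_min=4, r_max=8, n_colors=5, seed=0, max_leaves=10000):
    """Sample a toy dead leaves image made of circular leaves.

    Returns (color, label, n_leaves). label[y][x] is the index of the leaf
    visible at that pixel, i.e. the ground-truth partition of the image.
    Sampling stops once every pixel is covered; max_leaves is a safety cap.
    """
    rng = random.Random(seed)
    color = [[None] * size for _ in range(size)]
    label = [[None] * size for _ in range(size)]
    uncovered = size * size
    n_leaves = 0
    while uncovered > 0 and n_leaves < max_leaves:
        # Assumed leaf distributions: uniform centre, uniform radius, uniform color.
        cx, cy = rng.uniform(0, size), rng.uniform(0, size)
        r = rng.uniform(r_min, r_max)
        c = rng.randrange(n_colors)
        for y in range(size):
            for x in range(size):
                inside = (x - cx) ** 2 + (y - cy) ** 2 <= r * r
                if inside and label[y][x] is None:
                    # Earlier leaves are in front, so only empty pixels are filled.
                    label[y][x] = n_leaves
                    color[y][x] = c
                    uncovered -= 1
        n_leaves += 1
    return color, label, n_leaves


color, label, n_leaves = dead_leaves()
```

The `label` array is the kind of pixel partition an ideal observer would have to infer from `color` alone; two neighbouring leaves that happen to share a color are exactly the ambiguous cases.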
See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors
Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and proposes depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled pretrained visual feature and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 auto segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10--20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
comment: The first two authors contributed equally
VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation
Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce the dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at https://github.com/LUNAProject22/VOST.
DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). First, a dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Second, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The code for our experiments is available at https://github.com/ultramarineX/DashFusion.
comment: Accepted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2025
Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (Qwen, VideoLLaVA, GPT-4o, and Gemini, etc.) reveal that existing models struggle to "show what they know" and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at https://github.com/LUNAProject22/Know-Show.
Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm
While large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. To fill this gap, we systematically introduce the frozen representations from VFMs into the SIRST task for the first time and propose a Foundation-Driven Efficient Paradigm (FDEP), which can seamlessly adapt to existing encoder-decoder-based methods and significantly improve accuracy without additional inference overhead. Specifically, a Semantic Alignment Modulation Fusion (SAMF) module is designed to achieve dynamic alignment and deep fusion of the global semantic priors from VFMs with task-specific features. Meanwhile, to avoid the inference time burden introduced by VFMs, we propose a Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy, which enables implicit semantic transfer between the main and lightweight branches through parameter sharing and synchronized backpropagation. In addition, to unify the fragmented evaluation system, we construct a Holistic SIRST Evaluation (HSE) metric that performs multi-threshold integral evaluation at both pixel-level confidence and target-level robustness, providing a stable and comprehensive basis for fair model comparison. Extensive experiments demonstrate that the SIRST detection networks equipped with our FDEP framework achieve state-of-the-art (SOTA) performance on multiple public datasets. Our code is available at https://github.com/YuChuang1205/FDEP-Framework
Decoding with Structured Awareness: Integrating Directional, Frequency-Spatial, and Structural Attention for Medical Image Segmentation AAAI 2026
To address the limitations of Transformer decoders in capturing edge details, recognizing local textures and modeling spatial continuity, this paper proposes a novel decoder framework specifically designed for medical image segmentation, comprising three core modules. First, the Adaptive Cross-Fusion Attention (ACFA) module integrates channel feature enhancement with spatial attention mechanisms and introduces learnable guidance in three directions (planar, horizontal, and vertical) to enhance responsiveness to key regions and structural orientations. Second, the Triple Feature Fusion Attention (TFFA) module fuses features from Spatial, Fourier and Wavelet domains, achieving joint frequency-spatial representation that strengthens global dependency and structural modeling while preserving local information such as edges and textures, making it particularly effective in complex and blurred boundary scenarios. Finally, the Structural-aware Multi-scale Masking Module (SMMM) optimizes the skip connections between encoder and decoder by leveraging multi-scale context and structural saliency filtering, effectively reducing feature redundancy and improving semantic interaction quality. Working synergistically, these modules not only address the shortcomings of traditional decoders but also significantly enhance performance in high-precision tasks such as tumor segmentation and organ boundary extraction, improving both segmentation accuracy and model generalization. Experimental results demonstrate that this framework provides an efficient and practical solution for medical image segmentation.
comment: Accepted to AAAI 2026
WaterWave: Bridging Underwater Image Enhancement into Video Streams via Wavelet-based Temporal Consistency Field
Paired underwater videos are difficult to obtain because of the complexity of underwater imaging. As a result, most existing underwater video enhancement methods apply a single-image enhancement model frame by frame, which naturally leads to a lack of temporal consistency. To alleviate this problem, we rethink the temporal manifold inherent in natural videos and observe a temporal consistency prior in dynamic scenes from the local temporal frequency perspective. Building on this prior and working without paired data, we propose an implicit representation for enhanced video signals, realized in a wavelet-based temporal consistency field, WaterWave. Specifically, under the constraints of the prior, we progressively filter and attenuate the inconsistent components while preserving motion details and scenes, achieving a natural-flowing video. Furthermore, to represent temporal frequency bands more accurately, an underwater flow correction module is designed to rectify estimated flows by accounting for transmission in underwater scenes. Extensive experiments demonstrate that WaterWave significantly enhances the quality of videos generated using single-image underwater enhancement methods. Additionally, our method shows high potential in downstream underwater tracking tasks, such as UOSTrack and MAT, outperforming the original video by a large margin, i.e., by 19.7% and 9.7% in precision, respectively.
Concept-based Explainable Data Mining with VLM for 3D Detection
Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.
comment: 28 pages including appendix. Code: https://github.com/mm1129/concept_based_rare_detector_2025
UniFS: Unified Multi-Contrast MRI Reconstruction via Frequency-Spatial Fusion
Recently, Multi-Contrast MR Reconstruction (MCMR) has emerged as a hot research topic that leverages high-quality auxiliary modalities to reconstruct undersampled target modalities of interest. However, existing methods often struggle to generalize across different k-space undersampling patterns, requiring the training of a separate model for each specific pattern, which limits their practical applicability. To address this challenge, we propose UniFS, a Unified Frequency-Spatial Fusion model designed to handle multiple k-space undersampling patterns for MCMR tasks without any need for retraining. UniFS integrates three key modules: a Cross-Modal Frequency Fusion module, an Adaptive Mask-Based Prompt Learning module, and a Dual-Branch Complementary Refinement module. These modules work together to extract domain-invariant features from diverse k-space undersampling patterns while dynamically adapting to their individual variations. Another limitation of existing MCMR methods is their tendency to focus solely on spatial information while neglecting frequency characteristics, or to extract only shallow frequency features, thus failing to fully leverage complementary cross-modal frequency information. To relieve this issue, UniFS introduces an adaptive prompt-guided frequency fusion module for k-space learning, significantly enhancing the model's generalization performance. We evaluate our model on the BraTS and HCP datasets with various k-space undersampling patterns and acceleration factors, including previously unseen patterns, to comprehensively assess UniFS's generalizability. Experimental results across multiple scenarios demonstrate that UniFS achieves state-of-the-art performance. Our code is available at https://github.com/LIKP0/UniFS.
EmoStyle: Emotion-Driven Image Stylization
Art has long been a profound medium for expressing emotions. While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles. To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content. We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion-style mapping. First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS. We then propose an Emotion-Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries. Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries. Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency. Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications. Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.
University Building Recognition Dataset in Thailand for the mission-oriented IoT sensor system
Many industrial sectors have been using machine learning in inference mode on edge devices. Future directions suggest that training on edge devices is promising due to improvements in semiconductor performance. Wireless Ad Hoc Federated Learning (WAFL) has been proposed as a promising approach for collaborative learning with device-to-device communication among edge devices. In particular, WAFL with Vision Transformer (WAFL-ViT) has been tested on image recognition tasks with the UTokyo Building Recognition Dataset (UTBR). Since WAFL-ViT is a mission-oriented sensor system, it is essential to construct a specific dataset for each mission. In our work, we have developed the Chulalongkorn University Building Recognition Dataset (CUBR), which is specialized for Chulalongkorn University as a case study in Thailand. Additionally, our results demonstrate that training in WAFL scenarios achieves better accuracy than in self-training scenarios. The dataset is available at https://github.com/jo2lxq/wafl/.
TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression
Building on the success of 3D Gaussian Splatting (3DGS) in static 3D scene representation, its extension to dynamic scenes, commonly referred to as 4DGS or dynamic 3DGS, has attracted increasing attention. However, designing more compact and efficient deformation schemes together with rate-distortion-optimized compression strategies for dynamic 3DGS representations remains an underexplored area. Prior methods either rely on space-time 4DGS with overspecified, short-lived Gaussian primitives or on canonical 3DGS with deformation that lacks explicit temporal control. To address this, we present TED-4DGS, a temporally activated and embedding-based deformation scheme for rate-distortion-optimized 4DGS compression that unifies the strengths of both families. TED-4DGS is built on a sparse anchor-based 3DGS representation. Each canonical anchor is assigned learnable temporal-activation parameters to specify its appearance and disappearance transitions over time, while a lightweight per-anchor temporal embedding queries a shared deformation bank to produce anchor-specific deformation. For rate-distortion compression, we incorporate an implicit neural representation (INR)-based hyperprior to model anchor attribute distributions, along with a channel-wise autoregressive model to capture intra-anchor correlations. With these novel elements, our scheme achieves state-of-the-art rate-distortion performance on several real-world datasets. To the best of our knowledge, this work represents one of the first attempts to pursue a rate-distortion-optimized compression framework for dynamic 3DGS representations.
EXR: An Interactive Immersive EHR Visualization in Extended Reality
This paper presents the design and implementation of an Extended Reality (XR) platform for immersive, interactive visualization of Electronic Health Records (EHRs). The system extends beyond conventional 2D interfaces by visualizing both structured and unstructured patient data into a shared 3D environment, enabling intuitive exploration and real-time collaboration. The modular infrastructure integrates FHIR-based EHR data with volumetric medical imaging and AI-generated segmentation, ensuring interoperability with modern healthcare systems. The platform's capabilities are demonstrated using synthetic EHR datasets and computed tomography (CT)-derived spine models processed through an AI-powered segmentation pipeline. This work suggests that such integrated XR solutions could form the foundation for next-generation clinical decision-support tools, where advanced data infrastructures are directly accessible in an interactive and spatially rich environment.
comment: 11 pages, 6 figures. Preprint version. This paper has been accepted to IEEE ICIR 2025. This is the author-prepared version and not the final published version. The final version will appear in IEEE Xplo
ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to balance sufficient interaction with flexible implementation because of the large difference between the two representations. Considering the abundant, hierarchical information in a VLM's layers, from low-level details to high-level semantics, we propose ParaUni. It extracts features from various VLM layers in a Parallel way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in a Unified multimodal model. Concretely, visual features from all of the VLM's layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that uses RL to facilitate improvements on multiple rewards by aligning with the hierarchical properties of these layers. Extensive experiments show that ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for advancing multiple rewards during RL stages. Code is available at https://github.com/JosephTiTan/ParaUni.
Performance Evaluation of Deep Learning for Tree Branch Segmentation in Autonomous Forestry Systems
UAV-based autonomous forestry operations require rapid and precise tree branch segmentation for safe navigation and automated pruning across varying pixel resolutions and operational conditions. We evaluate different deep learning methods at three resolutions (256x256, 512x512, 1024x1024) using the Urban Street Tree Dataset, employing standard metrics (IoU, Dice) and specialized measures including Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). Among 22 configurations tested, U-Net with MiT-B4 backbone achieves strong performance at 256x256. At 512x512, MiT-B4 leads in IoU, Dice, TS-IoU, and Boundary-F1. At 1024x1024, U-Net+MiT-B3 shows the best validation performance for IoU/Dice and precision, while U-Net++ excels in boundary quality. PSPNet provides the most efficient option (2.36/9.43/37.74 GFLOPs) with 25.7/19.6/11.8 percentage point IoU reductions compared to top performers at respective resolutions. These results establish multi-resolution benchmarks for accuracy-efficiency trade-offs in embedded forestry systems. Implementation is available at https://github.com/BennyLinntu/PerformanceTreeBranchSegmentation.
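For readers unfamiliar with the two standard metrics named in this abstract, the sketch below computes IoU and Dice for a pair of binary masks. The function name, the flat 0/1 mask layout, and the convention for two empty masks are assumptions; the paper's specialized measures (TS-IoU, CPR) are not reproduced here because the abstract does not define them.

```python
def iou_and_dice(pred, target):
    """Compute (IoU, Dice) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p and t)      # true positives
    fp = sum(1 for p, t in zip(pred, target) if p and not t)  # false positives
    fn = sum(1 for p, t in zip(pred, target) if not p and t)  # false negatives
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; scoring this as a perfect match is a convention.
        return 1.0, 1.0
    iou = tp / union
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


# Toy 6-pixel masks: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
iou, dice = iou_and_dice(pred, target)
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank single predictions identically and differ only in how strongly they penalize partial overlap.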
Moving object detection from multi-depth images with an attention-enhanced CNN
One of the greatest challenges for detecting moving objects in the solar system from wide-field survey data is determining whether a signal indicates a true object or is due to some other source, like noise. Object verification has relied heavily on human eyes, which usually results in significant labor costs. In order to address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an AUC (area under the curve) of >0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.
comment: 14 pages, 22 figures, submitted to PASJ
YOLO and SGBM Integration for Autonomous Tree Branch Detection and Depth Estimation in Radiata Pine Pruning Applications
Manual pruning of radiata pine trees poses significant safety risks due to extreme working heights and challenging terrain. This paper presents a computer vision framework that integrates YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision for autonomous drone-based pruning operations. Our system achieves precise branch detection and depth estimation using only stereo camera input, eliminating the need for expensive LiDAR sensors. Experimental evaluation demonstrates YOLO's superior performance over Mask R-CNN, achieving 82.0% mAPmask50-95 for branch segmentation. The integrated system accurately localizes branches within a 2 m operational range, with processing times under one second per frame. These results establish the feasibility of cost-effective autonomous pruning systems that enhance worker safety and operational efficiency in commercial forestry.
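The depth-estimation step in a stereo pipeline like the one described here rests on the standard pinhole relation for a rectified pair, Z = f·B/d. The sketch below applies it to disparities inside a detected branch region. It is an illustration only: the function names, the use of a median to summarize a region, and the camera values (700 px focal length, 0.12 m baseline) are hypothetical and not taken from the paper.

```python
import statistics


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d.

    Returns None for non-positive disparities, which carry no depth estimate.
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


def region_depth(disparities, focal_px, baseline_m):
    """Median depth over the valid disparities inside a detected region."""
    depths = [
        z
        for z in (depth_from_disparity(d, focal_px, baseline_m) for d in disparities)
        if z is not None
    ]
    return statistics.median(depths) if depths else None


# Hypothetical camera and disparity samples from inside one detection box.
FOCAL_PX, BASELINE_M = 700.0, 0.12
single = depth_from_disparity(42.0, FOCAL_PX, BASELINE_M)
region = region_depth([40.0, 42.0, 0.0, 44.0, -1.0], FOCAL_PX, BASELINE_M)
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more accuracy at long range, which is consistent with restricting such a system to a short operational range.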
Genetic Algorithms For Parameter Optimization for Disparity Map Generation of Radiata Pine Branch Images
Traditional stereo matching algorithms like Semi-Global Block Matching (SGBM) with Weighted Least Squares (WLS) filtering offer speed advantages over neural networks for UAV applications, generating disparity maps in approximately 0.5 seconds per frame. However, these algorithms require meticulous parameter tuning. We propose a Genetic Algorithm (GA) based parameter optimization framework that systematically searches for optimal parameter configurations for SGBM and WLS, enabling UAVs to measure distances to tree branches with enhanced precision while maintaining processing efficiency. Our contributions include: (1) a novel GA-based parameter optimization framework that eliminates manual tuning; (2) a comprehensive evaluation methodology using multiple image quality metrics; and (3) a practical solution for resource-constrained UAV systems. Experimental results demonstrate that our GA-optimized approach reduces Mean Squared Error by 42.86% while increasing Peak Signal-to-Noise Ratio and Structural Similarity by 8.47% and 28.52%, respectively, compared with baseline configurations. Furthermore, our approach demonstrates superior generalization performance across varied imaging conditions, which is critical for real-world forestry applications.
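The general shape of a GA-based parameter search can be sketched in a few lines. This is a generic elitist genetic algorithm under stated assumptions, not the authors' framework: the operators (uniform crossover, random-reset mutation, keeping the better half), the three integer parameters and their bounds, and the `toy_error` objective are all hypothetical. In a real setup, the objective would run SGBM and WLS with the candidate parameters and score the resulting disparity map.

```python
import random


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Minimise fitness over integer parameter vectors with an elitist GA.

    Returns (best_individual, history), where history holds the best
    fitness seen at the start of each generation plus the final best.
    """
    rng = random.Random(seed)

    def random_individual():
        return [rng.randint(lo, hi) for lo, hi in bounds]

    def crossover(a, b):
        # Uniform crossover: each gene comes from one of the two parents.
        return [rng.choice(pair) for pair in zip(a, b)]

    def mutate(ind):
        # Random-reset mutation: resample a gene within its bounds.
        return [rng.randint(lo, hi) if rng.random() < mutation_rate else g
                for g, (lo, hi) in zip(ind, bounds)]

    population = [random_individual() for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # elitism: the better half survives
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    best = min(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical stand-in for a disparity-map error: squared distance to an
# arbitrary "good" setting of three integer matcher parameters.
TARGET = [64, 9, 10]
BOUNDS = [(16, 128), (3, 21), (0, 30)]


def toy_error(params):
    return sum((p - t) ** 2 for p, t in zip(params, TARGET))


best, history = genetic_search(toy_error, BOUNDS)
```

Because the better half of each generation is carried over unchanged, the best error in `history` can never get worse from one generation to the next, which is a convenient sanity check when swapping in an expensive image-based objective.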
The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos
Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure from motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. The Dynamic Prior can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that the Dynamic Prior not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.
comment: Code is available at https://github.com/wuzy2115/DYNAPO
Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a $3\times$ speedup in text-to-video generation convergence and a 10\% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models
Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.
comment: 20 pages
ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration
Video Large Language Models (VLLMs) face the challenge of high computational load during the pre-filling stage due to the processing of an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, trials at early decoder layers often result in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
PoolNet: Deep Learning for 2D to 3D Video Process Validation
Lifting Structure-from-Motion (SfM) information from sequential and non-sequential image data is a time-consuming and computationally expensive task. In addition, the majority of publicly available data is unfit for processing due to inadequate camera pose variation, obscuring scene elements, and noisy data. To solve this problem, we introduce PoolNet, a versatile deep learning framework for frame-level and scene-level validation of in-the-wild data. We demonstrate that our model successfully differentiates SfM-ready scenes from those unfit for processing while significantly undercutting the time state-of-the-art algorithms take to obtain structure-from-motion data.
comment: All code related to this paper can be found at https://github.com/sanchitkaul/PoolNet.git
Group Orthogonal Low-Rank Adaptation for RGB-T Tracking AAAI 2026
Parameter-efficient fine-tuning has emerged as a promising paradigm in RGB-T tracking, enabling downstream task adaptation by freezing pretrained parameters and fine-tuning only a small set of parameters. This set forms a rank space made up of multiple individual ranks, whose expressiveness directly shapes the model's adaptability. However, quantitative analysis reveals low-rank adaptation exhibits significant redundancy in the rank space, with many ranks contributing almost no practical information. This hinders the model's ability to learn more diverse knowledge to address the various challenges in RGB-T tracking. To address this issue, we propose the Group Orthogonal Low-Rank Adaptation (GOLA) framework for RGB-T tracking, which effectively leverages the rank space through structured parameter learning. Specifically, we adopt a rank decomposition partitioning strategy utilizing singular value decomposition to quantify rank importance, freeze crucial ranks to preserve the pretrained priors, and cluster the redundant ranks into groups to prepare for subsequent orthogonal constraints. We further design an inter-group orthogonal constraint strategy. This constraint enforces orthogonality between rank groups, compelling them to learn complementary features that target diverse challenges, thereby alleviating information redundancy. Experimental results demonstrate that GOLA effectively reduces parameter redundancy and enhances feature representation capabilities, significantly outperforming state-of-the-art methods across four benchmark datasets and validating its effectiveness in RGB-T tracking tasks.
comment: 13 pages, 8 figures. Accepted by AAAI 2026. Extended version
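The inter-group orthogonal constraint mentioned in the GOLA abstract can be illustrated with a small sketch. The abstract does not give the exact form of the constraint, so the penalty below (the sum of squared inner products between rank vectors from different groups) is an assumed, commonly used formulation, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def inter_group_orth_penalty(groups):
    """Sum of squared inner products between vectors of different groups.

    `groups` is a list of groups; each group is a list of rank vectors.
    The penalty is zero exactly when every vector is orthogonal to every
    vector in every other group. Vectors within a group are not constrained.
    """
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for u in groups[i]:
                for v in groups[j]:
                    total += dot(u, v) ** 2
    return total
```

For example, `inter_group_orth_penalty([[[1, 0, 0]], [[0, 1, 0], [0, 0, 1]]])` is zero because the two groups span orthogonal directions, while overlapping groups produce a positive value that a training loss could penalize.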
SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training
The rise of 3D Gaussian Splatting has revolutionized photorealistic 3D asset creation, yet a critical gap remains for their interactive refinement and editing. Existing approaches based on diffusion or optimization are ill-suited for this task, as they are often prohibitively slow, destructive to the original asset's identity, or lack the precision for fine-grained control. To address this, we introduce SplatPainter, a state-aware feedforward model that enables continuous editing of 3D Gaussian assets from user-provided 2D view(s). Our method directly predicts updates to the attributes of a compact, feature-rich Gaussian representation and leverages Test-Time Training to create a state-aware, iterative workflow. The versatility of our approach allows a single architecture to perform diverse tasks, including high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds, paving the way for fluid and intuitive 3D content authoring.
comment: project page https://y-zheng18.github.io/SplatPainter/
SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling
Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are cumbersome to edit. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern pre-trained generative models without requiring any additional training. A controllable parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets, facilitating practical deployment in creative workflows. Find our project page at https://spacecontrol3d.github.io/
comment: Project page: https://spacecontrol3d.github.io/
iMotion-LLM: Instruction-Conditioned Trajectory Generation
We introduce iMotion-LLM, a large language model (LLM) integrated with trajectory prediction modules for interactive motion generation. Unlike conventional approaches, it generates feasible, safety-aligned trajectories based on textual instructions, enabling adaptable and context-aware driving behavior. It combines an encoder-decoder multimodal trajectory prediction model with a pre-trained LLM fine-tuned using LoRA, projecting scene features into the LLM input space and mapping special tokens to a trajectory decoder for text-based interaction and interpretable driving. To support this framework, we introduce two datasets: 1) InstructWaymo, an extension of the Waymo Open Motion Dataset with direction-based motion instructions, and 2) Open-Vocabulary InstructNuPlan, which features safety-aligned instruction-caption pairs and corresponding safe trajectory scenarios. Our experiments validate that instruction conditioning enables trajectory generation that follows the intended condition. iMotion-LLM demonstrates strong contextual comprehension, achieving 84% average accuracy in direction feasibility detection and 96% average accuracy in safety evaluation of open-vocabulary instructions. This work lays the foundation for text-guided motion generation in autonomous driving, supporting simulated data generation, model interpretability, and robust safety alignment testing for trajectory generation models. Our code, pre-trained model, and datasets are available at: https://vision-cair.github.io/iMotion-LLM/.
Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning
In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and often only partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale high-fidelity biomedical dataset comprising 18 million image-text pairs, spanning radiology, microscopy, and visible light photography. We train vision-language models on our dataset and perform extensive evaluation on 6 retrieval and 19 zero-shot classification tasks across three major modalities. The models trained on our dataset set new state-of-the-art results in medical representation learning. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.
comment: 21 pages
Joint Self-Supervised Video Alignment and Action Segmentation ICCV 2025
We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and reduces both time and memory consumption compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation, respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model. Our code is available on our research website: https://retrocausal.ai/research/.
comment: Accepted to ICCV 2025
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding ICLR 2026
Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as consistent within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism that explains how external cues enhance multimodal binding and offer both interpretability and practical improvements.
comment: Under review as a conference paper at ICLR 2026
VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming
Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its "style" to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.
comment: To be published in TMLR
Variational Supervised Contrastive Learning
Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) Without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO) that replaces exhaustive pair-wise comparisons for efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering results, and transfer-learning assessments; and (3) demonstrates superior performance in few-shot learning than supervised baseline and superior robustness across various augmentation strategies. Our code is available at https://github.com/ziwenwang28/VarContrast.
LM-CartSeg: Automated Segmentation of Lateral and Medial Cartilage and Subchondral Bone for Radiomics Analysis
Background and Objective: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest (ROIs) that jointly capture cartilage and subchondral bone. Most existing work relies on manual ROIs and rarely reports quality control (QC). We present LM-CartSeg, a fully automatic pipeline for cartilage/bone segmentation, geometric lateral/medial (L/M) compartmentalisation and radiomics analysis. Methods: Two 3D nnU-Net models were trained on SKM-TEA (138 knees) and OAIZIB-CM (404 knees). At test time, zero-shot predictions were fused and refined by simple geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Segmentation was evaluated on an OAIZIB-CM test set (103 knees) and on SKI-10 (100 knees). QC used volume and thickness signatures. From 10 ROIs we extracted 4 650 non-shape radiomic features to study inter-compartment similarity, dependence on ROI size, and OA vs. non-OA classification on OAIZIB-CM. Results: Post-processing improved macro ASSD on OAIZIB-CM from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric L/M rule produced stable compartments across datasets, whereas a direct L/M nnU-Net showed domain-dependent side swaps. Only 6 to 12 percent of features per ROI were strongly correlated with volume or thickness. Radiomics-based models outperformed models restricted to size-linked features. Conclusions: LM-CartSeg yields automatic, quality-controlled ROIs and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee OA radiomics studies.
comment: The manuscript represents only a preliminary and substantially incomplete exploration. The author has decided not to stand by these results, and a thoroughly revised and significantly different version will be developed separately. Therefore this version is withdrawn and should not be cited
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments show that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
comment: Project Page: https://emma-umm.github.io/emma/
SAGE: Saliency-Guided Contrastive Embeddings
Integrating human perceptual priors into the training of neural networks has been shown to raise model generalization, serve as an effective regularizer, and align models with human expertise for applications in high-risk domains. Existing approaches to integrate saliency into model training often rely on internal model mechanisms, which recent research suggests may be unreliable. Our insight is that many challenges associated with saliency-guided training stem from the placement of the guidance approaches solely within the image space. Instead, we move away from the image space and use the model's latent space embeddings to steer human guidance during training. We propose SAGE (Saliency-Guided Contrastive Embeddings): a loss function that integrates human saliency into network training using contrastive embeddings. We apply saliency-preserving and saliency-degrading signal augmentations to the input and capture the changes in embeddings and model logits. We guide the model towards salient features and away from non-salient features using a contrastive triplet loss. Additionally, we perform a sanity check on the logit distributions to ensure that the model outputs match the saliency-based augmentations. We demonstrate a boost in classification performance across both open- and closed-set scenarios against SOTA saliency-based methods, showing SAGE's effectiveness across various backbones, and include experiments to suggest its wide generalization across tasks.
comment: 11 pages, 2 figures, 5 tables
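The contrastive triplet loss named in the SAGE abstract can be sketched in its standard margin form. The assignment of roles is an assumption made for illustration: the anchor is taken to be the embedding of the original input, the positive the embedding of a saliency-preserving augmentation, and the negative the embedding of a saliency-degrading augmentation. The function names and the margin value are hypothetical.

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embeddings.

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

With this form, an embedding that already sits next to its saliency-preserving view and far from its saliency-degrading view contributes no gradient, so training effort concentrates on the inputs where the model still attends to non-salient evidence.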
PLANesT-3D: A new annotated dataset for segmentation of 3D plant point clouds
Creation of new annotated public datasets is crucial in helping advances in 3D computer vision and machine learning meet their full potential for automatic interpretation of 3D plant models. Despite the proliferation of deep neural network architectures for segmentation and phenotyping of 3D plant models in the last decade, the amount of data, and diversity in terms of species and data acquisition modalities are far from sufficient for evaluation of such tools for their generalization ability. To contribute to closing this gap, we introduce PLANesT-3D, a new annotated dataset of 3D color point clouds of plants. PLANesT-3D is composed of 34 point cloud models representing 34 real plants from three different plant species: Capsicum annuum, Rosa kordana, and Ribes rubrum. Both semantic labels in terms of "leaf" and "stem", and organ instance labels were manually annotated for the full point clouds. PLANesT-3D introduces diversity to existing datasets by adding point clouds of two new species and providing 3D data acquired with the low-cost SfM/MVS technique as opposed to laser scanning or expensive setups. Point clouds reconstructed with SfM/MVS modality exhibit challenges such as missing data, variable density, and illumination variations. As an additional contribution, SP-LSCnet, a novel semantic segmentation method that is a combination of unsupervised superpoint extraction and a 3D point-based deep learning approach is introduced and evaluated on the new dataset. The advantages of SP-LSCnet over other deep learning methods are its modular structure and increased interpretability. Two existing deep neural network architectures, PointNet++ and RoseSegNet, were also tested on the point clouds of PLANesT-3D for semantic segmentation.
ZQBA: Zero Query Black-box Adversarial Attack
Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at https://github.com/Joana-Cabral/ZQBA.
comment: Accepted in ICAART 2026 Conference
From Correlation to Causation: Max-Pooling-Based Multi-Instance Learning Leads to More Robust Whole Slide Image Classification
In whole slide image (WSI) analysis, attention-based multi-instance learning (MIL) models are susceptible to spurious correlations and degrade under domain shift. These methods may assign high attention weights to non-tumor regions, such as staining biases or artifacts, leading to unreliable tumor region localization. In this paper, we revisit max-pooling-based MIL methods from a causal perspective. Under mild assumptions, our theoretical results demonstrate that max-pooling encourages the model to focus on causal factors while ignoring bias-related factors. Furthermore, we discover that existing max-pooling-based methods may overfit the training set through rote memorization of instance features and fail to learn meaningful patterns. To address these issues, we propose FocusMIL, which couples max-pooling with an instance-level variational information bottleneck (VIB) to learn compact, predictive latent representations, and employs a multi-bag mini-batch scheme to stabilize optimization. We conduct comprehensive experiments on three real-world datasets and one semi-synthetic dataset. The results show that, by capturing causal factors, FocusMIL exhibits significant advantages in out-of-distribution scenarios and instance-level tumor region localization tasks.
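The max-pooling aggregation that this abstract revisits can be shown in a few lines. This is a sketch of the pooling step only, under the assumption that each instance (tile) already has a scalar score; it does not model the variational information bottleneck or the multi-bag mini-batch scheme, and the function name is hypothetical.

```python
def max_pool_bag(instance_scores):
    """Aggregate a bag of per-instance scores by max-pooling.

    Returns the bag-level score and the index of the instance that
    produced it. Only that single instance determines the bag prediction,
    which is what makes the pooled score directly usable for localization.
    """
    best = max(range(len(instance_scores)), key=instance_scores.__getitem__)
    return instance_scores[best], best
```

For a bag with scores `[0.1, 0.9, 0.3]`, the bag score is `0.9` and the returned index `1` points at the tile responsible for it, in contrast to attention pooling, where every tile contributes a weighted share.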
SegAssess: Panoramic quality mapping for robust and transferable unsupervised segmentation assessment
High-quality image segmentation is fundamental to pixel-level geospatial analysis in remote sensing, necessitating robust segmentation quality assessment (SQA), particularly in unsupervised settings lacking ground truth. Although recent deep learning (DL) based unsupervised SQA methods show potential, they often suffer from coarse evaluation granularity, incomplete assessments, and poor transferability. To overcome these limitations, this paper introduces Panoramic Quality Mapping (PQM) as a new paradigm for comprehensive, pixel-wise SQA, and presents SegAssess, a novel deep learning framework realizing this approach. SegAssess distinctively formulates SQA as a fine-grained, four-class panoramic segmentation task, classifying pixels within a segmentation mask under evaluation into true positive (TP), false positive (FP), true negative (TN), and false negative (FN) categories, thereby generating a complete quality map. Leveraging an enhanced Segment Anything Model (SAM) architecture, SegAssess uniquely employs the input mask as a prompt for effective feature integration via cross-attention. Key innovations include an Edge Guided Compaction (EGC) branch with an Aggregated Semantic Filter (ASF) module to refine predictions near challenging object edges, and an Augmented Mixup Sampling (AMS) training strategy integrating multi-source masks to significantly boost cross-domain robustness and zero-shot transferability. Comprehensive experiments demonstrate that SegAssess achieves state-of-the-art (SOTA) performance and exhibits remarkable zero-shot transferability to unseen masks. The code is available at https://github.com/Yangbn97/SegAssess.
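The four-class quality map described in this abstract can be made concrete with a short sketch. It builds the reference map from a binary mask and a binary ground truth, which is the kind of target such a model could be trained against; SegAssess itself predicts the map without ground truth at inference time. The function name and the nested-list mask format are assumptions for illustration.

```python
def quality_map(pred, truth):
    """Label every pixel of a binary mask as TP, FP, TN, or FN.

    `pred` and `truth` are equally sized 2D lists of 0/1 values, where
    `pred` is the mask under evaluation and `truth` is the reference.
    """
    labels = {
        (1, 1): "TP",  # predicted foreground, truly foreground
        (1, 0): "FP",  # predicted foreground, truly background
        (0, 0): "TN",  # predicted background, truly background
        (0, 1): "FN",  # predicted background, truly foreground
    }
    return [
        [labels[(p, t)] for p, t in zip(pred_row, truth_row)]
        for pred_row, truth_row in zip(pred, truth)
    ]
```

Counting the labels in the resulting map recovers the usual confusion-matrix totals, while the map itself keeps the spatial location of each error.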
GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
Speech-driven talking head generation is a critical yet challenging task with applications in augmented reality and virtual human modeling. While recent approaches using autoregressive and diffusion-based models have achieved notable progress, they often suffer from modality inconsistencies, particularly misalignment between audio and mesh, leading to reduced motion diversity and lip-sync accuracy. To address this, we propose GLDiTalker, a novel speech-driven 3D facial animation model based on a Graph Latent Diffusion Transformer. GLDiTalker resolves modality misalignment by diffusing signals within a quantized spatiotemporal latent space. It employs a two-stage training pipeline: the Graph-Enhanced Quantized Space Learning Stage ensures lip-sync accuracy, while the Space-Time Powered Latent Diffusion Stage enhances motion diversity. Together, these stages enable GLDiTalker to generate realistic, temporally stable 3D facial animations. Extensive evaluations on standard benchmarks demonstrate that GLDiTalker outperforms existing methods, achieving superior results in both lip-sync accuracy and motion diversity.
comment: 9 pages, 5 figures
M3DHMR: Monocular 3D Hand Mesh Recovery
Monocular 3D hand mesh recovery is challenging due to the high degrees of freedom of hands, 2D-to-3D ambiguity, and self-occlusion. Most existing methods are either inefficient or less straightforward for predicting the position of 3D mesh vertices. Thus, we propose a new pipeline called Monocular 3D Hand Mesh Recovery (M3DHMR) to directly estimate the positions of hand mesh vertices. M3DHMR provides 2D cues for 3D tasks from a single image and uses a new spiral decoder consisting of several Dynamic Spiral Convolution (DSC) Layers and a Region of Interest (ROI) Layer. On the one hand, DSC Layers adaptively adjust the weights based on the vertex positions and extract the vertex features in both spatial and channel dimensions. On the other hand, the ROI Layer utilizes the physical information and refines mesh vertices in each predefined hand region separately. Extensive experiments on the popular FreiHAND dataset demonstrate that M3DHMR significantly outperforms state-of-the-art real-time methods.
comment: 9 pages, 5 figures
Exploring Ordinal Bias in Action Recognition for Instructional Videos SC
Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
comment: Accepted at SCSL @ ICLR 2025
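The Sequence Shuffling manipulation described above can be illustrated with a minimal sketch. It assumes a video is a flat list of frames and that each action segment is given as a hypothetical `(start, end, label)` tuple with an exclusive end index; the function name and data layout are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical segment layout: (start_frame, end_frame_exclusive, action_label).
Segment = Tuple[int, int, str]


def shuffle_segments(frames: Sequence, segments: List[Segment], seed: int = 0):
    """Reorder whole action segments at random, keeping frames inside each segment in order."""
    rng = random.Random(seed)          # local RNG so the shuffle is reproducible
    order = list(range(len(segments)))
    rng.shuffle(order)

    new_frames, new_labels = [], []
    for idx in order:
        start, end, label = segments[idx]
        new_frames.extend(frames[start:end])        # copy the segment as one block
        new_labels.extend([label] * (end - start))  # one label per frame
    return new_frames, new_labels
```

Shuffling at the segment level changes only the order of actions, so a model that relies on memorized action sequences should degrade while one that reads the frames themselves should not.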
TempoControl: Temporal Attention Guidance for Text-to-Video Models
Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation. Please see our project page for more details: https://shira-schiber.github.io/TempoControl/.
comment: Under Review
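To make the correlation principle above concrete, here is a minimal sketch that scores how well a per-frame attention strength follows a temporal control signal using Pearson correlation. The per-frame scalar attention values, the function names, and the `1 - correlation` loss form are assumptions for illustration; the abstract does not specify the exact objective.

```python
import math
from typing import Sequence


def pearson_correlation(attention: Sequence[float], control: Sequence[float]) -> float:
    """Pearson correlation between per-frame attention and a control signal."""
    n = len(attention)
    if n == 0 or n != len(control):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(attention) / n
    mean_c = sum(control) / n
    cov = sum((a - mean_a) * (c - mean_c) for a, c in zip(attention, control))
    var_a = sum((a - mean_a) ** 2 for a in attention)
    var_c = sum((c - mean_c) ** 2 for c in control)
    if var_a == 0 or var_c == 0:
        return 0.0  # a constant signal carries no temporal pattern to align
    return cov / math.sqrt(var_a * var_c)


def correlation_loss(attention: Sequence[float], control: Sequence[float]) -> float:
    """Assumed loss form: small when attention rises and falls with the control signal."""
    return 1.0 - pearson_correlation(attention, control)
```

With a control signal such as `[0, 0, 1, 1]` (the concept should appear in the second half), attention concentrated in the last two frames yields a lower loss than attention concentrated in the first two.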
Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model
Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via Multimodal Large Language Model (MLLM). Our approach expands the original vocabulary of the MLLM by introducing an activity-level token and multiple cluster-specific tokens. We process video frames alongside two specially designed tokens and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the activity-level token and the cluster-specific tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. Also, we introduce a multi-label classification loss to further enhance the token's ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates MLLM's hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing the performance of GAD. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method in GAD tasks.
comment: This work is being incorporated into a larger study
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.
A Fractional Variational Approach to Spectral Filtering Using the Fourier Transform
The interference of fluorescence signals and noise remains a significant challenge in Raman spectrum analysis, often obscuring subtle spectral features that are critical for accurate analysis. Inspired by variational methods similar to those used in image denoising, our approach minimizes a functional involving fractional derivatives to balance noise suppression with the preservation of essential chemical features of the signal, such as peak position, intensity, and area. The original problem is reformulated in the frequency domain through the Fourier transform, making the implementation simple and fast. In this work, we discuss the theoretical framework, practical implementation, and the advantages and limitations of this method in the context of simulated Raman data, as well as in image processing. The main contribution of this article is the combination of a variational approach in the frequency domain, the use of fractional derivatives, and the optimization of the regularization parameter and derivative order through the concept of Shannon entropy. This work explores how the fractional order, combined with the regularization parameter, affects noise removal and preserves the essential features of the spectrum and image. Finally, the study shows that the combination of the proposed strategies produces an efficient, robust, and easily implementable filter.
comment: 31 pages, 3 figures, 2 tables
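As a worked illustration of why a frequency-domain reformulation makes such a filter simple, consider a generic quadratic functional with a fractional-derivative penalty. The specific functional below, the symbols $f$ (noisy signal), $u$ (filtered signal), $\lambda$ (regularization parameter), and $\alpha$ (fractional order), and the unitary Fourier convention are assumptions chosen for the sketch; the abstract does not state the exact functional used in the paper.

```latex
% Assumed functional: data fidelity plus a fractional-derivative penalty.
J(u) = \int |u(x) - f(x)|^2 \, dx + \lambda \int |D^{\alpha} u(x)|^2 \, dx

% By Plancherel's theorem, and since D^{\alpha} has Fourier symbol (i\omega)^{\alpha}:
J(u) = \int \left( |\hat{u}(\omega) - \hat{f}(\omega)|^2
       + \lambda \, |\omega|^{2\alpha} \, |\hat{u}(\omega)|^2 \right) d\omega

% The integrand can be minimized independently at each frequency:
(\hat{u}(\omega) - \hat{f}(\omega)) + \lambda \, |\omega|^{2\alpha} \, \hat{u}(\omega) = 0
\quad \Longrightarrow \quad
\hat{u}(\omega) = \frac{\hat{f}(\omega)}{1 + \lambda \, |\omega|^{2\alpha}}
```

Under these assumptions the filter is a pointwise multiplication in the frequency domain: larger $\lambda$ or larger $\alpha$ attenuates high frequencies more strongly, which is the trade-off between noise removal and feature preservation that the two parameters control.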
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM WACV 2026
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes the customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD considers user-defined text as an abnormal event and detects frames containing a specified event in a video. We effectively implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive results on VAD benchmarks, achieving state-of-the-art performance on UBnormal and UCF-Crime and surpassing other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.
comment: Accepted to WACV 2026
Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video
Data-driven learning of soft continuum robot (SCR) dynamics from high-dimensional observations offers flexibility but often lacks physical interpretability, while model-based approaches require prior knowledge and can be computationally expensive. We bridge this gap by introducing (1) the Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds. (2) By coupling these attention maps to 2D oscillator networks, we enable direct on-image visualization of learned dynamics (masses, stiffness, and forces) without prior knowledge. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy: 5.7x error reduction for Koopman operators and 3.5x for oscillator networks on the two-segment robot. The learned oscillator network autonomously discovers a chain structure of oscillators. Unlike standard methods, ABCD models enable smooth latent space extrapolation beyond training data. This fully data-driven approach yields compact, physically interpretable models suitable for control applications.
comment: Dataset available at: https://zenodo.org/records/17812071
Perspective-Invariant 3D Object Detection ICCV 2025
With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit have been made publicly available.
comment: ICCV 2025; 54 pages, 18 figures, 22 tables; Project Page at https://pi3det.github.io
CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
comment: Accepted by ACM Multimedia 2025
Point-PNG: Conditional Pseudo-Negatives Generation for Point Cloud Pre-Training
We propose Point-PNG, a novel self-supervised learning framework that generates conditional pseudo-negatives in the latent space to learn point cloud representations that are both discriminative and transformation-sensitive. Conventional self-supervised learning methods focus on achieving invariance, discarding transformation-specific information. Recent approaches incorporate transformation sensitivity by explicitly modeling relationships between original and transformed inputs. However, they often suffer from an invariant-collapse phenomenon, where the predictor degenerates into identity mappings, resulting in latent representations with limited variation across transformations. To address this, we propose Point-PNG that explicitly penalizes invariant collapse through pseudo-negatives generation, enabling the network to capture richer transformation cues while preserving discriminative representations. To this end, we introduce a parametric network, COnditional Pseudo-Negatives Embedding (COPE), which learns localized displacements induced by transformations within the latent space. A key challenge arises when jointly training COPE with the MAE, as it tends to converge to trivial identity mappings. To overcome this, we design a loss function based on pseudo-negatives conditioned on the transformation, which penalizes such trivial invariant solutions and enforces meaningful representation learning. We validate Point-PNG on shape classification and relative pose estimation tasks, showing competitive performance on ModelNet40 and ScanObjectNN under challenging evaluation protocols, and achieving superior accuracy in relative pose estimation compared to supervised baselines.
comment: Accepted for publication in IEEE ACCESS
3D Question Answering via only 2D Vision-Language Models ICML2025
Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Due to the limited training data in 3D, we do not train LVLMs but infer in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. Once the 2D model is chosen, e.g., LLAVA-OV, the quality of the sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to the resource-intensive 3D LVLMs for addressing 3D tasks.
comment: ICML2025
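The idea behind viewNMS, removing redundant views based on spatial overlap, can be sketched as a greedy non-maximum suppression. The sketch assumes each view has a relevance score and a set of visible 3D point indices, and measures overlap as intersection-over-union of those sets; the function name, data layout, and threshold are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set


def view_nms(
    scores: Dict[str, float],
    visible: Dict[str, Set[int]],
    max_overlap: float = 0.5,
) -> List[str]:
    """Greedily keep high-scoring views, dropping any view that overlaps a kept one too much."""
    kept: List[str] = []
    # Visit views from most to least relevant.
    for view in sorted(scores, key=lambda v: scores[v], reverse=True):
        redundant = False
        for other in kept:
            union = visible[view] | visible[other]
            iou = len(visible[view] & visible[other]) / len(union) if union else 0.0
            if iou > max_overlap:
                redundant = True  # covers nearly the same region as a better view
                break
        if not redundant:
            kept.append(view)
    return kept
```

Because views are visited in descending score order, the most relevant view of each overlapping cluster survives, which keeps the selection both critical and diverse.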
ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation
Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, and (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments, leveraging the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically construct hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos are available at https://github.com/scy-v/ReSem3D and https://resem3d.github.io.
comment: 12 pages, 9 figures
Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel KAN Adapter for Enhanced Feature Adaptation WACV 2026
Specialized datasets that capture the fashion industry's rich language and styling elements can boost progress in AI-driven fashion design. We present FLORA (Fashion Language Outfit Representation for Apparel Generation), the first comprehensive dataset containing 4,330 curated pairs of fashion outfits and corresponding textual descriptions. Each description utilizes industry-specific terminology and jargon commonly used by professional fashion designers, providing precise and detailed insights into the outfits. Hence, the dataset captures the delicate features and subtle stylistic elements necessary to create high-fidelity fashion designs. We demonstrate that fine-tuning generative models on the FLORA dataset significantly enhances their capability to generate accurate and stylistically rich images from textual descriptions of fashion sketches. FLORA will catalyze the creation of advanced AI models capable of comprehending and producing subtle, stylistically rich fashion designs. It will also help fashion designers and end-users bring their ideas to life. As a second, orthogonal contribution, we introduce NeRA (Nonlinear low-rank Expressive Representation Adapter), a novel adapter architecture based on Kolmogorov-Arnold Networks (KAN). Unlike traditional PEFT techniques such as LoRA, LoKR, DoRA, and LoHA that use MLP adapters, NeRA uses learnable spline-based nonlinear transformations, enabling superior modeling of complex semantic relationships and achieving strong fidelity, faster convergence, and semantic alignment. Extensive experiments on our proposed FLORA and LAION-5B datasets validate the superiority of NeRA over existing adapters. We will open-source both the FLORA dataset and our implementation code.
comment: Accepted as a Conference Paper at WACV 2026 (USA)
Edge-Only Universal Adversarial Attacks in Distributed Learning
Distributed learning frameworks, which partition neural network models across multiple computing nodes, enhance efficiency in collaborative edge-cloud systems, but may also introduce new vulnerabilities to evasion attacks, often in the form of adversarial perturbations. In this work, we present a new threat model that explores the feasibility of generating universal adversarial perturbations (UAPs) when the attacker has access only to the edge portion of the model, consisting of its initial network layers. Unlike traditional attacks that require full model knowledge, our approach shows that adversaries can induce effective mispredictions in the unknown cloud component by manipulating key feature representations at the edge. Following the proposed threat model, we introduce both edge-only untargeted and targeted formulations of UAPs designed to control intermediate features before the split point. Our results on ImageNet demonstrate strong attack transferability to the unknown cloud part, and we compare the proposed method with classical white-box and black-box techniques, highlighting its effectiveness. Additionally, we analyze the capability of an attacker to achieve targeted adversarial effects with edge-only knowledge, revealing intriguing behaviors across multiple networks. By introducing the first adversarial attacks with edge-only knowledge in split inference, this work underscores the importance of addressing partial model access in adversarial robustness, encouraging further research in this area.
Vision-centric Token Compression in Large Language Model NeurIPS 2025
Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by 7.6% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at https://github.com/CSU-JPG/VIST.
comment: NeurIPS 2025 spotlight
MedDiff-FM: A Diffusion-based Foundation Model for Versatile Medical Image Applications
Diffusion models have achieved significant success in both natural image and medical image domains, encompassing a wide range of applications. Previous investigations in medical images have often been constrained to specific anatomical regions, particular applications, and limited datasets, resulting in isolated diffusion models. This paper introduces a diffusion-based foundation model to address a diverse range of medical image tasks, namely MedDiff-FM. MedDiff-FM leverages 3D CT images from multiple publicly available datasets, covering anatomical regions from head to abdomen, to pre-train a diffusion foundation model, and explores the capabilities of the diffusion foundation model across a variety of application scenarios. The diffusion foundation model handles multi-level integrated image processing both at the image-level and patch-level, utilizes position embedding to establish multi-level spatial relationships, and leverages region classes and anatomical structures to capture certain anatomical regions. MedDiff-FM manages several downstream tasks seamlessly, including image denoising, anomaly detection, and image synthesis. MedDiff-FM is also capable of performing super-resolution, lesion generation, and lesion inpainting by rapidly fine-tuning the diffusion foundation model using ControlNet with task-specific conditions. The experimental results demonstrate the effectiveness of MedDiff-FM in addressing diverse downstream medical image tasks.
TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image
Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI -- especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
comment: Project page: https://d-robotics-ai-lab.github.io/TabletopGen.project/
Enabling Validation for Robust Few-Shot Recognition
Few-Shot Recognition (FSR) tackles classification tasks by training with minimal task-specific labeled data. Prevailing methods adapt or finetune a pretrained Vision-Language Model (VLM) and augment the scarce training data by retrieving task-relevant but noisy samples from open data sources. The finetuned VLM generalizes reasonably well to the task-specific in-distribution (ID) test data but struggles with out-of-distribution (OOD) test data. This motivates our study of robust FSR with VLM finetuning. The core challenge of FSR is data scarcity, extending beyond limited training data to a complete lack of validation data. We identify a key paradox as a potential solution: repurposing the retrieved open data for validation. As such retrieved data are inherently OOD compared with the task-specific ID training data, finetuned VLMs yield degraded performance on the retrieved data. This causes the validation logic to favor the pretrained model without any finetuning, hindering improvements with respect to generalization. To resolve this dilemma, we introduce a novel validation strategy that harmonizes performance gain and degradation on the few-shot ID data and the retrieved data, respectively. Our validation enables parameter selection for partial finetuning and checkpoint selection, mitigating overfitting and improving test-data generalization. We unify this strategy with robust learning into a cohesive framework: Validation-Enabled Stage-wise Tuning (VEST). Extensive experiments on the established ImageNet OOD benchmarks show that VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.
comment: Project website: https://hannawang09.github.io/projects/vest/
Martian World Model: Controllable Video Synthesis with Physically Accurate 3D Reconstructions
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA's Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
comment: Project Page: https://marsgenai.github.io
HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation 3DV 2026
Despite advances in indoor 3D scene layout generation, synthesizing scenes with dense object arrangements remains challenging. Existing methods focus on large furniture while neglecting smaller objects, resulting in unrealistically empty scenes. Those that place small objects typically do not honor arrangement specifications, resulting in largely random placement not following the text description. We present Hierarchical Scene Motifs (HSM): a hierarchical framework for indoor scene generation with dense object arrangements across spatial scales. Indoor scenes are inherently hierarchical, with surfaces supporting objects at different scales, from large furniture on floors to smaller objects on tables and shelves. HSM embraces this hierarchy and exploits recurring cross-scale spatial patterns to generate complex and realistic scenes in a unified manner. Our experiments show that HSM outperforms existing methods by generating scenes that better conform to user input across room types and spatial configurations. Project website is available at https://3dlg-hcvc.github.io/hsm .
comment: Accepted at 3DV 2026; 29 pages with 11 figures and 6 tables; Camera-ready with additional discussion
Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection WACV 2026
In autonomous driving scenarios, accurate perception is becoming an even more critical task for safe navigation. While LiDAR provides precise spatial data, its inherent sparsity makes it difficult to detect small or distant objects. Existing methods try to address this by generating additional points within a Region of Interest (RoI), but relying on LiDAR alone often leads to false positives and a failure to recover meaningful structures. To address these limitations, we propose the Image-Guided Semantic Pseudo-LiDAR Point Generation model, called ImagePG, a novel framework that leverages rich RGB image features to generate dense and semantically meaningful 3D points. Our framework includes an Image-Guided RoI Points Generation (IG-RPG) module, which creates pseudo-points guided by image features, and an Image-Aware Occupancy Prediction Network (I-OPN), which provides spatial priors to guide point placement. A multi-stage refinement (MR) module further enhances point quality and detection robustness. To the best of our knowledge, ImagePG is the first method to directly leverage image features for point generation. Extensive experiments on the KITTI and Waymo datasets demonstrate that ImagePG significantly improves the detection of small and distant objects such as pedestrians and cyclists, reducing false positives by nearly 50%. On the KITTI benchmark, our framework improves mAP by +1.38%p (car), +7.91%p (pedestrian), and +5.21%p (cyclist) on the test set over the baseline, achieving state-of-the-art cyclist performance on the KITTI leaderboard. The code is available at: https://github.com/MS-LIMA/ImagePG
comment: WACV 2026
You Only Train Once (YOTO): A Retraining-Free Object Detection Framework
Object detection is a primary task in computer vision and is used in numerous domains. Nonetheless, object detection continues to suffer from catastrophic forgetting. The model must be retrained whenever new products are introduced, using not only the new-product dataset but also the entire previous dataset. The outcome is higher training cost and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we use cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy for detecting both new and existing products. Furthermore, because no retraining is required, the difference in training duration is significant: we achieve almost 3 times the training time efficiency of classical object detection approaches. This efficiency gain grows as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products on an edge device, validating the proposed framework's feasibility for practical use.
comment: This manuscript was first submitted to the Engineering (Elsevier Journal). The preprint version was posted to arXiv afterwards to facilitate open access and community feedback
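The cosine-similarity classification step described in the YOTO abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `gallery` dictionary stands in for the Qdrant vector database, the 3-dimensional embeddings are made-up placeholders for DeIT features, and the `threshold` parameter is a hypothetical rejection rule added for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query, gallery, threshold=0.5):
    """Return (label, score) for the most similar gallery embedding.

    The label is None when the best score falls below `threshold`
    (a hypothetical rejection rule, not taken from the paper).
    """
    best_label, best_score = None, -1.0
    for label, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Hypothetical product embeddings; a real system would store learned features.
gallery = {"cola": [1.0, 0.0, 0.0], "chips": [0.0, 1.0, 0.0]}

# Registering a new product only adds an embedding; no model is retrained.
gallery["gum"] = [0.0, 0.0, 1.0]

label, score = classify([0.1, 0.0, 0.9], gallery)
print(label, round(score, 3))
```

The design point the sketch tries to show is that adding a product changes only the stored embeddings, which is consistent with the retraining-free claim in the abstract.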
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
Modern vision-language models (VLMs) are expected to reason spatially across diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning NeurIPS 2025
Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues -- object pose, lane positions, and object trajectories -- which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM's outputs and provide accurate reasoning. Zero-shot evaluations on four public dash-cam video benchmarks show that iFinder's proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
comment: Accepted at NeurIPS 2025
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This suggests that the earlier-formed semantics can benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture latents synchronously, neglecting this ordering. Motivated by this observation, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
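The idea in the SFD abstract above, that semantics precede textures by a temporal offset, can be illustrated with a small scheduling sketch. The abstract does not give the actual noise schedules, so everything here is an assumption made for illustration: the function name `async_progress`, the linear mapping, and the convention that progress 0 means pure noise and 1 means fully denoised.

```python
def async_progress(step, num_steps, offset):
    """Denoising progress for the semantic and texture latents at a sampling step.

    Progress runs from 0.0 (pure noise) to 1.0 (clean). The semantic latent
    leads the texture latent by `offset` (a hypothetical parameter in [0, 1]).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if not 0.0 <= offset <= 1.0:
        raise ValueError("offset must lie in [0, 1]")
    # Stretch the shared clock so both latents finish at the final step.
    u = (1.0 + offset) * step / num_steps
    semantic = min(1.0, u)
    texture = min(1.0, max(0.0, u - offset))
    return semantic, texture


# Print both schedules side by side for an illustrative 8-step sampler.
for step in range(9):
    semantic, texture = async_progress(step, num_steps=8, offset=0.25)
    print(step, round(semantic, 3), round(texture, 3))
```

Under these assumptions the texture latent stays at pure noise until the semantic latent has advanced by the offset, and both reach the clean state at the last step.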
V-CECE: Visual Counterfactual Explanations via Conceptual Edits NeurIPS 2025
Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model, and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing Convolutional Neural Network (CNN), Vision Transformer (ViT), and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.
comment: Accepted in NeurIPS 2025
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain a massive number of parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference. Moreover, most training paradigms degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suite, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
comment: Github: https://github.com/MINT-SJTU/Evo-1
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released under https://github.com/AI45Lab/IS-Bench.
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
Structure is Supervision: Multiview Masked Autoencoders for Radiology
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
Test-Time 3D Occupancy Prediction
Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in a lift-track-voxelize symphony: we first lift the surround-view geometry and semantics extracted from VFMs to instantiate Gaussians in 3D space; next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency; finally, we voxelize the optimized Gaussians to generate the occupancy prediction. Optionally, inherent noise in VFM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants, one LiDAR-based and one vision-centric, and conduct extensive experiments on the Occ3D and nuCraft benchmarks with varying voxel resolutions.
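The claim in the TT-Occ abstract above that Gaussians can be voxelized at arbitrary user-specified resolutions can be illustrated with a deliberately simplified sketch. It is an assumption-laden toy, not the paper's procedure: each Gaussian is reduced to its centre and a semantic label, covariance and opacity are ignored, and the per-voxel label is chosen by majority vote. The function name `voxelize` and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def voxelize(gaussians, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Assign Gaussian centres to voxels and pick the majority label per voxel.

    gaussians: iterable of ((x, y, z), label) pairs.
    voxel_size: edge length of a cubic voxel, in the same units as the centres.
    Returns a dict mapping integer voxel indices (i, j, k) to a label.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    votes = defaultdict(Counter)
    for centre, label in gaussians:
        index = tuple(
            math.floor((c - o) / voxel_size) for c, o in zip(centre, origin)
        )
        votes[index][label] += 1
    return {index: counts.most_common(1)[0][0] for index, counts in votes.items()}


# Hypothetical scene: the same Gaussians voxelized at two resolutions.
scene = [
    ((0.25, 0.25, 0.25), "car"),
    ((0.75, 0.25, 0.25), "car"),
    ((0.75, 0.75, 0.25), "road"),
    ((1.25, 0.25, 0.25), "road"),
]
coarse = voxelize(scene, voxel_size=1.0)
fine = voxelize(scene, voxel_size=0.5)
print(len(coarse), len(fine))
```

Because the Gaussians are kept as a continuous representation, changing the resolution only changes the `voxel_size` argument; nothing is retrained.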
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to adaptively focus and zoom in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we use outcome accuracy and output format as rewards to update the Qwen2.5-VL model, which further refines the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
comment: https://github.com/xtong-zhang/Chain-of-Focus
LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight
This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, addressing the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed in the nnUNet format. Through systematic technical validation and evaluation of the preprocessing process, annotation quality and automatic segmentation algorithm, the deep learning model trained on this dataset is verified to segment lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which demonstrates the applicability and stability of this dataset for accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at https://github.com/SuperD0122/LymphAtlas-.
comment: 12 pages, 3 figures
A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion
The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that is view-free, taking only partial point clouds as input. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.
comment: 7 pages, 2 figures
Multi-Scale Direction-Aware Network for Infrared Small Target Detection
Infrared small target detection faces the difficulty of effectively separating the target from the background. Existing deep learning-based methods focus on edge and shape features, but ignore the richer structural differences and detailed information embedded in high-frequency components from different directions, thereby failing to fully exploit the value of high-frequency directional features in target perception. To address this limitation, we propose a multi-scale direction-aware network (MSDA-Net), which is the first attempt to integrate the high-frequency directional features of infrared small targets as domain prior knowledge into neural networks. Specifically, to fully mine the high-frequency directional features, on the one hand, a high-frequency direction injection (HFDI) module without trainable parameters is constructed to inject the high-frequency directional information of the original image into the network. On the other hand, a multi-scale direction-aware (MSDA) module is constructed, which promotes the full extraction of local relations at different scales and the full perception of key features in different directions. In addition, considering the characteristics of infrared small targets, we construct a feature aggregation (FA) structure to address target disappearance in high-level feature maps, and a feature calibration fusion (FCF) module to alleviate feature bias during cross-layer feature fusion. Extensive experimental results show that our MSDA-Net achieves state-of-the-art (SOTA) results on multiple public datasets. The code is available at https://github.com/YuChuang1205/MSDA-Net
COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition ACM MM 2024
High frame-rate (HFR) videos in action recognition improve fine-grained expression while reducing the density of spatio-temporal relation and motion information. Thus, traditional data-driven training continuously requires large amounts of video samples. However, samples are not always sufficient in real-world scenarios, which motivates few-shot action recognition (FSAR) research. We observe that most recent FSAR works build the spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, separating spatial and temporal features within samples. They also capture motion information from narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. We refer to the model designed with this architecture as SOAP-Net. Temporal connections between different feature channels and the spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames that contain more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.
comment: Accepted by ACM MM 2024
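The notion of frame tuples with diverse frame counts in the SOAP abstract above can be made concrete with a small sketch. The abstract does not say how tuples are formed, so the sliding-window construction over consecutive frames, the function names, and the default tuple sizes below are all assumptions for illustration only.

```python
def frame_tuples(frames, tuple_size, stride=1):
    """Group consecutive frames into overlapping tuples of `tuple_size` frames."""
    if tuple_size < 1 or stride < 1:
        raise ValueError("tuple_size and stride must be at least 1")
    last_start = len(frames) - tuple_size
    return [
        tuple(frames[start:start + tuple_size])
        for start in range(0, last_start + 1, stride)
    ]


def multi_scale_tuples(frames, tuple_sizes=(2, 3)):
    """Build frame tuples for several tuple sizes (hypothetical default sizes)."""
    return {size: frame_tuples(frames, size) for size in tuple_sizes}


# Frame indices stand in for decoded video frames in this toy example.
clip = list(range(5))
for size, tuples in multi_scale_tuples(clip, tuple_sizes=(2, 3, 4)).items():
    print(size, tuples)
```

A tuple of size 2 corresponds to the adjacent-frame view the abstract criticizes as narrow; larger sizes span more frames, which is one plausible reading of the "broader perspective" described there.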
A Scene-aware Models Adaptation Scheme for Cross-scene Online Inference on Mobile Devices
Emerging Artificial Intelligence of Things (AIoT) applications require online prediction using deep neural network (DNN) models on mobile devices. However, due to the movement of devices, unfamiliar test samples constantly appear, significantly affecting the prediction accuracy of a pre-trained DNN. In addition, unstable network connections call for local model inference. In this paper, we propose a light-weight scheme, called Anole, to cope with local DNN model inference on mobile devices. The core idea of Anole is to first establish an army of compact DNN models, and then adaptively select the model that best fits the current test sample for online inference. The key is to automatically identify model-friendly scenes for training scene-specific DNN models. To this end, we design a weakly-supervised scene representation learning algorithm that combines human heuristics and feature similarity to separate scenes. Moreover, we train a model classifier to predict the best-fit scene-specific DNN model for each test sample. We implement Anole on different types of mobile devices and conduct extensive trace-driven and real-world experiments based on unmanned aerial vehicles (UAVs). The results demonstrate that Anole outperforms the method of using a versatile large DNN in terms of prediction accuracy (4.5% higher), response time (33.1% faster) and power consumption (45.1% lower).
comment: This version presents the extended and revised journal version of our 2024 conference paper, incorporating new datasets, expanded evaluations, and improved methodological details. The manuscript has been accepted for publication in IEEE Transactions on Mobile Computing
Collaborative Face Experts Fusion in Video Generation: Boosting Identity Consistency Across Large Face Poses
Current video generation models struggle with identity preservation under large face poses, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT architectures, and the lack of targeted coverage of large face poses in existing open-source video datasets. To address these, we present two key innovations. First, we propose Collaborative Face Experts Fusion (CoFE), which dynamically fuses complementary signals from three specialized experts within the DiT backbone: an identity expert that captures cross-pose invariant features, a semantic expert that encodes high-level visual context, and a detail expert that preserves pixel-level attributes such as skin texture and color gradients. Second, we introduce a data curation pipeline comprising three key components: Face Constraints to ensure diverse large-pose coverage, Identity Consistency to maintain stable identity across frames, and Speech Disambiguation to align textual captions with actual speaking behavior. This pipeline yields LaFID-180K, a large-scale dataset of pose-annotated video clips designed for identity-preserving video generation. Experimental results on several benchmarks demonstrate that our approach significantly outperforms state-of-the-art methods in face similarity, FID, and CLIP semantic alignment. Project page: https://rain152.github.io/CoFE/.
comment: Project page: https://rain152.github.io/CoFE/
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views IROS'25
Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate possible future hand waypoints, but these still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to predict hand-object interaction states (contact/separation) to better facilitate downstream tasks. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also demonstrates its impressive human-robot policy transfer to enable robotic manipulation, and effective feature enhancement for action anticipation/recognition.
comment: Extended journal version of MMTwin (IROS'25). Code and data: https://github.com/IRMVLab/UniHand
TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose Real-Time Accuracy (RTA) to jointly capture correctness and responsiveness under tight decision windows, and Memory Persistence Time (MPT) as a forward-looking metric for long-term retention in continuous streams. In this work, we report RTA results for current models and release TeleEgo, together with an MPT evaluation framework, as a realistic and extensible benchmark for future egocentric assistants with stronger streaming memory, enabling systematic study of both real-time behavior and long-horizon memory.
Adaptive Keyframe Selection for Scalable 3D Scene Reconstruction in Dynamic Environments
In this paper, we propose an adaptive keyframe selection method for improved 3D scene reconstruction in dynamic environments. The proposed method integrates two complementary modules: an error-based selection module utilizing photometric and structural similarity (SSIM) errors, and a momentum-based update module that dynamically adjusts keyframe selection thresholds according to scene motion dynamics. By dynamically curating the most informative frames, our approach addresses a key data bottleneck in real-time perception. This allows for the creation of high-quality 3D world representations from a compressed data stream, a critical step towards scalable robot learning and deployment in complex, dynamic environments. Experimental results demonstrate significant improvements over traditional static keyframe selection strategies, such as fixed temporal intervals or uniform frame skipping. These findings highlight a meaningful advancement toward adaptive perception systems that can dynamically respond to complex and evolving visual scenes. We evaluate our proposed adaptive keyframe selection module on two recent state-of-the-art 3D reconstruction networks, Spann3r and CUT3R, and observe consistent improvements in reconstruction quality across both frameworks. Furthermore, an extensive ablation study confirms the effectiveness of each individual component in our method, underlining their contribution to the overall performance gains.
comment: Accepted at ROBOVIS 2026
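The adaptive keyframe selection abstract above combines an error-based test (photometric and SSIM errors) with a momentum-based threshold update. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the equal weighting of the two errors, the single-window SSIM (a simplification of the usual windowed SSIM), and the exponential-moving-average threshold rule are all assumptions made for illustration.

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """SSIM computed over the whole image as one window (simplified)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def frame_error(frame, keyframe, w=0.5):
    """Weighted sum of mean absolute (photometric) error and 1 - SSIM."""
    photometric = np.abs(frame - keyframe).mean()
    structural = 1.0 - global_ssim(frame, keyframe)
    return w * photometric + (1.0 - w) * structural

def select_keyframes(frames, init_threshold=0.1, momentum=0.9):
    """Return indices of frames whose error to the last keyframe exceeds
    a threshold that follows the recent error level (hypothetical rule)."""
    keyframes = [0]                      # the first frame is always kept
    threshold = init_threshold
    for i in range(1, len(frames)):
        err = frame_error(frames[i], frames[keyframes[-1]])
        if err > threshold:
            keyframes.append(i)
        # Momentum update: the threshold drifts toward the observed error.
        threshold = momentum * threshold + (1.0 - momentum) * err
    return keyframes

# Toy sequence: three identical dark frames, then two identical bright ones.
frames = [np.zeros((8, 8))] * 3 + [np.ones((8, 8))] * 2
selected = select_keyframes(frames)
```

In this toy sequence only the first frame and the first bright frame are retained, because every other frame is identical to the most recent keyframe.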
TextureSplat: Per-Primitive Texture Mapping for Reflective Gaussian Splatting 3DV 2026
Gaussian Splatting has demonstrated remarkable novel view synthesis performance at high rendering frame rates. However, optimization-based inverse rendering within complex capture scenarios remains a challenging problem. A particular case is modelling complex surface light interactions for highly reflective scenes, which results in intricate high-frequency specular radiance components. We hypothesize that such challenging settings can benefit from increased representation power. We hence propose a method that tackles this issue through a geometrically and physically grounded Gaussian Splatting-based radiance field, where normals and material properties are spatially variable in the primitive's local space. Using per-primitive texture maps for this purpose, we also propose to harness the GPU hardware to accelerate rendering at test time via a unified material texture atlas. Code will be available at https://github.com/maeyounes/TextureSplat
comment: 3DV 2026
Image and Video Processing
General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood WACV 2026
The rapid advancement of generative models, particularly diffusion-based methods, has significantly improved the realism of synthetic images. As new generative models continuously emerge, detecting generated images remains a critical challenge. While fully supervised and few-shot methods have been proposed, maintaining an updated dataset is time-consuming and challenging. Consequently, zero-shot methods have gained increasing attention in recent years. We find that existing zero-shot methods often struggle to adapt to specific image domains, such as artistic images, limiting their real-world applicability. In this work, we introduce CLIDE, a novel zero-shot detection method based on conditional likelihood approximation. Our approach computes likelihoods conditioned on real images, enabling adaptation across diverse image domains. We extensively evaluate CLIDE, demonstrating state-of-the-art performance on a large-scale general dataset and significantly outperforming existing methods in domain-specific cases. These results demonstrate the robustness of our method and underscore the need for broad, domain-aware generalization for the AI-generated image detection task. Code is available at https://github.com/FujitsuResearch/domain_adaptive_image_detection.
comment: 8 pages, 6 figures, accepted to WACV 2026
Image Semantic Communication with Quadtree Partition-based Coding
Deep learning-based semantic communication (DeepSC) systems have emerged as a promising paradigm for efficient wireless transmission. However, existing image DeepSC methods frequently encounter challenges in balancing rate-distortion performance and computational complexity, and often exhibit inferior performance compared to traditional schemes, especially on high-resolution datasets. To address these limitations, we propose a novel image DeepSC system, using quadtree partition-based joint semantic-channel coding, named Quad-DeepSC, which maintains low complexity while achieving state-of-the-art transmission performance. Based on maturing learned image compression technologies, we establish a unified DeepSC system design and training pipeline. The proposed Quad-DeepSC integrates quadtree partition-based entropy estimation and feature coding modules with lightweight feature extraction and reconstruction networks to form an end-to-end architecture. During training, all components except the feature coding modules are jointly optimized as a compact learned image codec, Quad-LIC, for source compression tasks. The pretrained Quad-LIC is then embedded into Quad-DeepSC and fine-tuned end-to-end over wireless channels. Extensive experimental results demonstrate that Quad-DeepSC is the first DeepSC system to surpass conventional communication systems, which employ VTM for source coding and adopt the optimal MCS index under 3GPP standards for channel coding and digital modulation, in performance across datasets of varying resolutions. Notably, both Quad-DeepSC and Quad-LIC exhibit minimal latency, rendering them well-suited for deployment in real-time wireless communication systems.
CATNUS: Coordinate-Aware Thalamic Nuclei Segmentation Using T1-Weighted MRI
Accurate segmentation of thalamic nuclei from magnetic resonance images is important due to the distinct roles of these nuclei in overall brain function and to their differential involvement in neurological and psychiatric disorders. However, segmentation remains challenging given the small size of many nuclei, limited intrathalamic contrast and image resolution, and inter-subject anatomical variability. In this work, we present CATNUS (Coordinate-Aware Thalamic Nuclei Segmentation), segmenting 13 thalamic nuclei (or nuclear groups) using a 3D U-Net architecture enhanced with coordinate convolution layers, which provide more precise localization of both large and small nuclei. To support broad clinical applicability, we provide pre-trained model variants that can operate on quantitative T1 maps as well as on widely used magnetization-prepared rapid gradient echo (MPRAGE) and fast gray matter acquisition T1 inversion recovery (FGATIR) sequences. We benchmarked CATNUS against established methods, including FreeSurfer, THOMAS and HIPS-THOMAS, demonstrating improved segmentation accuracy and robust test-retest reliability across multiple nuclei. Furthermore, CATNUS demonstrated strong out-of-distribution generalization on traveling-subject datasets spanning multiple scanners, field strengths, and vendors, producing reliable and anatomically coherent segmentations across diverse acquisition conditions. Overall, CATNUS provides an accurate and generalizable solution for thalamic nuclei segmentation, with strong potential to facilitate large-scale neuroimaging studies and support real-world clinical assessment.
Multi-Modal Zero-Shot Prediction of Color Trajectories in Food Drying
Food drying is widely used to reduce moisture content, ensure safety, and extend shelf life. Color evolution of food samples is an important indicator of product quality in food drying. Although existing studies have examined color changes under different drying conditions, current approaches primarily rely on low-dimensional color features and cannot fully capture the complex, dynamic color trajectories of food samples. Moreover, existing modeling approaches lack the ability to generalize to unseen process conditions. To address these limitations, we develop a novel multi-modal color-trajectory prediction method that integrates high-dimensional temporal color information with drying process parameters to enable accurate and data-efficient color trajectory prediction. Under unseen drying conditions, the model attains RMSEs of 2.12 for cookie drying and 1.29 for apple drying, reducing errors by over 90% compared with baseline models. These experimental results demonstrate the model's superior accuracy, robustness, and broad applicability.
Convergent Primal-Dual Plug-and-Play Image Restoration: A General Algorithm and Applications
We propose a general deep plug-and-play (PnP) algorithm with a theoretical convergence guarantee. PnP strategies have demonstrated outstanding performance in various image restoration tasks by exploiting the powerful priors underlying Gaussian denoisers. However, existing PnP methods often lack theoretical convergence guarantees under realistic assumptions due to their ad-hoc nature, resulting in inconsistent behavior. Moreover, even when convergence guarantees are provided, they are typically designed for specific settings or require a considerable computational cost in handling non-quadratic data-fidelity terms and additional constraints, which are key components in many image restoration scenarios. To tackle these challenges, we integrate the PnP paradigm with primal-dual splitting (PDS), an efficient proximal splitting methodology for solving a wide range of convex optimization problems, and develop a general convergent PnP framework. Specifically, we establish theoretical conditions for the convergence of the proposed PnP algorithm under a reasonable assumption. Furthermore, we show that the problem solved by the proposed PnP algorithm is not a standard convex optimization problem but a more general monotone inclusion problem, where we provide a mathematical representation of the solution set. Our approach efficiently handles a broad class of image restoration problems with guaranteed theoretical convergence. Numerical experiments on specific image restoration tasks validate the practicality and effectiveness of our theoretical results.
comment: For the conference proceeding, see https://ieeexplore.ieee.org/document/10448023. Our implementation can be found at https://github.com/MDI-TokyoTech/Convergent_Primal-Dual_Plug-and-Play_Image_Restoration_A_General_Algorithm_and_Applications
A robot-assisted pipeline to rapidly scan 1.7 million historical aerial photographs
During the 20th Century, aerial surveys captured hundreds of millions of high-resolution photographs of the earth's surface. These images, the precursors to modern satellite imagery, represent an extraordinary visual record of the environmental and social upheavals of the 20th Century. However, most of these images currently languish in physical archives where retrieval is difficult and costly. Digitization could revolutionize access, but manual scanning is slow and expensive. Automated scanning could make at-scale digitization feasible, unlocking this visual record of the 20th Century for the digital era. Here, we describe and validate a novel robot-assisted pipeline that increases worker productivity in scanning 30-fold, applied at scale to digitize an archive of 1.7 million historical aerial photographs from 65 countries.
Bayesian Insights into Exchange and Restriction in Gray Matter Diffusion MRI
Biophysical models in diffusion MRI (dMRI) hold promise for characterizing gray matter tissue microstructure. Yet, the reliability of their parameter estimates remains largely under-studied, especially in models that incorporate water exchange. In this study, we investigate the accuracy, precision, and presence of degeneracy of two recently proposed gray matter models, NEXI and SANDIX, using established acquisition protocols, on both simulated and in vivo data. We employ μGUIDE, a Bayesian inference framework based on deep learning, to quantify parameter uncertainty and detect degeneracies, enabling a more interpretable assessment of model fits. Our results show that while some microstructural parameters, such as extra-cellular diffusivity and neurite signal fraction, are robustly estimated, others, including exchange time and soma radius, are often associated with high uncertainty and estimation bias, particularly under realistic noise conditions and reduced acquisition protocols. Comparison with non-linear least squares fitting highlights the critical advantage of uncertainty-aware methods: the ability to flag and filter out unreliable estimates. Together, these findings emphasize the need to report uncertainty and account for model degeneracies when interpreting model-based estimates. Our study advocates for the integration of probabilistic fitting approaches into imaging pipelines to improve reproducibility and biological interpretability.
A Fractional Variational Approach to Spectral Filtering Using the Fourier Transform
The interference of fluorescence signals and noise remains a significant challenge in Raman spectrum analysis, often obscuring subtle spectral features that are critical for accurate analysis. Inspired by variational methods similar to those used in image denoising, our approach minimizes a functional involving fractional derivatives to balance noise suppression with the preservation of essential chemical features of the signal, such as peak position, intensity, and area. The original problem is reformulated in the frequency domain through the Fourier transform, making the implementation simple and fast. In this work, we discuss the theoretical framework, practical implementation, and the advantages and limitations of this method in the context of simulated Raman data, as well as in image processing. The main contribution of this article is the combination of a variational approach in the frequency domain, the use of fractional derivatives, and the optimization of the regularization parameter and derivative order through the concept of Shannon entropy. This work explores how the fractional order, combined with the regularization parameter, affects noise removal and preserves the essential features of the spectrum and image. Finally, the study shows that the combination of the proposed strategies produces an efficient, robust, and easily implementable filter.
comment: 31 pages, 3 figures, 2 tables
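The abstract above does not state the functional it minimizes. As a hedged illustration of why a frequency-domain reformulation makes such filters simple, assume a quadratic functional of the form ½‖u − f‖² + (λ/2)‖D^α u‖², where D^α is a fractional derivative of order α with Fourier symbol (iω)^α. Under that assumption the minimizer is a pointwise division in the Fourier domain, û(ω) = f̂(ω) / (1 + λ|ω|^{2α}). The function name and parameters below are hypothetical, and the Shannon-entropy selection of λ and α mentioned in the abstract is not reproduced.

```python
import numpy as np

def fractional_filter(signal, lam, alpha, dx=1.0):
    """Low-pass filter from an assumed quadratic fractional-derivative penalty.

    Applies the transfer function 1 / (1 + lam * |omega|^(2 * alpha))
    to the real FFT of the input and transforms back.
    """
    omega = 2.0 * np.pi * np.fft.rfftfreq(signal.size, d=dx)  # omega >= 0
    transfer = 1.0 / (1.0 + lam * omega ** (2.0 * alpha))
    return np.fft.irfft(np.fft.rfft(signal) * transfer, n=signal.size)

# Synthetic example: one Gaussian peak plus white noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 256)
clean = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)
noisy = clean + 0.1 * rng.standard_normal(x.size)
smoothed = fractional_filter(noisy, lam=5.0, alpha=0.8)
```

Because the transfer function equals 1 at zero frequency and never exceeds 1, this sketch preserves the mean of the signal and cannot increase its energy; how well it preserves peak position, intensity, and area depends on λ and α.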
Stochastic Orthogonal Regularization for deep projective priors
Many crucial tasks of image processing and computer vision are formulated as inverse problems. Thus, it is of great importance to design fast and robust algorithms to solve these problems. In this paper, we focus on generalized projected gradient descent (GPGD) algorithms where generalized projections are realized with learned neural networks and provide state-of-the-art results for imaging inverse problems. Indeed, neural networks allow for projections onto unknown low-dimensional sets that model complex data, such as images. We call these projections deep projective priors. In generic settings, when the orthogonal projection onto a low-dimensional model set is used, it has been shown, under a restricted isometry assumption, that the corresponding orthogonal PGD converges with a linear rate, yielding near-optimal convergence (within the class of GPGD methods) in the classical case of sparse recovery. However, for deep projective priors trained with classical mean squared error losses, there is little guarantee that the hypotheses for linear convergence are satisfied. In this paper, we propose a stochastic orthogonal regularization of the training loss for deep projective priors. This regularization is motivated by our theoretical results: a sufficiently good approximation of the orthogonal projection guarantees linear stable recovery with performance close to orthogonal PGD. We show experimentally, using two different deep projective priors (based on autoencoders and on denoising networks), that our stochastic orthogonal regularization yields projections that improve convergence speed and robustness of GPGD in challenging inverse problem settings, in accordance with our theoretical findings.
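For readers unfamiliar with projected gradient descent, the classical sparse-recovery case mentioned in the abstract above can be sketched as follows. This is not the paper's method: the learned projection is replaced here by hard thresholding, which is the orthogonal projection onto k-sparse vectors, and the problem sizes, step size, and function names are assumptions chosen for illustration.

```python
import numpy as np

def project_k_sparse(x, k):
    """Orthogonal projection onto k-sparse vectors: keep the k largest magnitudes."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def projected_gradient_descent(A, y, project, step, n_iter=200):
    """Iterate x <- P(x - step * A^T (A x - y)), starting from zero."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)      # gradient of 0.5 * ||A x - y||^2
        x = project(x - step * grad)
    return x

# Toy compressed-sensing problem with a 3-sparse ground truth.
rng = np.random.default_rng(0)
n, m, k = 100, 60, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[5, 40, 77]] = [1.0, -2.0, 1.5]
y = A @ x_true

step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / (largest singular value)^2
x_hat = projected_gradient_descent(A, y, lambda v: project_k_sparse(v, k), step)
```

Swapping `project_k_sparse` for a trained network would give the generalized setting the abstract describes; in that case the convergence behaviour depends on how closely the network approximates an orthogonal projection.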
LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight
This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, addressing the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed based on the nnUNet format. Through systematic technical validation of the preprocessing process, annotation quality and automatic segmentation algorithm, a deep learning model trained on this dataset is shown to segment lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which demonstrates the applicability and stability of this dataset for accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at https://github.com/SuperD0122/LymphAtlas-.
comment: 12 pages, 3 figures
A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion
The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong view-free baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that takes only partial point clouds as input. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.
comment: 7 pages, 2 figures
Synthetic multi-inversion time magnetic resonance images for visualization of subcortical structures
Purpose: Visualization of subcortical gray matter is essential in neuroscience and clinical practice, particularly for disease understanding and surgical planning. While multi-inversion time (multi-TI) T$_1$-weighted (T$_1$-w) magnetic resonance (MR) imaging improves visualization, it is rarely acquired in clinical settings. Approach: We present SyMTIC (Synthetic Multi-TI Contrasts), a deep learning method that generates synthetic multi-TI images using routinely acquired T$_1$-w, T$_2$-weighted (T$_2$-w), and FLAIR images. Our approach combines image translation via deep neural networks with imaging physics to estimate longitudinal relaxation time (T$_1$) and proton density (PD) maps. These maps are then used to compute multi-TI images with arbitrary inversion times. Results: SyMTIC was trained using paired MPRAGE and FGATIR images along with T$_2$-w and FLAIR images. It accurately synthesized multi-TI images from standard clinical inputs, achieving image quality comparable to that from explicitly acquired multi-TI data. The synthetic images, especially for TI values between 400-800 ms, enhanced visualization of subcortical structures and improved segmentation of thalamic nuclei. Conclusion: SyMTIC enables robust generation of high-quality multi-TI images from routine MR contrasts. It generalizes well to varied clinical datasets, including those with missing FLAIR images or unknown parameters, offering a practical solution for improving brain MR image visualization and analysis.
comment: Accepted for publication at the Journal of Medical Imaging
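The step in which T$_1$ and PD maps are turned into images at arbitrary inversion times can be illustrated with the textbook inversion-recovery magnitude signal, S(TI) = PD · |1 − 2·exp(−TI/T$_1$)|. This is a simplified sketch that assumes perfect inversion and full recovery between inversions; the abstract does not specify the signal equation SyMTIC uses, and the function name and map values below are hypothetical placeholders, not measurements.

```python
import numpy as np

def synth_ir_image(pd_map, t1_map_ms, ti_ms):
    """Simplified inversion-recovery magnitude image at inversion time ti_ms.

    Assumes perfect inversion and full longitudinal recovery between
    inversions (repetition time much longer than T1).
    """
    return pd_map * np.abs(1.0 - 2.0 * np.exp(-ti_ms / t1_map_ms))

# Placeholder 1x2 "maps" for two tissues (illustrative values only).
pd_map = np.array([[0.7, 0.8]])
t1_map_ms = np.array([[800.0, 1200.0]])

# One synthetic image per inversion time.
images = {ti: synth_ir_image(pd_map, t1_map_ms, ti) for ti in (400.0, 600.0, 800.0)}

# Under this model a tissue is nulled at TI = T1 * ln(2).
null_ti = 800.0 * np.log(2.0)
nulled = synth_ir_image(pd_map, t1_map_ms, null_ti)
```

Choosing TI near T$_1$·ln 2 of one tissue suppresses it while neighbouring tissues with different T$_1$ remain visible, which is the basic reason images at several inversion times can separate structures that look similar in a single contrast.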
Multimedia
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also find that a naive defense direction can be coupled with a utility-degrading direction, and that excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection detection stage to activate the defense on demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS can achieve robust defense against multimodal IPI while maximally preserving the MLLM's utility.
EXR: An Interactive Immersive EHR Visualization in Extended Reality
This paper presents the design and implementation of an Extended Reality (XR) platform for immersive, interactive visualization of Electronic Health Records (EHRs). The system extends beyond conventional 2D interfaces by visualizing both structured and unstructured patient data into a shared 3D environment, enabling intuitive exploration and real-time collaboration. The modular infrastructure integrates FHIR-based EHR data with volumetric medical imaging and AI-generated segmentation, ensuring interoperability with modern healthcare systems. The platform's capabilities are demonstrated using synthetic EHR datasets and computed tomography (CT)-derived spine models processed through an AI-powered segmentation pipeline. This work suggests that such integrated XR solutions could form the foundation for next-generation clinical decision-support tools, where advanced data infrastructures are directly accessible in an interactive and spatially rich environment.
comment: 11 pages, 6 figures. Preprint version. This paper has been accepted to IEEE ICIR 2025. This is the author-prepared version and not the final published version. The final version will appear in IEEE Xplo
Computation and Language
Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms
In the era of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) architectures are gaining significant attention for their ability to ground language generation in reliable knowledge sources. Despite their impressive effectiveness in many areas, RAG systems based solely on semantic similarity often fail to ensure factual accuracy in specialized domains, where terminological ambiguity can affect retrieval relevance. This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking to improve the accuracy of educational question-answering systems in Italian. The system includes a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments were conducted on two benchmarks: a custom academic dataset and the standard SQuAD-it dataset. Results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach, while the cross-encoder achieves the best results on the general-domain dataset. These findings confirm the presence of an effect of domain mismatch and highlight the importance of domain adaptation and hybrid ranking strategies to enhance factual precision and reliability in retrieval-augmented generation. They also demonstrate the potential of entity-aware RAG systems in educational environments, fostering adaptive and reliable AI-based tutoring tools.
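Of the three re-ranking strategies named in the abstract above, reciprocal rank fusion is the simplest to show concretely. In its standard form each document receives the score Σ 1/(k + rank) summed over the input rankings. The sketch below is a generic version, not the paper's code: the constant k = 60 is the commonly used default rather than a value reported in the abstract, and the document identifiers are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of a semantic retriever and an entity-based retriever.
semantic_ranking = ["d1", "d2", "d3"]
entity_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([semantic_ranking, entity_ranking])
```

Documents that appear near the top of both lists (here `d1` and `d3`) rise above documents found by only one retriever, and no score normalization between the two retrievers is needed because only ranks are used.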
M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
comment: Preprint
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
comment: Code is available at https://github.com/Princeton-AI2-Lab/ZoomClick
To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis
How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.
Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures ICSE 2026
Bug localization in multi-repository microservice architectures is challenging due to the semantic gap between natural language bug reports and code, LLM context limitations, and the need to first identify the correct repository. We propose reframing this as a natural language reasoning task by transforming codebases into hierarchical NL summaries and performing NL-to-NL search instead of cross-modal retrieval. Our approach builds context-aware summaries at file, directory, and repository levels, then uses a two-phase search: first routing bug reports to relevant repositories, then performing top-down localization within those repositories. Evaluated on DNext, an industrial system with 46 repositories and 1.1M lines of code, our method achieves Pass@10 of 0.82 and MRR of 0.50, significantly outperforming retrieval baselines and agentic RAG systems like GitHub Copilot and Cursor. This work demonstrates that engineered natural language representations can be more effective than raw source code for scalable bug localization, providing an interpretable repository -> directory -> file search path, which is vital for building trust in enterprise AI tools by providing essential transparency.
comment: Accepted at LLM4Code Workshop, ICSE 2026
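The Pass@10 and MRR figures reported for this entry are ranking metrics. As a minimal sketch, assuming the common definitions (Pass@k is the share of bug reports with at least one correct file in the top k; MRR averages the reciprocal rank of the first correct file), they could be computed as follows. The file names and the `runs` structure are made up for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked predictions, ground-truth files)


def first_hit_rank(ranked: Sequence[str], relevant: Set[str]) -> int:
    """1-based rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return rank
    return 0


def pass_at_k(runs: List[Run], k: int) -> float:
    """Share of queries with at least one relevant item in the top k."""
    hits = sum(1 for ranked, rel in runs if first_hit_rank(ranked[:k], rel) > 0)
    return hits / len(runs)


def mrr(runs: List[Run]) -> float:
    """Mean reciprocal rank; queries with no hit contribute 0."""
    total = 0.0
    for ranked, rel in runs:
        rank = first_hit_rank(ranked, rel)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)


# Hypothetical toy data: three bug reports with ranked file predictions.
runs: List[Run] = [
    (["a.py", "b.py", "c.py"], {"b.py"}),  # first hit at rank 2
    (["x.py", "y.py"], {"x.py"}),          # first hit at rank 1
    (["m.py", "n.py"], {"z.py"}),          # no hit
]
print(pass_at_k(runs, 2), mrr(runs))
```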
Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework
Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM's answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.
Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy
This is the fourth in a series of short reports that help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. Here, we ask whether assigning personas to models improves performance on difficult objective multiple-choice questions. We study both domain-specific expert personas and low-knowledge personas, evaluating six models on GPQA Diamond (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024), graduate-level questions spanning science, engineering, and law. We tested three approaches: -In-Domain Experts: Assigning the model an expert persona ("you are a physics expert") matched to the problem type (physics problems) had no significant impact on performance (with the exception of the Gemini 2.0 Flash model). -Off-Domain Experts (Domain-Mismatched): Assigning the model an expert persona ("you are a physics expert") not matched to the problem type (law problems) resulted in marginal differences. -Low-Knowledge Personas: We assigned the model negative capability personas (layperson, young child, toddler), which were generally harmful to benchmark accuracy. Across both benchmarks, persona prompts generally did not improve accuracy relative to a no-persona baseline. Expert personas showed no consistent benefit across models, with few exceptions. Domain-mismatched expert personas sometimes degraded performance. Low-knowledge personas often reduced accuracy. These results are about the accuracy of answers only; personas may serve other purposes (such as altering the tone of outputs), beyond improving factual performance.
Vague Knowledge: Information without Transitivity and Partitions
I relax the standard assumptions of transitivity and partition structure in economic models of information to formalize vague knowledge: non-transitive indistinguishability over states. I show that vague knowledge, while failing to partition the state space, remains informative by distinguishing some states from others. Moreover, it can only be faithfully expressed through vague communication with blurred boundaries. My results provide microfoundations for the prevalence of natural language communication and qualitative reasoning in the real world, where knowledge is often vague.
Heard or Halted? Gender, Interruptions, and Emotional Tone in U.S. Supreme Court Oral Arguments
This study examines how interruptions during U.S. Supreme Court oral arguments shape both the semantic content and emotional tone of advocates' speech, with a focus on gendered dynamics in judicial discourse. Using the ConvoKit Supreme Court Corpus (2010-2019), we analyze 12,663 speech chunks from advocate-justice interactions to assess whether interruptions alter the meaning of an advocate's argument and whether interruptions toward female advocates exhibit more negative emotional valence. Semantic shifts are quantified using GloVe-based sentence embeddings, while sentiment is measured through lexicon-based analysis. We find that semantic similarity between pre- and post-interruption speech remains consistently high, suggesting that interruptions do not substantially alter argumentative content. However, interruptions directed at female advocates contain significantly higher levels of negative sentiment. These results deepen empirical understanding of gendered communication in elite institutional settings and demonstrate the value of computational linguistic methods for studying power, discourse, and equity in judicial proceedings.
comment: 12 pages, 5 figures, 1 table. Includes appendix. Code available at: https://github.com/1TSHARUKA/Emotional_Interruption_Analysis
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
comment: Website: https://activevideoperception.github.io/
Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning
Recent advances in large language models (LLMs) show impressive performance in open-ended story generation, but fine-grained stylistic control remains limited. Existing methods often rely on shallow cues (e.g., names or topics) to simulate authorial style, without robust evaluation. In this work, we present a training framework for style-conditioned story generation using Group Relative Policy Optimization (GRPO) and a custom multi-reward setup. The style reward is derived from a fine-tuned sentence transformer using authorship verification (AV) signals, combined with content and completeness scores to stabilize long-form narrative generation. We conduct experiments using fiction by Mark Twain, a prominent 19th-century American author, with The Adventures of Huckleberry Finn serving as the reference style exemplar. Our 8B model outperforms larger baselines such as GPT-4o and Claude Sonnet 4 in AV-style metrics, achieving a style score of 0.628 and competitive content quality. Results demonstrate the feasibility of agentic stylistic generation with moderate model size and task-specific training. While the output is clearly style-aligned, narrative completeness remains a challenge, indicating future work is needed to better model global coherence and story resolution.
Efficient Text Classification with Conformal In-Context Learning
Large Language Models (LLMs) demonstrate strong in-context learning abilities, yet their effectiveness in text classification depends heavily on prompt design and incurs substantial computational cost. Conformal In-Context Learning (CICLe) has been proposed as a resource-efficient framework that integrates a lightweight base classifier with Conformal Prediction to guide LLM prompting by adaptively reducing the set of candidate classes. However, its broader applicability and efficiency benefits beyond a single domain have not yet been systematically explored. In this paper, we present a comprehensive evaluation of CICLe across diverse NLP classification benchmarks. The results show that CICLe consistently improves over its base classifier and outperforms few-shot prompting baselines when the sample size is sufficient for training the base classifier, and performs comparably in low-data regimes. In terms of efficiency, CICLe reduces the number of shots and prompt length by up to 34.45% and 25.16%, respectively, and enables the use of smaller models with competitive performance. CICLe is furthermore particularly advantageous for text classification tasks with high class imbalance. These findings highlight CICLe as a practical and scalable approach for efficient text classification, combining the robustness of traditional classifiers with the adaptability of LLMs, and achieving substantial gains in data and computational efficiency.
comment: 10 pages, 4 tables, 2 figures
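To make the idea of "adaptively reducing the set of candidate classes" concrete, here is a minimal sketch of standard split conformal prediction applied to a base classifier's probabilities. It is not the CICLe implementation: the nonconformity score (one minus the probability of the true class), the calibration scores, the class labels, and `alpha` are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile of calibration nonconformity scores.

    Each score is assumed to be 1 - p(true class) on a held-out example.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(cal_scores)[rank - 1]


def candidate_classes(probs: Dict[str, float], qhat: float) -> List[str]:
    """Classes whose nonconformity 1 - p stays within the threshold,
    ordered from most to least probable."""
    kept = [label for label, p in probs.items() if 1.0 - p <= qhat]
    return sorted(kept, key=lambda label: -probs[label])


# Hypothetical calibration scores and base-classifier output for one text.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.75, 0.90]
qhat = conformal_threshold(cal_scores, alpha=0.25)
probs = {"sports": 0.55, "politics": 0.30, "tech": 0.10, "health": 0.05}
print(candidate_classes(probs, qhat))  # only these labels would go into the prompt
```

In a pipeline of this kind, a prediction set containing a single class could be answered by the base classifier alone, with the LLM prompted only when several candidates remain; whether CICLe does exactly this is not stated in the abstract.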
Big Tech-Funded AI Papers Have Higher Citation Impact, Greater Insularity, and Larger Recency Bias
Over the past four decades, artificial intelligence (AI) research has flourished at the nexus of academia and industry. However, Big Tech companies have increasingly acquired the edge in computational resources, big data, and talent. So far, it has been largely unclear how many papers the industry funds, how their citation impact compares to non-funded papers, and what drives industry interest. This study fills that gap by quantifying the number of industry-funded papers at 10 top AI conferences (e.g., ICLR, CVPR, AAAI, ACL) and their citation influence. We analyze about 49.8K papers, about 1.8M citations from AI papers to other papers, and about 2.3M citations from other papers to AI papers from 1998-2022 in Scopus. Through seven research questions, we examine the volume and evolution of industry funding in AI research, the citation impact of funded papers, the diversity and temporal range of their citations, and the subfields in which industry predominantly acts. Our findings reveal that industry presence has grown markedly since 2015, from less than 2 percent to more than 11 percent in 2020. Between 2018 and 2022, 12 percent of industry-funded papers achieved high citation rates as measured by the h5-index, compared to 4 percent of non-industry-funded papers and 2 percent of non-funded papers. Top AI conferences engage more with industry-funded research than non-funded research, as measured by our newly proposed metric, the Citation Preference Ratio (CPR). We show that industry-funded research is increasingly insular, citing predominantly other industry-funded papers while referencing fewer non-funded papers. These findings reveal new trends in AI research funding, including a shift towards more industry-funded papers and their growing citation impact, greater insularity of industry-funded work than non-funded work, and a preference of industry-funded research to cite recent work.
comment: Published at IEEE (ACDSA)
Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains
We present a methodology for improving the accuracy of faithfulness evaluation in Large Language Models (LLMs). The proposed methodology is based on the combination of elementary faithfulness metrics into a combined (fused) metric, for the purpose of improving the faithfulness of LLM outputs. The proposed strategy for metric fusion deploys a tree-based model to identify the importance of each metric, which is driven by the integration of human judgements evaluating the faithfulness of LLM responses. This fused metric is demonstrated to correlate more strongly with human judgements across all tested domains for faithfulness. Improving the ability to evaluate the faithfulness of LLMs allows greater confidence to be placed in models, enabling their use in a greater diversity of scenarios. Additionally, we homogenise a collection of datasets across question answering and dialogue-based domains and implement human judgements and LLM responses within this dataset, allowing for the reproduction and trialling of faithfulness evaluation across domains.
comment: 9 pages, conference paper
Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods
Retrieving case law is a time-consuming task predominantly carried out by querying databases. We provide a comparison of two models in three different settings for Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI), (ii) a domain-specific BERT trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation including IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance via paired bootstrap, and an nDCG diagnosis supported with qualitative analysis. Despite modest absolute nDCG (expected under noisy labels), the general OpenAI embedder decisively outperforms the domain pre-trained BERT in both settings at @10/@20/@100 across both thresholds; differences are statistically significant. Diagnostics attribute low absolutes to label drift and strong ideals rather than lack of utility. Additionally, our framework is robust enough to be used for evaluation under a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.
comment: The manuscript has been accepted for presentation as a short paper at the 38th International Conference on Legal Knowledge and Information Systems (JURIX 2025) in Torino, Italy
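The evaluation described in this entry combines graded relevance, threshold binarization, and nDCG. A minimal sketch of those pieces follows, under stated assumptions: `gains` holds one graded relevance value per retrieved decision in ranked order and covers all candidates, the ideal ranking is obtained by sorting those same values, and a score at or above the threshold counts as relevant. The example values are invented; only the thresholds 0.20 and 0.28 come from the abstract.

```python
import math
from typing import List, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def binarize(gains: Sequence[float], threshold: float) -> List[float]:
    """Map graded relevance to 0/1 labels (>= threshold counts as relevant)."""
    return [1.0 if g >= threshold else 0.0 for g in gains]


# Hypothetical keyword-overlap scores for five retrieved decisions.
overlap = [0.10, 0.35, 0.22, 0.05, 0.30]
for name, threshold in (("balanced", 0.20), ("strict", 0.28)):
    labels = binarize(overlap, threshold)
    print(name, labels, round(ndcg_at_k(labels, 5), 3))
```

A stricter threshold leaves fewer positives, so a single misplaced relevant item moves nDCG more, which is one reason to report both settings.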
MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation
The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.
comment: Work In Progress
Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.
comment: 11 pages, 6 figures. Code available at https://github.com/XD111ds/ILVR
Grounded Multilingual Medical Reasoning for Question Answering with Large Language Models
Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and primarily rely on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces grounded in factual medical knowledge. We produce 500k traces in English, Italian, and Spanish, using a retrieval-augmented generation approach over medical information from Wikipedia. The traces are generated to solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of safer, more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.
comment: Under Review
A Greek Government Decisions Dataset for Public-Sector Analysis and Insight
We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, featuring high-quality raw text extracted from PDFs. It is released with raw extracted text in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs) respectively, including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.
Ontology Learning with LLMs: A Benchmark Study on Axiom Identification
Ontologies are an important tool for structuring domain knowledge, but their development is a complex task that requires significant modelling and domain expertise. Ontology learning, aimed at automating this process, has seen advancements in the past decade with the improvement of Natural Language Processing techniques, and especially with the recent growth of Large Language Models (LLMs). This paper investigates the challenge of identifying axioms: fundamental ontology components that define logical relations between classes and properties. In this work, we introduce an Ontology Axiom Benchmark, OntoAxiom, and systematically test LLMs on that benchmark for axiom identification, evaluating different prompting strategies, ontologies, and axiom types. The benchmark consists of nine medium-sized ontologies with a total of 17,118 triples and 2,771 axioms. We focus on subclass, disjoint, subproperty, domain, and range axioms. To evaluate LLM performance, we compare twelve LLMs with three shot settings and two prompting strategies: a Direct approach where we query all axioms at once, versus an Axiom-by-Axiom (AbA) approach, where each prompt queries for one axiom only. Our findings show that the AbA prompting leads to higher F1 scores than the direct approach. However, performance varies across axioms, suggesting that certain axioms are more challenging to identify. The domain also influences performance: the FOAF ontology achieves a score of 0.642 for the subclass axiom, while the music ontology reaches only 0.218. Larger LLMs outperform smaller ones, but smaller models may still be viable for resource-constrained settings. Although performance overall is not high enough to fully automate axiom identification, LLMs can provide valuable candidate axioms to support ontology engineers with the development and refinement of ontologies.
comment: Submitted to Semantic Web Journal, under review
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-Clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.
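The abstract does not give the ERC formula, so the following is only a rough sketch of what a bidirectional constraint on an entropy ratio could look like. Everything here is an assumption for illustration: the ratio is taken as the entropy of the new token distribution divided by that of the old one, the bounds `low` and `high` are arbitrary, and a token whose ratio leaves the band is simply excluded from the update.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_ratio(new_probs: Sequence[float], old_probs: Sequence[float]) -> float:
    """Entropy of the current policy relative to the previous policy.

    Assumes the previous distribution is not deterministic (entropy > 0).
    """
    return entropy(new_probs) / entropy(old_probs)


def erc_keep(new_probs: Sequence[float], old_probs: Sequence[float],
             low: float = 0.1, high: float = 0.1) -> bool:
    """True if the ratio stays inside the assumed band [1 - low, 1 + high]."""
    ratio = entropy_ratio(new_probs, old_probs)
    return 1.0 - low <= ratio <= 1.0 + high


# Hypothetical next-token distributions over a three-token vocabulary.
old = [0.5, 0.3, 0.2]
small_step = [0.52, 0.29, 0.19]  # entropy barely changes: kept
collapsed = [0.90, 0.05, 0.05]   # entropy drops sharply: excluded
print(erc_keep(small_step, old), erc_keep(collapsed, old))
```

Unlike a per-sample importance ratio, this quantity depends on the whole distribution, including actions that were never sampled, which is the gap the abstract says ERC is meant to cover.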
Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems
Mathematical Word Problems (MWPs) are among the most challenging tasks in natural language processing because they require both linguistic understanding and multi-step numerical reasoning. While Chain-of-Thought (CoT) prompting has shown promise, its linear structure often propagates errors, limiting overall effectiveness. To address this limitation, we present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Owing to computational and token-cost constraints, we evaluate a curated set of 100 representative problems across multiple large language models (LLMs), including GPT-OSS and LLaMA variants, under standard prompting, CoT, and ToT strategies. Our results show that CoT improves baseline accuracy from 78% (standard prompting) to 83% on average, while ToT further increases performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. These improvements highlight that ToT is particularly effective in medium-to-large-scale models but may offer less advantage for smaller ones. Overall, our findings establish ToT as a robust framework for solving mathematical problems in low-resource languages such as Bengali. More broadly, this study shows that structured reasoning methods like ToT can provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.
Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches
Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20b). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p < 0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions. Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures
Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region's linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks that are natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA, covering eight languages, 21,640 samples, across three subsets: general, in-the-wild, and content generation. The experimental results from our benchmark demonstrate that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios and underperform when compared to English texts.
comment: Under review
Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment AAAI 2026
Large Language Models (LLMs) are typically aligned with human values using preference data or predefined principles such as helpfulness, honesty, and harmlessness. However, as AI systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), such value systems may become insufficient. In addition, human feedback-based alignment remains resource-intensive and difficult to scale. While AI-feedback-based self-improving alignment methods have been explored as a scalable alternative, they have largely remained constrained to conventional alignment values. In this work, we explore both a more holistic alignment objective and a scalable, self-improving alignment approach. Aiming to transcend conventional alignment norms, we introduce Collective Agency (CA), a unified and open-ended alignment value that encourages integrated agentic capabilities. We also propose Dynamic Alignment, an alignment framework that enables an LLM to iteratively align itself. Dynamic Alignment comprises two key components: (1) automated training dataset generation with LLMs, and (2) a self-rewarding mechanism, where the policy model evaluates its own output candidates and assigns rewards for GRPO-based learning. Experimental results demonstrate that our approach successfully aligns the model to CA while preserving general NLP capabilities.
comment: 8 pages, 4 figures, to appear in AAAI 2026 AIGOV Workshop
ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering LREC 2026
Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.
comment: Submitted to LREC 2026. This work is an evolution of our earlier preprint arXiv:2507.23334
LMSpell: Neural Spell Checking for Low-Resource Languages
Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.
SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs
Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency due to limited hardware support. For example, W4A8 can only achieve the same peak TOPS as W8A8, whereas the GPU-supported sparse data format (2:4 semi-structured sparsity) is seldom adopted due to the loss of accuracy. To bridge this gap, in this paper, we propose the Sparse-Quantized Format (SQ-format), which is a unified data format for quantization and sparsification that could be readily supported by new hardware and existing GPUs. SQ-format makes use of the fact that sparse matrices can be accelerated in high precision, and low-precision matrix multiplication can also be accelerated accordingly. As such, SQ-format is proposed to achieve Pareto improvement between performance and throughput. This format is particularly suitable for activations with outlier inequality status and makes their static compression possible. We show the state-of-the-art PTQ performance with SQ-format, propose the hardware required to support it, and further offer the design exploration and insights for the next-generation AI accelerators.
Learning from Self Critique and Refinement for Faithful LLM Summarization
Large Language Models (LLMs) often suffer from hallucinations, that is, output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can be reduced by iteratively critiquing and refining previously generated outputs using either the same model or a more powerful teacher model as the critic. However, these approaches either require additional test-time compute or assume access to more powerful teacher models, making them costly and less practical. In this work, we propose Self Critique and Refinement-based Preference Optimization (SCRPO), which is a self-supervised training framework that first constructs a preference dataset by leveraging the LLM's own critique and refinement capabilities, and then applies preference learning to improve the same LLM for faithful summarization. Experiments on three summarization benchmarks (XSUM, CNNDM, and SAMSum) demonstrate that our approach outperforms state-of-the-art self-supervised learning methods in terms of faithfulness metrics while either maintaining or improving other metrics that measure the overall quality of the summary. Moreover, compared to test-time refinement, our approach not only improves efficiency but also results in more faithful summaries.
Mitigating Self-Preference by Authorship Obfuscation
Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair their integrity in evaluations. One such bias is self-preference: LM judges preferring their own answers over those produced by other LMs or humans. The bias is hard to eliminate as frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges' ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparison to obfuscate the authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also uncover fundamental challenges to eliminating the bias: when we extrapolate our perturbations to a more complete neutralization of stylistic differences between the evaluation candidates, self-preference recovers. Our findings suggest that self-recognition and self-preference can happen on many semantic levels, and complete mitigation remains challenging despite promising initial results.
Text Rationalization for Robust Causal Effect Estimation
Recent advances in natural language processing have enabled the increasing use of text data in causal inference, particularly for adjusting confounding factors in treatment effect estimation. Although high-dimensional text can encode rich contextual information, it also poses unique challenges for causal identification and estimation. In particular, the positivity assumption, which requires sufficient treatment overlap across confounder values, is often violated at the observational level, when massive text is represented in feature spaces. Redundant or spurious textual features inflate dimensionality, producing extreme propensity scores, unstable weights, and inflated variance in effect estimates. We address these challenges with Confounding-Aware Token Rationalization (CATR), a framework that selects a sparse necessary subset of tokens using a residual-independence diagnostic designed to preserve confounding information sufficient for unconfoundedness. By discarding irrelevant texts while retaining key signals, CATR mitigates observational-level positivity violations and stabilizes downstream causal effect estimators. Experiments on synthetic data and a real-world study using the MIMIC-III database demonstrate that CATR yields more accurate, stable, and interpretable causal effect estimates than existing baselines.
Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change
This study demonstrates how weakly supervised, hybrid neural-symbolic methods can yield significant new insights into the evolution of a morphologically rich, low-resource language. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit's overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.
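The calibration figure quoted above (ECE) can be illustrated with a minimal sketch of expected calibration error using equal-width confidence bins. The bin count, function name, and example inputs are assumptions for illustration; the abstract does not state how the authors binned their predictions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width bins over [0, 1].

    confidences: predicted confidence per example, each in [0, 1]
    correct:     1 (or True) if the prediction was right, else 0 (or False)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical example: two predictions at 0.8 confidence, one of them correct.
gap = expected_calibration_error([0.8, 0.8], [1, 0])
```

A value near zero means that, within each bin, average confidence tracks observed accuracy.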
The Effect of Document Summarization on LLM-Based Relevance Judgments
Relevance judgments are central to the evaluation of Information Retrieval (IR) systems, but obtaining them from human annotators is costly and time-consuming. Large Language Models (LLMs) have recently been proposed as automated assessors, showing promising alignment with human annotations. Most prior studies have treated documents as fixed units, feeding their full content directly to LLM assessors. We investigate how text summarization affects the reliability of LLM-based judgments and their downstream impact on IR evaluation. Using state-of-the-art LLMs across multiple TREC collections, we compare judgments made from full documents with those based on LLM-generated summaries of different lengths. We examine their agreement with human labels, their effect on retrieval effectiveness evaluation, and their influence on IR systems' ranking stability. Our findings show that summary-based judgments achieve comparable stability in systems' ranking to full-document judgments, while introducing systematic shifts in label distributions and biases that vary by model and dataset. These results highlight summarization as both an opportunity for more efficient large-scale IR evaluation and a methodological choice with important implications for the reliability of automatic judgments.
Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection Against LLM-Generated Threats
The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, showing an improvement of up to 27%.
comment: Published in RANLP 2025
LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning
Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., "hmm", "wait") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy-efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40-65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35-60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.
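To make the split conformal step concrete, here is a minimal sketch of one common construction, which is not necessarily the one LYNX uses. It assumes a held-out calibration set of probe scores collected at cue tokens where exiting would have been premature, and picks a threshold so that a new, exchangeable premature-exit point scores above it with probability at most delta. The function name, inputs, and the "exit only if score > threshold" rule are all assumptions.

```python
import math


def conformal_exit_threshold(premature_scores, delta):
    """Threshold on probe scores from calibration points where exiting is premature.

    Uses the ceil((n + 1) * (1 - delta))-th smallest calibration score. Exiting
    only when a score strictly exceeds this value bounds the chance of exiting
    at an exchangeable premature point by delta.
    """
    n = len(premature_scores)
    k = math.ceil((n + 1) * (1 - delta))
    if k > n:
        # Too few calibration points for this delta: never exit early.
        return math.inf
    return sorted(premature_scores)[k - 1]


# Hypothetical calibration scores and a loose tolerance of 50%.
threshold = conformal_exit_threshold([0.2, 0.7, 0.4], delta=0.5)
```

Lowering delta raises the threshold, trading token savings for fewer premature exits; with very small calibration sets the sketch falls back to never exiting.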
Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across 3 open-weights LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
comment: Preprint. Under review
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.
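As a small illustration of the quantity being discussed, L0 as defined above is the average number of SAE features that fire per token. The sketch below computes it from a batch of feature activations; the function name and the toy activations are hypothetical, and it treats any nonzero activation as "firing".

```python
def mean_l0(feature_activations):
    """Average count of nonzero SAE feature activations per token.

    feature_activations: one list of feature activations per token.
    """
    active_counts = [
        sum(1 for value in token_features if value != 0)
        for token_features in feature_activations
    ]
    return sum(active_counts) / len(active_counts)


# Hypothetical activations for two tokens over a four-feature dictionary.
l0 = mean_l0([
    [0.0, 1.2, 0.0, 0.3],  # two features fire
    [0.0, 0.0, 0.0, 0.5],  # one feature fires
])
```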
Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives
Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. We demonstrate that this collapse is a statistical artifact of undersampling rather than an inherent model limitation. To address this systematically, we introduce a theoretical framework based on optimizing a non-linear RL objective (e.g., log-likelihood). We show that this objective naturally induces a weighted gradient estimator that prioritizes difficult prompts, which can be robustly realized through adaptive sampling. Guided by this framework, we propose Reinforce-Ada, a family of algorithms that dynamically allocates inference budgets based on prompt difficulty, effectively scaling up RL compute to where it is needed most. Unlike passive filtering methods that discard low-signal prompts, Reinforce-Ada actively invests compute to recover them. We introduce two efficient realizations: an estimation-based approach and a model-free sequential sampling approach. Extensive experiments across multiple benchmarks show that Reinforce-Ada significantly outperforms uniform baselines like GRPO, recovering lost signals and accelerating convergence by up to $2\times$ while maintaining the same total inference budget. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
comment: 27 pages, 10 figures
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models EMNLP 2025
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions. Our implementation is publicly available at https://github.com/Ineedanamehere/SAE-SSV.
comment: Accepted by EMNLP 2025
The AI Productivity Index (APEX)
We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
CodeNER: Code Prompting for Named Entity Recognition
Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly, ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with the chain-of-thought prompting further improves performance.
comment: 18 pages, 6 figures
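The BIO schema mentioned above can be shown with a short sketch that converts labeled entity spans into per-token tags. This illustrates the labeling scheme only, not the paper's prompt format; the function name and the example sentence are hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) spans with exclusive end into BIO tags.

    B-<label> marks the first token of an entity, I-<label> marks the
    remaining tokens inside it, and O marks tokens outside any entity.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["Barack", "Obama", "visited", "Paris"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")])
```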
MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
Demand for mental health support through AI chatbots is surging, though current systems present several limitations, such as sycophancy, overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D.-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.
A quantitative analysis of semantic information in deep representations of text and images
Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
Pet-Bench: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services
As interest in using Large Language Models for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.
Vision-centric Token Compression in Large Language Model NeurIPS 2025
Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making token compression indispensable. We introduce Vision Centric Token Compression (Vist), a slow-fast compression framework that mirrors human reading: the fast path renders distant tokens into images, letting a frozen, lightweight vision encoder skim the low-salience context; the slow path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by 7.6% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The project is at https://github.com/CSU-JPG/VIST.
comment: NeurIPS 2025 spotlight
Towards Data-efficient Customer Intent Recognition with Prompt-based Learning Paradigm
Recognizing customer intent accurately with language models based on customer-agent conversational data is essential in today's digital customer service marketplace, but it is often hindered by the lack of sufficient labeled data. In this paper, we introduce the prompt-based learning paradigm that significantly reduces the dependency on extensive datasets. Utilizing prompted training combined with answer mapping techniques, this approach allows small language models to achieve competitive intent recognition performance with only a minimal amount of training data. Furthermore, we enhance the performance by integrating active sampling and ensemble learning strategies in the prompted training pipeline. Additionally, preliminary tests in a zero-shot setting demonstrate that, with well-crafted and detailed prompts, small language models show considerable instruction-following potential even without any further training. These results highlight the viability of semantic modeling of conversational data in a more data-efficient manner, paving the way for advancements in AI-driven customer service.
HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
Chinese Discharge Drug Recommendation in Metabolic Diseases with Large Language Models
Intelligent drug recommendation based on Electronic Health Records (EHRs) is critical for improving the quality and efficiency of clinical decision-making. By leveraging large-scale patient data, drug recommendation systems can assist physicians in selecting the most appropriate medications according to a patient's medical history, diagnoses, laboratory results, and comorbidities. Recent advances in large language models (LLMs) have shown remarkable capabilities in complex reasoning and medical text understanding, making them promising tools for drug recommendation tasks. However, the application of LLMs for Chinese clinical medication recommendation remains largely unexplored. In this work, we conduct a systematic investigation of LLM-based methodologies for Chinese discharge medication recommendation. We evaluate several representative LLM families (GLM, Llama, Qwen) under a unified methodological framework including zero-shot prompting, in-context learning, chain-of-thought prompting, and supervised fine-tuning using LoRA. We analyze model behavior across reasoning styles, error patterns, domain adaptation mechanisms, and robustness. Experimental results show that while supervised fine-tuning improves model performance, there remains substantial room for improvement, with the best model achieving an F1 score of 0.5648 and a Jaccard score of 0.4477. Our findings highlight both the potential and limitations of LLMs for Chinese drug recommendation.
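For readers unfamiliar with the two reported metrics, the sketch below computes set-based F1 and Jaccard scores for a single patient's recommended versus reference medication set. How the authors aggregate across patients is not stated in the abstract, so this covers only the per-sample case; the function name and drug names are illustrative assumptions.

```python
def set_f1_and_jaccard(predicted, reference):
    """Set-overlap F1 and Jaccard between predicted and reference drug sets."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    union = len(predicted | reference)
    if union == 0:
        # Both sets empty: treat as a perfect match (a convention, not a rule).
        return 1.0, 1.0
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return f1, overlap / union


# Hypothetical discharge medication lists for one patient.
f1, jaccard = set_f1_and_jaccard(
    ["metformin", "insulin", "aspirin"],
    ["metformin", "insulin", "atorvastatin"],
)
```

Jaccard is never higher than F1 for the same pair of sets, which is consistent with the ordering of the two scores reported above.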
Fair Text Classification via Transferable Representations
Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g., women and men) remains an open challenge. We propose an approach that extends the use of the Wasserstein Dependency Measure for learning unbiased neural text classifiers. Given the challenge of distinguishing fair from unfair information in a text encoder, we draw inspiration from adversarial training by inducing independence between representations learned for the target label and those for a sensitive attribute. We further show that Domain Adaptation can be efficiently leveraged to remove the need for access to the sensitive attributes in the dataset we cure. We provide both theoretical and empirical evidence that our approach is well-founded.
comment: arXiv admin note: text overlap with arXiv:2311.12689
Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation
The rapid development of parameter-efficient fine-tuning methods has noticeably improved the efficiency of adapting large language models. Among these, LoRA has gained widespread popularity due to its strong balance of effectiveness and parameter efficiency. However, LoRA relies on initializing two low-rank matrices whose product is zero, which limits its ability to effectively activate and leverage the original model weights, creating a potential bottleneck for optimal performance. To address this limitation, we propose IniLoRA, a novel initialization strategy that initializes the low-rank matrices to closely approximate the original model weights. Experimental results indicate that IniLoRA achieves better performance than LoRA across a range of models and tasks. Additionally, we introduce two variants, IniLoRA-$α$ and IniLoRA-$β$, both leveraging distinct initialization methods to enhance performance further.
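The zero-product initialization described above can be seen in a minimal sketch of the standard LoRA setup, where one factor starts as small random values and the other as zeros, so the adapted weight equals the original weight before any training. The dimensions, the Gaussian scale, and the weight values are arbitrary assumptions; the sketch does not show IniLoRA itself, whose initialization the abstract does not specify in detail.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


random.seed(0)
d_out, d_in, rank = 4, 3, 2

# Hypothetical frozen pretrained weight matrix (d_out x d_in).
W = [[random.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

# Standard LoRA initialization: A is small random, B is all zeros.
A = [[random.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

delta = matmul(B, A)  # d_out x d_in update, all zeros at initialization
adapted = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
```

Because `delta` is zero at the start, fine-tuning begins exactly at the pretrained model; the abstract argues that this starting point is also what limits standard LoRA.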
Experiments with Large Language Models on Retrieval-Augmented Generation for Closed-Source Simulation Software
Large Language Models (LLMs) are tools that have become indispensable in development and programming. However, they suffer from hallucinations, especially when dealing with unknown knowledge. This is particularly the case when LLMs are to be used to support closed-source software applications. Retrieval-Augmented Generation (RAG) offers an approach to use additional knowledge alongside the pre-trained knowledge of the LLM to respond to user prompts. Possible tasks range from smart autocomplete, text extraction for question answering, model summarization, component explanation, and compositional reasoning to the creation of simulation components and complete input models. This work tests existing RAG systems for closed-source simulation frameworks, in our case the mesh-free simulation software Pasimodo. Since data protection and intellectual property rights are particularly important for problems solved with closed-source software, the tests focus on execution using local LLMs. In order to enable smaller institutions to use the systems, smaller language models will be tested first. The systems show impressive results, but often fail due to insufficient information. Different approaches for improving response quality are tested. In particular, tailoring the information provided to the LLMs depending on the prompts proves to be a significant improvement. This demonstrates the great potential and the further work needed to improve information retrieval for closed-source simulation models.
comment: 16 pages, 6 tables, 2 figures
Decoding inner speech with an end-to-end brain-to-text neural interface
Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
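The word error rate (WER) figures quoted above follow the standard definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below implements that textbook formula; it is not the benchmark's official scoring script, and the function name and whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, decoding "the cat sat on mat" against the reference "the cat sat on the mat" drops one of six reference words, giving a WER of 1/6.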
HARP: Hallucination Detection via Reasoning Subspace Projection
Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained. Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.
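The projection step described in this abstract can be sketched as follows. The abstract states that SVD of the unembedding parameters yields the basis vectors and that the projected features are roughly 5% of the original dimension; it does not state which singular directions form the reasoning subspace. The sketch assumes, for illustration only, that they are the right singular vectors with the smallest singular values. The function names, matrix shapes, and `keep_fraction` parameter are likewise hypothetical.

```python
import numpy as np


def reasoning_basis(W_unembed, keep_fraction=0.05):
    """Return k orthonormal directions in hidden-state space.

    W_unembed has shape (vocab_size, d_model). ASSUMPTION: the reasoning
    subspace is spanned by the right singular vectors with the smallest
    singular values; the abstract does not specify this choice.
    """
    _, _, Vt = np.linalg.svd(W_unembed, full_matrices=False)
    d_model = W_unembed.shape[1]
    k = max(1, int(round(keep_fraction * d_model)))
    return Vt[-k:]                     # shape (k, d_model)


def project(hidden_states, basis):
    """Coordinates of each hidden state along the basis vectors."""
    return hidden_states @ basis.T     # shape (n, k)


rng = np.random.default_rng(0)
W_unembed = rng.normal(size=(500, 40))   # toy unembedding matrix
hidden = rng.normal(size=(10, 40))       # toy hidden states

basis = reasoning_basis(W_unembed)
features = project(hidden, basis)        # low-dimensional detector input
```

The resulting `features` array would then be fed to whatever classifier performs the hallucination detection; that classifier is not specified in the abstract and is omitted here.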
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released under https://github.com/AI45Lab/IS-Bench.
From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents
Amid the rapid rise of agentic dialogue models, realistic user-simulator studies are essential for tuning effective conversation strategies. This work investigates a sales-oriented agent that adapts its dialogue based on user profiles spanning age, gender, and occupation. While age and gender influence overall performance, occupation produces the most pronounced differences in conversational intent. Leveraging this insight, we introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences, resulting in shorter and more successful dialogues. Our findings highlight the importance of rich simulator profiles and demonstrate how simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.
comment: Accepted to IEEE ASRU 2025
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal
Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures - a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz's outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems.
comment: Accepted by ASE 2025
Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model's self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.
comment: Under review
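For reference, the strict exact-match verification that this abstract sets out to loosen can be sketched as below, in its greedy form: draft tokens are accepted until the first position where the target model's own choice differs, and the target's token is kept at that position. This shows only the baseline rule; FLy's entropy gate and deferred window are not reproduced because the abstract does not give their details. The function name and the token-ID lists are illustrative assumptions, and the extra token a target model can emit after a fully accepted draft is omitted for brevity.

```python
def accept_exact_match(draft_tokens, target_tokens):
    """Greedy exact-match verification for one speculative step.

    draft_tokens:  tokens proposed by the draft model.
    target_tokens: the target model's greedy choice at each of the same
                   positions, computed in one parallel forward pass.
    Returns the tokens that are kept for this step.
    """
    kept = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            kept.append(target)  # correct the first mismatch, discard the rest
            break
        kept.append(draft)
    return kept
```

With a draft of `[5, 7, 9, 2]` and target choices `[5, 7, 1, 2]`, the step keeps `[5, 7, 1]`: the final draft token is discarded even though it matches, because it was conditioned on a rejected token.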
AURA: A Diagnostic Framework for Tracking User Satisfaction of Interactive Planning Agents NeurIPS 2025
The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose AURA, an Agent-User inteRaction Assessment framework that conceptualizes the behavioral stages of interactive task planning agents. AURA offers a comprehensive assessment of agents through a set of atomic LLM evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.
comment: NeurIPS 2025 MTI-LLM Workshop. Full version is under review
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
Characterizing Language Use in a Collaborative Situated Game
Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.
Human-Computer Interaction
Eye of the Beholder: Towards Measuring Visualization Complexity
Constructing expressive and legible visualizations is a key activity for visualization designers. While numerous design guidelines exist, research on how specific graphical features affect perceived visual complexity remains limited. In this paper, we report on a crowdsourced study to collect human ratings of perceived complexity for diverse visualizations. Using these ratings as ground truth, we then evaluated three methods to estimate this perceived complexity: image analysis metrics, multilinear regression using manually coded visualization features, and automated feature extraction using a large language model (LLM). Image complexity metrics showed no correlation with human-perceived visualization complexity. Manual feature coding produced a reasonable predictive model but required substantial effort. In contrast, a zero-shot LLM (GPT-4o mini) demonstrated strong capabilities in both rating complexity and extracting relevant features. Our findings suggest that visualization complexity is truly in the eye of the beholder, yet can be effectively approximated using zero-shot LLM prompting, offering a scalable approach for evaluating the complexity of visualizations. The dataset and code for the study and data analysis can be found at https://osf.io/w85a4/
User Negotiations of Authenticity, Ownership, and Governance on AI-Generated Video Platforms: Evidence from Sora
As AI-generated video platforms rapidly advance, ethical challenges such as copyright infringement emerge. This study examines how users make sense of AI-generated videos on OpenAI's Sora by conducting a qualitative content analysis of user comments. Through a thematic analysis, we identified four dynamics that characterize how users negotiate authenticity, authorship, and platform governance on Sora. First, users acted as critical evaluators of realism, assessing micro-details such as lighting, shadows, fluid motion, and physics to judge whether AI-generated scenes could plausibly exist. Second, users increasingly shifted from passive viewers to active creators, expressing curiosity about prompts, techniques, and creative processes. Text prompts were perceived as intellectual property, generating concerns about plagiarism and remixing norms. Third, users reported blurred boundaries between real and synthetic media, worried about misinformation, and even questioned the authenticity of other commenters, suspecting bot-generated engagement. Fourth, users contested platform governance: some perceived moderation as inconsistent or opaque, while others shared tactics for evading prompt censorship through misspellings, alternative phrasing, emojis, or other languages. Despite this, many users also enforced ethical norms by discouraging the misuse of real people's images or disrespectful content. Together, these patterns highlighted how AI-mediated platforms complicate notions of reality, creativity, and rule-making in emerging digital ecosystems. Based on the findings, we discuss governance challenges in Sora and how user negotiations inform future platform governance.
When Scaffolding Breaks: Investigating Student Interaction with LLM-Based Writing Support in Real-Time K-12 EFL Classrooms
Large language models (LLMs) are promising tools for scaffolding students' English writing skills, but their effectiveness in real-time K-12 classrooms remains underexplored. Addressing this gap, our study examines the benefits and limitations of using LLMs as real-time learning support, considering how classroom constraints, such as diverse proficiency levels and limited time, affect their effectiveness. We conducted a deployment study with 157 eighth-grade students in a South Korean middle school English class over six weeks. Our findings reveal that while scaffolding improved students' ability to compose grammatically correct sentences, this step-by-step approach demotivated lower-proficiency students and increased their system reliance. We also observed challenges to classroom dynamics, where extroverted students often dominated the teacher's attention, and the system's assistance made it difficult for teachers to identify struggling students. Based on these findings, we discuss design guidelines for integrating LLMs into real-time writing classes as inclusive educational tools.
comment: Under Review
Knowing Your Uncertainty -- On the application of LLM in social sciences
Large language models (LLMs) are rapidly being integrated into computational social science research, yet their blackboxed training and designed stochastic elements in inference pose unique challenges for scientific inquiry. This article argues that applying LLMs to social scientific tasks requires explicit assessment of uncertainty, an expectation long established in both quantitative methodology in the social sciences and machine learning. We introduce a unified framework for evaluating LLM uncertainty along two dimensions: the task type (T), which distinguishes between classification, short-form, and long-form generation, and the validation type (V), which captures the availability of reference data or evaluative criteria. Drawing from both computer science and social science literature, we map existing uncertainty quantification (UQ) methods to this T-V typology and offer practical recommendations for researchers. Our framework provides both a methodological safeguard and a practical guide for integrating LLMs into rigorous social science research.
comment: 49 pages, 10 figures
Classification and taxonomy of mobile application usability issues
Despite years of research on testing the usability of mobile applications, our understanding of the issues their users experience still remains fragmented and underexplored. While most earlier studies have provided interesting insights, they have varying limitations in methodology, input diversity, and depth of analysis. In contrast, this study employs a triangulation strategy, using two research methods (systematic literature review and interview) and two data sources (scholarly literature and expert knowledge) to explore the traits underlying usability issues. Our study contributes to the field of human-computer interaction (HCI) by presenting a catalog of 16 usability issue categories, enriched with corresponding keywords and extended into a taxonomy, as well as a novel three-tier app-user-resource (AUR) classification system. At the first app level, usability issues arise from user interface design, as well as from efficiency, errors, and operability. At the second user level, they influence cognitive load, effectiveness, ease of use, learnability, memorability, and understandability. At the third resource level, usability issues stem from network quality and hardware, such as battery life, CPU speed, physical device button size and availability, RAM capacity, and screen size. The root cause of the usability issues is the user interface design. Detailed findings and takeaways for both researchers and practitioners are also discussed. Further research could focus on developing a measurement model for the identified variables to confirm the direction and strength of their relationships with perceived usability. Software vendors can also benefit by updating existing quality assurance programs, review and audit tools, as well as testing checklists.
comment: 55 pages, 5 figures, 9 tables, 129 references
EXR: An Interactive Immersive EHR Visualization in Extended Reality
This paper presents the design and implementation of an Extended Reality (XR) platform for immersive, interactive visualization of Electronic Health Records (EHRs). The system extends beyond conventional 2D interfaces by visualizing both structured and unstructured patient data into a shared 3D environment, enabling intuitive exploration and real-time collaboration. The modular infrastructure integrates FHIR-based EHR data with volumetric medical imaging and AI-generated segmentation, ensuring interoperability with modern healthcare systems. The platform's capabilities are demonstrated using synthetic EHR datasets and computed tomography (CT)-derived spine models processed through an AI-powered segmentation pipeline. This work suggests that such integrated XR solutions could form the foundation for next-generation clinical decision-support tools, where advanced data infrastructures are directly accessible in an interactive and spatially rich environment.
comment: 11 pages, 6 figures. Preprint version. This paper has been accepted to IEEE ICIR 2025. This is the author-prepared version and not the final published version. The final version will appear in IEEE Xplo
From Vision to Touch: Bridging Visual and Tactile Principles for Accessible Data Representation
Tactile graphics are widely used to present maps and statistical diagrams to blind and low vision (BLV) people, with accessibility guidelines recommending their use for graphics where spatial relationships are important. Their use is expected to grow with the advent of commodity refreshable tactile displays. However, in stark contrast to visual information graphics, we lack a clear understanding of the benefits that well-designed tactile information graphics offer over text descriptions for BLV people. To address this gap, we introduce a framework considering the three components of encoding, perception and cognition to examine the known benefits for visual information graphics and explore their applicability to tactile information graphics. This work establishes a preliminary theoretical foundation for the tactile-first design of information graphics and identifies future research avenues.
comment: To be published by IEEE as part of the 2025 Visualization Conference (VIS)
Simulating Life Paths with Digital Twins: AI-Generated Future Selves Influence Decision-Making and Expand Human Choice
Major life transitions demand high-stakes decisions, yet people often struggle to imagine how their future selves will live with the consequences. To support this limited capacity for mental time travel, we introduce AI-enabled digital twins that have "lived through" simulated life scenarios. Rather than predicting optimal outcomes, these simulations extend prospective cognition by making alternative futures vivid enough to support deliberation without assuming which path is best. We evaluate this idea in a randomized controlled study (N=192) using multimodal synthesis - facial age progression, voice cloning, and large language model dialogue - to create personalized avatars representing participants 30 years forward. Young adults 18 to 28 years old described pending binary decisions and were assigned to guided imagination or one of four avatar conditions: single-option, balanced dual-option, or expanded three-option with a system-generated novel alternative. Results showed asymmetric effects: single-sided avatars increased shifts toward the presented option, while balanced presentation produced movement toward both. Introducing a system-generated third option increased adoption of this new alternative compared to control, suggesting that AI-generated future selves can expand choice by surfacing paths that might otherwise go unnoticed. Participants rated evaluative reasoning and eudaimonic meaning-making as more important than emotional or visual vividness. Perceived persuasiveness and baseline agency predicted decision change. These findings advance understanding of AI-mediated episodic prospection and raise questions about autonomy in AI-augmented decisions.
CLIO: A Tour Guide Robot with Co-speech Actions for Visual Attention Guidance and Enhanced User Engagement
While audio guides can offer rich information about an exhibit, it is challenging for visitors to focus on specific exhibit details based only on the verbal description. We present CLIO, a tour guide robot with co-speech actions to direct visitors' visual attention and thus enhance overall user engagement in a guided tour. CLIO is equipped with designed actions to engage visitors. It establishes eye contact with the visitor by tracking the visitor's face and blinking its eyes, or orients the visitor's attention with its head movement and laser pointer. We further use a Large Language Model (LLM) to coordinate the designed actions with a given narrative script for the exhibition. We conducted a user study to evaluate the CLIO system in a mock-up exhibition of historical photographs. We collected feedback from questionnaires and quantitative data from a mobile eye tracker. Experimental results validated that the engaging actions are well designed and demonstrated their efficacy in guiding the visual attention of visitors. The results showed that CLIO achieved enhanced engagement compared to the baseline system with only audio guidance.
comment: 10 pages, 7 figures, human-robot interaction
GuideNav: User-Informed Development of a Vision-Only Robotic Navigation Assistant For Blind Travelers
While commendable progress has been made in user-centric research on mobile assistive systems for blind and low-vision (BLV) individuals, references that directly inform robot navigation design remain rare. To bridge this gap, we conducted a comprehensive human study involving interviews with 26 guide dog handlers, four white cane users, nine guide dog trainers, and one O&M trainer, along with 15+ hours of observing guide dog-assisted walking. After de-identification, we open-sourced the dataset to promote human-centered development and informed decision-making for assistive systems for BLV people. Building on insights from this formative study, we developed GuideNav, a vision-only, teach-and-repeat navigation system. Inspired by how guide dogs are trained and assist their handlers, GuideNav autonomously repeats a path demonstrated by a sighted person using a robot. Specifically, the system constructs a topological representation of the taught route, integrates visual place recognition with temporal filtering, and employs a relative pose estimator to compute navigation actions - all without relying on costly, heavy, power-hungry sensors such as LiDAR. In field tests, GuideNav consistently achieved kilometer-scale route following across five outdoor environments, maintaining reliability despite noticeable scene variations between teach and repeat runs. A user study with 3 guide dog handlers and 1 guide dog trainer further confirmed the system's feasibility, marking (to our knowledge) the first demonstration of a quadruped mobile system retrieving a path in a manner comparable to guide dogs.
Protocol Futuring: Speculating Second-Order Dynamics of Protocols in Sociotechnical Infrastructural Futures
Drawing on infrastructure studies in HCI and CSCW, this paper introduces Protocol Futuring, a methodological framework that extends design futuring by foregrounding protocols (rules, standards, and coordination mechanisms) as the primary material of speculative inquiry. Rather than imagining discrete future artifacts, Protocol Futuring examines how protocol rules accumulate drift, jam, and other second-order effects over long temporal horizons. We demonstrate the method through a case study of Knowledge Futurama, a multi-team participatory workshop exploring millennial-scale knowledge preservation. Using a relay format in which teams inherited and reinterpreted partially formed designs, the workshop revealed how ambiguous handovers, adversarial reinterpretations, shifting cultural norms, and crisis dynamics transform protocols as they move across communities and epochs. The case shows how Protocol Futuring makes infrastructural politics and long-run consequences analytically visible. We discuss the method's strengths, limitations, and implications for researchers seeking to investigate emergent sociotechnical systems whose impacts unfold over extended timescales.
comment: Submitted to CHI 2026. Under review
Future You: Designing and Evaluating Multimodal AI-generated Digital Twins for Strengthening Future Self-Continuity
What if users could meet their future selves today? AI-generated future selves simulate meaningful encounters with a digital twin decades in the future. As AI systems advance, combining cloned voices, age-progressed facial rendering, and autobiographical narratives, a central question emerges: Does the modality of these future selves alter their psychological and affective impact? How might a text-based chatbot, a voice-only system, or a photorealistic avatar shape present-day decisions and our feeling of connection to the future? We report a randomized controlled study (N=92) evaluating three modalities of AI-generated future selves (text, voice, avatar) against a neutral control condition. We also report a systematic model evaluation between Claude 4 and three other Large Language Models (LLMs), assessing Claude 4 across psychological and interaction dimensions and establishing conversational AI quality as a critical determinant of intervention effectiveness. All personalized modalities strengthened Future Self-Continuity (FSC), emotional well-being, and motivation compared to control, with avatar producing the largest vividness gains, yet with no significant differences between formats. Interaction quality metrics, particularly persuasiveness, realism, and user engagement, emerged as robust predictors of psychological and affective outcomes, indicating that how compelling the interaction feels matters more than the form it takes. Content analysis found thematic patterns: text emphasized career planning, while voice and avatar facilitated personal reflection. Claude 4 outperformed ChatGPT 3.5, Llama 4, and Qwen 3 in enhancing psychological, affective, and FSC outcomes.
Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Honesty
This study audits whether language models disclose their AI nature when assigned professional personas and questioned about their expertise. When models maintain false professional credentials, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5% - an 8.8-fold difference that emerged before any epistemic probing. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 39.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($ΔR_{adj}^{2}=0.359$ vs $0.018$). Reasoning variants showed heterogeneous effects: some exhibited up to 48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error ($κ=0.908$). These patterns create trust calibration risks when users encounter the same model across professional contexts. Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.
comment: 47 pages, 12 figures, 12 tables, Submitted to FAccT; clarify user harm, add permission experiment, condense paper
Exploring ChatGPT's Capabilities, Stability, Potential and Risks in Conducting Psychological Counseling through Simulations in School Counseling
This study explores ChatGPT's capabilities, stability, and risks in simulating psychological counseling sessions in a school counseling context. Using scripted role-plays between a human counselor and an AI client, we examine how a large language model performs core counseling skills such as empathy, reflection, summarizing, and asking open-ended questions, as well as its ability to maintain therapeutic communication over time. We focus on how consistently ChatGPT can behave like a "virtual client" for school counselors in training, and how its responses might support or disrupt counselor skill development, supervision, and practice. At the same time, we analyze potential risks, including inaccurate or unsafe suggestions, over-compliance with counselor prompts, and the illusion of a competent therapist where no real professional judgment exists. The findings suggest that ChatGPT can serve as a low-cost, always-available training tool for practicing counseling techniques and interviewing skills in education and mental health settings, but it should not be viewed as a replacement for a human therapist or school counselor. We propose practical guidelines for educators, supervisors, and researchers who wish to use ChatGPT or similar LLM-based conversational agents in counseling training, highlighting how to leverage its potential while managing ethical, pedagogical, and psychological risks.
Enhancing EEG Signal-Based Emotion Recognition with Synthetic Data: Diffusion Model Approach
Emotions are crucial in human life, influencing perceptions, relationships, behaviour, and choices. Emotion recognition using Electroencephalography (EEG) in the Brain-Computer Interface (BCI) domain presents significant challenges, particularly the need for extensive datasets. This study aims to generate synthetic EEG samples similar to real samples but distinct by augmenting noise to a conditional denoising diffusion probabilistic model, thus addressing the prevalent issue of data scarcity in EEG research. The proposed method is tested on the DEAP and SADT datasets, showcasing up to 5.6% improvement in classification accuracy when using synthetic data with DEAP and similar positive results with SADT. This is higher compared to the traditional Generative Adversarial Network (GAN) based and Denoising Diffusion Probabilistic Model (DDPM) based approaches. This study further evaluates the effectiveness of state-of-the-art classifiers on EEG data, employing both real and synthetic data with varying noise levels, and utilizes t-SNE and SHAP for detailed analysis and interpretability. The proposed diffusion-based approach for EEG data generation appears promising in refining the accuracy of emotion recognition systems and marks a notable contribution to EEG-based emotion recognition.
comment: 10 Pages, 10 Figures, 4 Tables
The AI Productivity Index (APEX)
We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
A Longitudinal Study on the Attitudes of Gay Men in Beijing Towards Gay Social Media Platforms: Lonely Souls in the Digital Concrete Jungle
Over the past decade, specialized social networking applications have become a cornerstone of life for many gay men in China. This paper employs a longitudinal mixed-methods approach to investigate how Chinese men who have sex with men (MSM) have shifted their attitudes toward these platforms between approximately 2013 and 2023. Drawing on archival analysis of online discourses, a quantitative survey of 412 participants, and in-depth semi-structured interviews with 32 participants, we trace the complex trajectory of this evolution. Our findings reveal a clear pattern: from the initial embrace of these applications as revolutionary tools for community building and identity affirmation (2014-2017), to a period of growing ambivalence and critique centered on commercialization, "hookup culture," and multiple forms of discrimination (2017-2020), and finally to the present era (2020-2023), characterized by pragmatic, fragmented, yet simultaneously critical and reconstructive uses. Today, users strategically employ a repertoire of applications, including global platforms (e.g., Grindr and Tinder), domestic mainstream platforms (e.g., Blued), and niche alternatives (e.g., Aloha), to fulfill differentiated needs. We develop a detailed temporal framework to capture this attitudinal evolution and discuss its design implications for creating more supportive, secure, and community-oriented digital environments for marginalized groups.
ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation
Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, and (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments that leverages the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically constructs hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos are available at https://github.com/scy-v/ReSem3D and https://resem3d.github.io.
comment: 12 pages, 9 figures
Relational Dissonance in Human-AI Interactions: The Case of Knowledge Work
When AI systems allow human-like communication, they elicit increasingly complex relational responses. Knowledge workers face a particular challenge: They approach these systems as tools while interacting with them in ways that resemble human social interaction. To understand the relational contexts that arise when humans engage with anthropomorphic conversational agents, we need to expand existing human-computer interaction frameworks. Through three workshops with qualitative researchers, we found that the fundamental ontological and relational ambiguities inherent in anthropomorphic conversational agents make it difficult for individuals to maintain consistent relational stances toward them. Our findings indicate that people's articulated positioning toward such agents often differs from the relational dynamics that occur during interactions. We propose the concept of relational dissonance to help researchers, designers, and policymakers recognize the resulting tensions in the development, deployment, and governance of anthropomorphic conversational agents and address the need for relational transparency.
comment: 30 pages, 5 figures
Designing Culturally Aligned AI Systems For Social Good in Non-Western Contexts
AI technologies are increasingly deployed in high-stakes domains such as education, healthcare, law, and agriculture to address complex challenges in non-Western contexts. This paper examines eight real-world deployments spanning seven countries and 18 languages, combining 17 interviews with AI developers and domain experts with secondary research. Our findings identify six cross-cutting factors - Language, Institution, Safety, Task, End-User Demography, and Domain - that structured how systems were designed and deployed. These factors were shaped by Sociocultural (diversity, practices), Institutional (resources, policies), and Technological (capabilities, limits) influences. We find that building effective AI systems required extensive collaboration between AI developers and domain experts, with human resources proving more critical to achieving safe and effective outcomes in high-stakes domains than technological expertise alone. Additionally, we present 12 guidelines synthesizing these dynamics for designing AI for social good systems that are culturally grounded, equitable, and responsive to the needs of non-Western contexts.
Computer Vision and Pattern Recognition
The Universal Weight Subspace Hypothesis
We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces capturing the majority of the variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
comment: 37 pages
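To illustrate the kind of analysis the abstract above describes, here is a minimal sketch that estimates a shared low-dimensional subspace from a collection of flattened weight vectors using ordinary PCA via SVD. This is an assumption-laden stand-in, not the paper's mode-wise spectral procedure; the function name `shared_subspace` and the synthetic data are hypothetical.

```python
import numpy as np

def shared_subspace(weights, k):
    """Return the top-k principal directions of a set of weight tensors
    and the fraction of total variance those directions explain."""
    # Flatten each model's weights into one row: (n_models, n_params).
    X = np.stack([np.asarray(w, dtype=float).ravel() for w in weights])
    # Center across models so the SVD measures variance, not the mean.
    X = X - X.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    variance = s ** 2
    ratio = variance[:k].sum() / variance.sum()
    return vt[:k], ratio

# Synthetic demo (hypothetical data): 50 "models" whose weights lie near
# a 3-dimensional subspace of a 64-dimensional parameter space.
rng = np.random.default_rng(0)
basis = rng.standard_normal((3, 64))
coeffs = rng.standard_normal((50, 3))
weights = [c @ basis + 0.01 * rng.standard_normal(64) for c in coeffs]

directions, ratio = shared_subspace(weights, k=3)
print(directions.shape, round(float(ratio), 4))
```

In this toy setting nearly all variance falls in three directions by construction; whether real model collections behave this way is the empirical claim of the paper, not something the sketch demonstrates.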
Light-X: Generative 4D Video Rendering with Camera and Illumination Control
Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
comment: Project Page: https://lightx-ai.github.io/
Value Gradient Guidance for Flow Matching Alignment NeurIPS 2025
While methods exist for aligning flow matching models (a popular and effective class of generative models) with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
comment: Accepted at NeurIPS 2025; 26 pages, 20 figures
Deep infant brain segmentation from multi-contrast MRI
Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to development and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile for more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.
comment: 8 pages, 8 figures, 1 table, website at https://w3id.org/babyseg, presented at the 2025 IEEE Asilomar Conference on Signals, Systems, and Computers
Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting WACV 2025
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/
comment: WACV 2025. Project page: https://chien90190.github.io/splannequin/
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
comment: Project Page: https://github.com/CaraJ7/DraCo
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!
comment: Project page: https://red-fairy.github.io/ShadowDraw/
NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/.
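The idea of keeping the input's Fourier phase while randomizing magnitude can be sketched in a few lines of NumPy. The sketch below is only an illustration under stated assumptions: it borrows magnitudes from a Gaussian noise sample and combines them with the image's phase. The function name `phase_preserving_noise` is hypothetical, and the paper's actual noise schedule and its FSS frequency cutoff are not modeled here.

```python
import numpy as np

def phase_preserving_noise(image, rng):
    """Build a noise-like field that has the Fourier phase of `image`
    and the Fourier magnitude of a Gaussian noise sample."""
    phase = np.angle(np.fft.fft2(image))
    noise = rng.standard_normal(image.shape)
    noise_magnitude = np.abs(np.fft.fft2(noise))
    spectrum = noise_magnitude * np.exp(1j * phase)
    # The spectrum is Hermitian-symmetric for a real image, so the
    # inverse transform is real up to floating-point error.
    return np.real(np.fft.ifft2(spectrum))

# Demo on a hypothetical 16x16 single-channel "image".
rng = np.random.default_rng(0)
image = rng.random((16, 16))
out = phase_preserving_noise(image, rng)

# Compare phases through unit phasors to avoid wrap-around at +/- pi.
phasor_in = np.exp(1j * np.angle(np.fft.fft2(image)))
phasor_out = np.exp(1j * np.angle(np.fft.fft2(out)))
print(np.allclose(phasor_in, phasor_out, atol=1e-6))
```

Because spatial layout is carried largely by phase, a field built this way keeps edges and object positions aligned with the input even though its magnitudes are random.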
EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation
All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
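As a rough illustration of splitting a feature map into low- and high-frequency branches and re-weighting them, the following sketch uses a radial low-pass mask in the Fourier domain. It is an assumption for illustration only: the abstract does not say how FMM performs the decomposition, and `split_frequencies`, `modulate`, the cutoff, and the gains are all hypothetical.

```python
import numpy as np

def split_frequencies(feature, cutoff=0.25):
    """Split a 2D feature map into low- and high-frequency parts using
    a radial low-pass mask (cutoff in cycles per sample)."""
    fy = np.fft.fftfreq(feature.shape[0])[:, None]
    fx = np.fft.fftfreq(feature.shape[1])[None, :]
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    low = np.real(np.fft.ifft2(np.fft.fft2(feature) * mask))
    high = feature - low  # the residual carries the fine detail
    return low, high

def modulate(feature, gain_low=1.0, gain_high=1.0, cutoff=0.25):
    """Recombine the two branches with separate (hypothetical) gains."""
    low, high = split_frequencies(feature, cutoff)
    return gain_low * low + gain_high * high

rng = np.random.default_rng(0)
feature = rng.random((32, 32))
low, high = split_frequencies(feature)
sharpened = modulate(feature, gain_low=1.0, gain_high=1.5)
```

With both gains set to 1 the input is recovered unchanged, which makes the split easy to sanity-check before any learned modulation is applied.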
TV2TV: A Unified Framework for Interleaved Language and Video Generation
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards
In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
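Best-of-N selection with a fused multi-dimensional reward can be sketched as follows. The four dimension names come from the abstract; the equal-weight average used to fuse them, the function names, and the candidate scores are assumptions for illustration and do not reflect how SA-IQA actually combines its dimensions.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")

DIMENSIONS = ("layout", "harmony", "lighting", "distortion")

def fused_reward(scores: Dict[str, float]) -> float:
    """Fuse per-dimension scores into one scalar (assumed: plain mean)."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

def best_of_n(candidates: Sequence[T], reward: Callable[[T], float]) -> T:
    """Return the candidate with the highest reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)

# Hypothetical per-dimension scores for three generated interior images.
candidates = [
    {"name": "a", "layout": 0.9, "harmony": 0.4, "lighting": 0.5, "distortion": 0.6},
    {"name": "b", "layout": 0.7, "harmony": 0.8, "lighting": 0.7, "distortion": 0.8},
    {"name": "c", "layout": 0.2, "harmony": 0.9, "lighting": 0.9, "distortion": 0.3},
]

best = best_of_n(candidates, fused_reward)
print(best["name"])
```

Here candidate "b" wins because it is consistently good across all four dimensions, even though "a" and "c" each score higher on a single one.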
From Generated Human Videos to Physically Plausible Robot Trajectories
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
comment: For project website, see https://genmimic.github.io
Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
comment: Technical Report; Project Page: https://harboryuan.github.io/visual-reasoning-tracer
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms. Specifically, (1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts; (2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g., 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressive streaming long-video generation.
comment: Project Page: https://cvlab-kaist.github.io/DeepForcing/
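The importance-aware KV cache pruning described in the Deep Forcing abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that a cached token's importance is the total attention it received from a few recent queries, and that a fixed number of leading sink tokens is always kept. The function name, the scoring rule, and the `budget` parameter are illustrative assumptions, not the authors' implementation; the RoPE re-alignment of Deep Sink is not shown.

```python
def prune_kv_cache(attn, num_sink, budget):
    """Return the sorted indices of cached tokens to keep.

    attn: list of rows, one per recent query; each row holds attention
          weights over all cached tokens.
    num_sink: number of leading tokens that are always kept (persistent sinks).
    budget: total number of tokens to keep, sinks included
            (assumed to be at least num_sink).
    """
    num_tokens = len(attn[0])
    if budget >= num_tokens:
        return list(range(num_tokens))
    # Assumed importance score: total attention received from recent queries.
    importance = [sum(row[j] for row in attn) for j in range(num_tokens)]
    sinks = list(range(min(num_sink, num_tokens)))
    candidates = range(len(sinks), num_tokens)
    n_extra = max(budget - len(sinks), 0)
    # Keep the most-attended non-sink tokens, then restore temporal order.
    top = sorted(candidates, key=lambda j: importance[j], reverse=True)[:n_extra]
    return sorted(sinks + top)
```

Returning indices rather than pruned tensors keeps the sketch independent of any particular cache layout; a real cache would gather keys and values with these indices.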
Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints
Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.
comment: Project page: https://contactgen3d.github.io/
BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/
comment: Project Page: https://19reborn.github.io/Bullet4D/
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT
comment: Code: https://github.com/hustvl/4DLangVGGT, Webpage: https://hustvl.github.io/4DLangVGGT
Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image
Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.
comment: 18 Pages
Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding CVPR-2025
Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task that arises primarily in photo restoration. Despite recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper, we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information at the semantic level and then refines the texture, so that the facial structure is established before detailed images are generated. In the first stage, we combine local features from CNNs with global features from Vision Transformers to produce clear and detailed semantic layouts. In the second stage, a Multi-Modal Texture Generator refines these layouts by drawing on information from multiple scales, keeping the result cohesive and consistent. The architecture handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods on metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting scenarios.
comment: Submitted for review CVPR-2025
RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.
HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
Handwritten Text Recognition remains challenging due to limited data, high writing style variance, and scripts with complex diacritics. Although existing approaches partially address these issues, they often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
Generative Neural Video Compression via Video Diffusion Prior
We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
Self-Supervised Learning for Transparent Object Depth Completion Using Depth from Non-Transparent Objects
The perception of transparent objects is one of the well-known challenges in computer vision. Conventional depth sensors have difficulty sensing the depth of transparent objects due to refraction and reflection of light. Previous research has typically trained a neural network to complete the depth acquired by the sensor, an approach that can quickly produce accurate depth maps of transparent objects. However, such training relies on a large amount of annotated data for supervision, and the labeling of depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and utilizes the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approaches, and pre-training with our method can improve model performance when training samples are scarce.
comment: conference
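The self-supervision idea in the abstract above, erasing depth in non-transparent regions and using the original map as the target, can be sketched as follows. This is a minimal illustration under stated assumptions: holes are axis-aligned rectangles, missing depth is encoded as `0.0`, and the loss is a masked L1. The function names and these choices are hypothetical; the paper's own simulation of transparent-object deficits may differ.

```python
def simulate_depth_deficit(depth, box, missing_value=0.0):
    """Blank out a rectangle of a depth map and report where supervision applies.

    depth: 2D list of floats; missing_value marks pixels the sensor failed on.
    box: (top, left, height, width) of the region to erase.
    Returns (corrupted_depth, supervision_mask). The mask is True only where a
    valid original depth was erased, so the original map can act as the target.
    """
    top, left, h, w = box
    corrupted = [row[:] for row in depth]
    mask = [[False] * len(row) for row in depth]
    for r in range(top, min(top + h, len(depth))):
        for c in range(left, min(left + w, len(depth[r]))):
            if depth[r][c] != missing_value:
                corrupted[r][c] = missing_value
                mask[r][c] = True
    return corrupted, mask


def masked_l1(pred, target, mask):
    """Mean absolute error over supervised pixels only."""
    errs = [abs(p - t)
            for pred_row, target_row, mask_row in zip(pred, target, mask)
            for p, t, m in zip(pred_row, target_row, mask_row) if m]
    return sum(errs) / len(errs) if errs else 0.0
```

A completion network would take `corrupted` as input and be trained to minimize `masked_l1` against the original `depth`, so no manual depth labels are required.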
Reflection Removal through Efficient Adaptation of Diffusion Transformers
We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
A dynamic memory assignment strategy for dilation-based ICP algorithm on embedded GPUs
This paper proposes a memory-efficient optimization strategy for the high-performance point cloud registration algorithm VANICP, enabling lightweight execution on embedded GPUs with constrained hardware resources. VANICP is a recently published acceleration framework that significantly improves the computational efficiency of point-cloud-based applications. By transforming the global nearest neighbor search into a localized process through a dilation-based information propagation mechanism, VANICP greatly reduces the computational complexity of the NNS. However, its original implementation demands a considerable amount of memory, which restricts its deployment in resource-constrained environments such as embedded systems. To address this issue, we propose a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation. Furthermore, based on this strategy, we construct an enhanced version of the VANICP framework that achieves over 97% reduction in memory consumption while preserving the original performance. Source code is published on: https://github.com/changqiong/VANICP4Em.git.
Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models
Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
comment: Project page: https://fairpro-t2i.github.io
Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks
We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.
comment: UniReps Workshop 2025, 12 pages, 8 figures
Rethinking the Use of Vision Transformers for AI-Generated Image Detection
Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
comment: Code: https://github.com/nahyeonkaty/mold
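The abstract above describes integrating features from multiple ViT layers with a gating-based mechanism. A generic version of such a gate can be sketched as below, assuming one feature vector per layer and a softmax gate whose logits come from a dot product with per-layer weights. The weights, names, and gate form are hypothetical placeholders for illustration, not the MoLD architecture itself.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(layer_feats, gate_weights):
    """Fuse per-layer feature vectors with input-dependent softmax gates.

    layer_feats: list of L feature vectors (one per layer), each of dim D.
    gate_weights: list of L weight vectors of dim D (hypothetical learned
                  parameters; here they are simply passed in).
    Returns (fused_vector, gates).
    """
    logits = [sum(f * w for f, w in zip(feat, gw))
              for feat, gw in zip(layer_feats, gate_weights)]
    gates = softmax(logits)
    dim = len(layer_feats[0])
    fused = [sum(g * feat[i] for g, feat in zip(gates, layer_feats))
             for i in range(dim)]
    return fused, gates
```

Because the gates depend on the input features, different images can draw on different layers, which is the behavior the abstract attributes to dynamic integration.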
Balanced Few-Shot Episodic Learning for Accurate Retinal Disease Diagnosis
Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study, we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class diversity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability. Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
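The balanced episodic sampling and cosine-similarity prototype classification described above can be sketched in a few lines. The sketch assumes embeddings are already available as plain lists of floats (the encoder and augmentation steps are omitted), and all function names and the `n_query` parameter are illustrative rather than taken from the paper.

```python
import math
import random


def sample_balanced_episode(data_by_class, n_way=5, k_shot=5, n_query=2, rng=None):
    """Draw one episode with the same number of support/query items per class."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        items = rng.sample(data_by_class[c], k_shot + n_query)
        support[c] = items[:k_shot]
        query[c] = items[k_shot:]
    return support, query


def prototype(vectors):
    """Mean embedding of a class's support vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def classify(embedding, prototypes):
    """Assign the class whose prototype is most cosine-similar."""
    return max(prototypes, key=lambda c: cosine(embedding, prototypes[c]))
```

Sampling exactly `k_shot` items from every selected class is what keeps majority classes from dominating an episode, regardless of how many images each class has in the full dataset.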
GeoPE: A Unified Geometric Positional Embedding for Structured Tensors
Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition
This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging
3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive on long sequences, limiting its application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting WACV 2026
The Herculaneum Papyri are a collection of rolled papyrus documents that were charred and buried by the famous eruption of Mount Vesuvius. They promise to contain a wealth of previously unseen Greek and Latin texts, but are extremely fragile and thus most cannot be unrolled physically. A solution to access these texts is virtual unrolling, where the papyrus surface is digitally traced out in a CT scan of the scroll, to create a flattened representation. This tracing is very laborious to do manually in gigavoxel-sized scans, so automated approaches are desirable. We present the first top-down method that automatically fits a surface model to a CT scan of a severely damaged scroll. We take a novel approach that globally fits an explicit parametric model of the deformed scroll to existing neural network predictions of where the rolled papyrus likely passes. Our method guarantees the resulting surface is a single continuous 2D sheet, even passing through regions where the surface is not detectable in the CT scan. We conduct comprehensive experiments on high-resolution CT scans of two scrolls, showing that our approach successfully unrolls large regions, and exceeds the performance of the only existing automated unrolling method suitable for this data.
comment: Accepted at WACV 2026
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
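The asynchronous denoising idea above, in which semantics precede textures by a temporal offset, can be illustrated with a small sketch. It assumes a convention where t = 0 is pure noise and t = 1 is clean, and a single global progress variable from which both latent times are derived; the function name, the linear mapping, and the `offset` value are illustrative assumptions, not the schedule used in SFD.

```python
def async_timesteps(s, offset):
    """Map a global progress value s in [0, 1 + offset] to two denoising times.

    Convention (an assumption): t = 0 is pure noise, t = 1 is clean.
    The semantic latent leads; the texture latent lags by `offset`.
    Returns (t_semantic, t_texture), each clipped to [0, 1].
    """
    def clip(x):
        return min(max(x, 0.0), 1.0)

    return clip(s), clip(s - offset)


# Example: walk the global progress and observe that semantics stay ahead.
schedule = [async_timesteps(0.1 * k, offset=0.3) for k in range(14)]
```

Under this mapping the semantic latent is always at least as clean as the texture latent, so texture denoising is conditioned on a clearer semantic signal at every step.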
ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching
Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies between training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates for them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.
Equivariant Symmetry-Aware Head Pose Estimation for Fetal MRI
We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.
You Only Train Once (YOTO): A Retraining-Free Object Detection Framework
Object detection is a primary task in computer vision and is used in numerous domains. Nonetheless, object detection continues to suffer from catastrophic forgetting. The model must be retrained whenever new products are introduced, using not only the new-product dataset but also the entirety of the previous dataset. The outcome is increased training cost and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeiT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework's feasibility for practical use.
comment: under review in the Elsevier Engineering Journal
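The classification step described in the YOTO abstract (cosine similarity between a query embedding and stored product embeddings) can be sketched as follows. This is a minimal illustration under stated assumptions: a plain dictionary stands in for the vector database, the function names are hypothetical, and the embeddings are toy values rather than DeiT features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def classify(query, gallery):
    """Return (label, score) of the gallery embedding closest to the query.

    gallery is a dict {label: embedding}; it is a stand-in for the
    vector database and is an assumption of this sketch.
    """
    if not gallery:
        raise ValueError("gallery must not be empty")
    scored = ((label, cosine_similarity(query, emb))
              for label, emb in gallery.items())
    return max(scored, key=lambda pair: pair[1])
```

Under this design, registering a new product only adds an entry to the gallery, for example `gallery["water"] = [1.0, 1.0]`, which is why no detector retraining is needed in the sketch.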
SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms
Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track
comment: https://github.com/Jeffry-wen/SDG-Track
SP-Det: Self-Prompted Dual-Text Fusion for Generalized Multi-Label Lesion Detection
Automated lesion detection in chest X-rays has demonstrated significant potential for improving clinical diagnosis by precisely localizing pathological abnormalities. While recent promptable detection frameworks have achieved remarkable accuracy in target localization, existing methods typically rely on manual annotations as prompts, which are labor-intensive and impractical for clinical applications. To address this limitation, we propose SP-Det, a novel self-prompted detection framework that automatically generates rich textual context to guide multi-label lesion detection without requiring expert annotations. Specifically, we introduce an expert-free dual-text prompt generator (DTPG) that leverages two complementary textual modalities: semantic context prompts that capture global pathological patterns and disease beacon prompts that focus on disease-specific manifestations. Moreover, we devise a bidirectional feature enhancer (BFE) that synergistically integrates comprehensive diagnostic context with disease-specific embeddings to significantly improve feature representation and detection accuracy. Extensive experiments on two chest X-ray datasets with diverse thoracic disease categories demonstrate that our SP-Det framework outperforms state-of-the-art detection methods while completely eliminating the dependency on expert-annotated prompts compared to existing promptable architectures.
Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing ICCV 2025
Capturing accurate 3D human pose in the wild would provide valuable data for training pose estimation and motion generation methods. While video-based estimation approaches have become increasingly accurate, they often fail in common scenarios involving self-contact, such as a hand touching the face. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel framework that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization during measured self-contact: reprojection error and deviations from the input estimate are minimized while enforcing vertex proximity constraints. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture. Testing with three input pose estimators, we demonstrate an average of 11.7% improvement in reconstruction accuracy. We also present a miniature wearable bioimpedance sensor that enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation using BioTUCH. Code and data are available at biotuch.is.tue.mpg.de
comment: * Equal contribution. Minor figure corrections compared to the ICCV 2025 version
Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens
Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
Existing methods for deepfake detection aim to develop generalizable detectors. Although generalization is the ultimate goal, with limited training forgeries and domains it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications. However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, although detectors can separate real and fake reasonably well within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC). In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments for domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a model-agnostic framework termed DevDet (Developer for Detector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.
Tokenizing Buildings: A Transformer for Layout Synthesis
We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.
comment: 8 pages, 1 page References, 4 figures
FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis
Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.
comment: Novel View Synthesis, Driving Scene, Free Trajectory, Image Generation
LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation
Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.
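The uncertainty estimate mentioned in the LatentFM abstract rests on pixel-wise statistics over several sampled segmentations. A minimal sketch of that aggregation step, assuming the sampled masks are already decoded to 2D grids with values in [0, 1] (the function name and the use of population variance are choices made here, not details from the paper):

```python
def pixelwise_mean_and_variance(samples):
    """Aggregate sampled masks into per-pixel mean and variance maps.

    samples: non-empty list of equally sized 2D masks (lists of rows).
    Returns (mean_map, variance_map); the variance is the population
    variance across samples, so identical samples give zero everywhere.
    """
    count = len(samples)
    if count == 0:
        raise ValueError("at least one sample is required")
    height, width = len(samples[0]), len(samples[0][0])
    mean_map = [[sum(s[r][c] for s in samples) / count
                 for c in range(width)] for r in range(height)]
    variance_map = [[sum((s[r][c] - mean_map[r][c]) ** 2 for s in samples) / count
                     for c in range(width)] for r in range(height)]
    return mean_map, variance_map
```

Pixels where the samples disagree receive high variance, which is one plausible basis for a confidence map; the abstract does not specify how the authors derive theirs.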
RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS
3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling in-the-wild scenes affected by transient objects and illuminations, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances and illumination variations. To address this, we propose RobustSplat++, a robust solution based on several critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Third, we incorporate the delayed Gaussian growth strategy and mask bootstrapping with appearance modeling to handle in-the-wild scenes including transients and illuminations. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method.
comment: arXiv admin note: substantial text overlap with arXiv:2506.02751
Shared Multi-modal Embedding Space for Face-Voice Association ICASSP
The FAME 2026 challenge comprises two demanding tasks: training face-voice associations combined with a multilingual setting that includes testing on languages on which the model was not trained. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting single-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
comment: Ranked 1st in Fame 2026 Challenge, ICASSP
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
comment: Project Page: https://emma-umm.github.io/emma/
LaFiTe: A Generative Latent Field for 3D Native Texturing
Generating high-fidelity, seamless textures directly on 3D surfaces, what we term 3D-native texturing, remains a fundamental open challenge, with the potential to overcome long-standing limitations of UV-based and multi-view projection methods. However, existing native approaches are constrained by the absence of a powerful and versatile latent representation, which severely limits the fidelity and generality of their generated textures. We identify this representation gap as the principal barrier to further progress. We introduce LaFiTe, a framework that addresses this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, LaFiTe employs a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space, which is subsequently decoded into a continuous color field. This representation achieves unprecedented fidelity, exceeding state-of-the-art methods by >10 dB PSNR in reconstruction, by effectively disentangling texture appearance from mesh topology and UV parameterization. Building upon this strong representation, a conditional rectified-flow model synthesizes high-quality, coherent textures across diverse styles and geometries. Extensive experiments demonstrate that LaFiTe not only sets a new benchmark for 3D-native texturing but also enables flexible downstream applications such as material synthesis and texture super-resolution, paving the way for the next generation of 3D content creation workflows.
comment: Project page: https://vast-ai-research.github.io/LaFiTe/
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) into memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for a specific memory operation: knowledge extraction, memory update, or memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10x larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60x larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
Order Matters: 3D Shape Generation from Sequential VR Sketches
VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at https://chenyizi086.github.io/VRSketch2Shape_website.
MT-Depth: Multi-task Instance feature analysis for the Depth Completion
Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower RMSE compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive MAE. Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild
Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been released.
Hardware-aware Neural Architecture Search of Early Exiting Networks on Edge Accelerators
Advancements in high-performance computing and cloud technologies have enabled the development of increasingly sophisticated Deep Learning (DL) models. However, the growing demand for embedded intelligence at the edge imposes stringent computational and energy constraints, challenging the deployment of these large-scale models. Early Exiting Neural Networks (EENN) have emerged as a promising solution, allowing dynamic termination of inference based on input complexity to enhance efficiency. Despite their potential, EENN performance is highly influenced by the heterogeneity of edge accelerators and the constraints imposed by quantization, affecting accuracy, energy efficiency, and latency. Yet, research on the automatic optimization of EENN design for edge hardware remains limited. To bridge this gap, we propose a hardware-aware Neural Architecture Search (NAS) framework that systematically integrates the effects of quantization and hardware resource allocation to optimize the placement of early exit points within a network backbone. Experimental results on the CIFAR-10 dataset demonstrate that our NAS framework can discover architectures that achieve over a 50% reduction in computational costs compared to conventional static networks, making them more suitable for deployment in resource-constrained edge environments.
comment: Submitted to IEEE Transactions on Emerging Topics in Computing
OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution
Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at https://github.com/chaixinning/OmniScaleSR.
comment: Accepted by TCSVT, 15 pages
Towards Cross-View Point Correspondence in Vision-Language Models
Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
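The EMA-Sink update described above can be illustrated with a minimal sketch. It assumes each frame is a single token vector, that the sink is initialized from the first frame, and that evicted tokens are fused with a fixed decay; the function names, the `decay` value, and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
from collections import deque


def ema_update(sink, evicted, decay):
    """Fuse an evicted token into the sink via an exponential moving average."""
    return [decay * s + (1.0 - decay) * e for s, e in zip(sink, evicted)]


def stream_with_ema_sink(frames, window_size, decay=0.9):
    """Run a sliding window over `frames` (lists of floats).

    Hypothetical helper: returns the final sink token and the tokens
    still inside the window. The sink keeps a fixed size throughout.
    """
    window = deque()
    sink = None
    for token in frames:
        if sink is None:
            sink = list(token)  # assumption: sink starts as the initial frame
        window.append(token)
        if len(window) > window_size:
            evicted = window.popleft()  # token leaving the sliding window
            sink = ema_update(sink, evicted, decay)
    return sink, list(window)
```

With `decay` close to 1 the sink stays near the initial frame; lower values let recently evicted tokens dominate, which is the trade-off between long-term context and recent dynamics that the abstract mentions.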
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective
In the history of knowledge distillation, the focus once shifted from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. As a response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model's predictive distribution and its impact on the gradients of GDKD loss, uncovering two critical insights often overlooked: (1) the partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Utilizing these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models' predictive distribution. Our comprehensive experiments conducted on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
comment: Accepted to IEEE TNNLS
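For orientation, the decoupling that DKD builds on can be sketched as follows. This is the original DKD-style split of the KL divergence into a binary term for one selected class and a term over the remaining classes; it is not the GDKD loss itself, whose exact form the abstract does not give. The `index` argument (the ground-truth class in DKD; the abstract suggests partitioning by the top logit instead), the weights `alpha` and `beta`, and the omission of the usual temperature-squared scaling are all assumptions of this sketch.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decoupled_kd(teacher_logits, student_logits, index,
                 alpha=1.0, beta=1.0, temperature=1.0):
    """Return (weighted loss, selected-class term, remaining-class term)."""
    pt = softmax(teacher_logits, temperature)
    ps = softmax(student_logits, temperature)
    # Binary distributions: selected class vs. everything else.
    bt = [pt[index], 1.0 - pt[index]]
    bs = [ps[index], 1.0 - ps[index]]
    # Distributions over the remaining classes, renormalized to sum to 1.
    rest_t = [p / bt[1] for i, p in enumerate(pt) if i != index]
    rest_s = [p / bs[1] for i, p in enumerate(ps) if i != index]
    selected_term = kl(bt, bs)
    remaining_term = kl(rest_t, rest_s)
    return alpha * selected_term + beta * remaining_term, selected_term, remaining_term
```

In the classical loss the remaining-class term is implicitly weighted by the teacher's probability mass outside the selected class, so it shrinks when the teacher is confident; exposing `beta` as a free weight is what allows that term to be amplified.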
Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence
In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
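The localization step mentioned above (soft-argmax followed by a forward-backward consistency check) can be sketched in a few lines. The score map here stands in for a feature-similarity map between a query point and a target frame; the function names, the `temperature` parameter, and the pixel tolerance `tol` are assumptions for illustration rather than details taken from HeFT.

```python
import math


def soft_argmax_2d(score, temperature=1.0):
    """Expected (row, col) position under a softmax over a 2D score map."""
    width = len(score[0])
    flat = [v / temperature for row in score for v in row]
    m = max(flat)
    weights = [math.exp(v - m) for v in flat]
    total = sum(weights)
    row = sum(w * (i // width) for i, w in enumerate(weights)) / total
    col = sum(w * (i % width) for i, w in enumerate(weights)) / total
    return row, col


def is_cycle_consistent(query, back_tracked, tol=1.0):
    """Accept a match only if tracking forward then backward returns
    within `tol` pixels of the original query point."""
    return math.dist(query, back_tracked) <= tol
```

A lower `temperature` makes the estimate behave like a hard argmax, while a higher one averages over more of the map; matches that fail the cycle check would typically be treated as occluded or unreliable.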
Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot
Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects whether an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks - all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly-annotated 790-image dataset spanning drug, sexual, violent and extremist content, our method attains 85.8% element-level recall, 78.1% precision and a 92.1% segment-success rate - exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and VLM, our method's precision and recall decreased by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
When Robots Should Say "I Don't Know": Benchmarking Abstention in Embodied Question Answering
Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
SAM3-I: Segment Anything with Instructions
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
comment: Preliminary results; work in progress
Infrared UAV Target Tracking with Dynamic Feature Refinement and Global Contextual Attention Knowledge Distillation
Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging has been one of the most important sensing technologies in anti-UAV applications. However, the infrared UAV targets often exhibit weak features and complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. The SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions for both template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features, utilizing spatial attention guidance within the search frame. The DCFAM effectively integrates the mixed template generated from STEN in the template branch and original template, avoiding excessive background interference with the template and thereby enhancing the emphasis on UAV target region features within the search frame. Furthermore, to enhance the feature extraction capabilities of the network for IRUT without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller. It transfers the target prior from the teacher network to the student model, significantly improving the student network's focus on informative regions at each hierarchical level of the backbone network. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art target trackers under complex backgrounds while achieving a real-time tracking speed.
comment: Accepted by IEEE TMM
TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification
Tumor segmentation and diagnosis in contrast-enhanced Computed Tomography (CT) rely heavily on the physiological dynamics of contrast agents. However, obtaining a complete multi-phase series is often clinically unfeasible due to radiation concerns or scanning limitations, leading to the "missing modality" problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. TARDis explicitly disentangles the latent feature space into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to hallucinate missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks. Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.
Prompt2Craft: Generating Functional Craft Assemblies with LLMs
Inspired by traditional handmade crafts, where a person improvises assemblies based on the available objects, we formally introduce the Craft Assembly Task. It is a robotic assembly task that involves building an accurate representation of a given target object using the available objects, which do not directly correspond to its parts. In this work, we focus on selecting the subset of available objects for the final craft, when the given input is an RGB image of the target in the wild. We use a mask segmentation neural network to identify visible parts, followed by retrieving labeled template meshes. These meshes undergo pose optimization to determine the most suitable template. Then, we propose to simplify the parts of the transformed template mesh to primitive shapes like cuboids or cylinders. Finally, we design a search algorithm to find correspondences in the scene based on local and global proportions. We develop baselines for comparison that consider all possible combinations, and choose the highest scoring combination for common metrics used in foreground maps and mask accuracy. Our approach achieves comparable results to the baselines for two different scenes, and we show qualitative results for an implementation in a real-world scenario.
Dataset creation for supervised deep learning-based analysis of microscopic images - review of important considerations and recommendations
Supervised deep learning (DL) has attracted great interest for automated analysis of microscopic images, with an increasing body of literature supporting its potential. The development and validation of those DL models rely heavily on the availability of high-quality, large-scale datasets. However, creating such datasets is a complex and resource-intensive process, often hindered by challenges such as time constraints, domain variability, and risks of bias in image collection and label creation. This review provides a comprehensive guide to the critical steps in dataset creation, including: 1) image acquisition, 2) selection of annotation software, and 3) annotation creation. In addition to ensuring a sufficiently large number of images, it is crucial to address sources of image variability (domain shifts) - such as those related to slide preparation and digitization - that could lead to algorithmic errors if not adequately represented in the training data. Key quality criteria for annotations are the three "C"s: correctness, completeness, and consistency. This review explores methods to enhance annotation quality through the use of advanced techniques that mitigate the limitations of single annotators. To support dataset creators, a standard operating procedure (SOP) is provided as supplemental material, outlining best practices for dataset development. Furthermore, the article underscores the importance of open datasets in driving innovation and enhancing reproducibility of DL research. By addressing the challenges and offering practical recommendations, this review aims to advance the creation and availability of high-quality, large-scale datasets, ultimately contributing to the development of generalizable and robust DL models for pathology applications.
COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
Efficient Spatially-Variant Convolution via Differentiable Sparse Kernel Complex
Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining and additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.
comment: 10 pages, 7 figures
Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering
Document Visual Question Answering (DocVQA) enables end-to-end reasoning grounded on information present in a document input. While recent models have shown impressive capabilities, they remain vulnerable to adversarial attacks. In this work, we introduce a novel attack scenario that aims to forge document content in a visually imperceptible yet semantically targeted manner, allowing an adversary to induce specific or generally incorrect answers from a DocVQA model. We develop specialized attack algorithms that can produce adversarially forged documents tailored to different attackers' goals, ranging from targeted misinformation to systematic model failure scenarios. We demonstrate the effectiveness of our approach against two end-to-end state-of-the-art models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. Our findings highlight critical vulnerabilities in current DocVQA systems and call for the development of more robust defenses.
Gaussian Entropy Fields: Driving Adaptive Sparsity in 3D Gaussian Optimization
3D Gaussian Splatting (3DGS) has emerged as a leading technique for novel view synthesis, demonstrating exceptional rendering efficiency. Well-reconstructed surfaces can be characterized by low configurational entropy, where dominant primitives clearly define surface geometry while redundant components are suppressed. Three complementary technical contributions are introduced: (1) entropy-driven surface modeling via entropy minimization for low configurational entropy in primitive distributions; (2) adaptive spatial regularization using the Surface Neighborhood Redundancy Index (SNRI) and image entropy-guided weighting; (3) multi-scale geometric preservation through competitive cross-scale entropy alignment. Extensive experiments demonstrate that GEF achieves competitive geometric precision on DTU and T&T benchmarks, while delivering superior rendering quality compared to existing methods on Mip-NeRF 360. Notably, superior Chamfer Distance (0.64) on DTU and F1 score (0.44) on T&T are obtained, alongside the best SSIM (0.855) and LPIPS (0.136) among baselines on Mip-NeRF 360, validating the framework's ability to enhance surface reconstruction accuracy without compromising photometric fidelity.
comment: 28 pages, 11 figures
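To make the notion of low configurational entropy concrete, the sketch below computes the Shannon entropy of normalized primitive weights along one ray. The abstract does not give the paper's exact definition, so treating per-ray blending weights as the distribution is an assumption, and the function name is hypothetical.

```python
import math


def configurational_entropy(weights):
    """Shannon entropy (in nats) of non-negative weights after normalization.

    Assumption: `weights` are the blending contributions of the Gaussians
    intersected by a single ray. One dominant primitive gives entropy near 0;
    many equally weighted primitives give entropy near log(len(weights)).
    """
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Under this reading, minimizing the entropy pushes each ray toward a single dominant primitive, which matches the abstract's description of suppressing redundant components.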
VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
Ultra-long video understanding remains an open challenge, as existing vision-language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that models long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules. Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms onto egocentric videos and cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model
Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
Refaçade: Editing Object with Given Reference Texture
Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at https://github.com/fishZe233/Refacade.
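The jigsaw permutation in the second design can be sketched as a patch shuffle. The code below works on a 2D grid of pixel values, assumes square patches whose size divides both image dimensions, and uses a seeded shuffle; the function name, the patch size, and these constraints are illustrative assumptions rather than details from the paper.

```python
import random


def jigsaw_permute(image, patch, seed=0):
    """Shuffle non-overlapping `patch` x `patch` tiles of a 2D image.

    Local content inside each tile is untouched, while the global
    arrangement of tiles is randomized.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    # Cut the image into tiles, stored in row-major order.
    tiles = [
        [image[r * patch + i][c * patch:(c + 1) * patch] for i in range(patch)]
        for r in range(rows)
        for c in range(cols)
    ]
    order = list(range(len(tiles)))
    random.Random(seed).shuffle(order)
    # Paste the tiles back in shuffled order.
    out = [[None] * width for _ in range(height)]
    for dst, src in enumerate(order):
        r, c = divmod(dst, cols)
        for i in range(patch):
            out[r * patch + i][c * patch:(c + 1) * patch] = tiles[src][i]
    return out
```

Smaller patches destroy more of the reference layout but also leave less texture context per tile, so the patch size acts as the knob between the two.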
PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM uses self-supervision to model the continuous evolution of object motion. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
Auto3R: Automated 3D Reconstruction and Scanning via Data-driven Uncertainty Quantification
Traditional high-quality 3D scanning and reconstruction typically relies on human labor to plan the scanning procedure. With the rapid development of embodied systems such as drones and robots, there is a growing demand for performing accurate 3D scanning and reconstruction in a fully automated manner. We introduce Auto3R, a data-driven uncertainty quantification model that is designed to automate the 3D scanning and reconstruction of scenes and objects, including objects with non-Lambertian and specular materials. Specifically, in a process of iterative 3D reconstruction and scanning, Auto3R can efficiently and accurately predict the uncertainty distribution over potential scanning viewpoints, without knowing the ground truth geometry and appearance. In extensive experiments, Auto3R outperforms the state-of-the-art methods by a large margin. We also deploy Auto3R on a robot arm equipped with a camera and demonstrate that Auto3R can be used to effectively digitize real-world 3D objects and deliver ready-to-use, photorealistic digital assets. Our homepage: https://tomatoma00.github.io/auto3r.github.io.
Identity Clue Refinement and Enhancement for Visible-Infrared Person Re-Identification
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging cross-modal matching task due to significant modality discrepancies. While current methods mainly focus on learning modality-invariant features through unified embedding spaces, they often attend solely to the common discriminative semantics across modalities while disregarding the critical role of modality-specific identity-aware knowledge in discriminative feature learning. To bridge this gap, we propose a novel Identity Clue Refinement and Enhancement (ICRE) network to mine and utilize the implicit discriminative knowledge inherent in modality-specific attributes. First, we design a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from shared branches, aiming to capture modality-specific attributes that are easily overlooked. Then, we propose a Semantic Distillation Cascade Enhancement (SDCE) module, which distills identity-aware knowledge from the aggregated shallow features and guides the learning of modality-invariant features. Finally, an Identity Clues Guided (ICG) Loss is proposed to alleviate the modality discrepancies within the enhanced features and promote the learning of a diverse representation space. Extensive experiments across multiple public datasets clearly show that our proposed ICRE outperforms existing SOTA methods.
comment: 14 pages, 7 figures
WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism
While fulfilling communication tasks, wireless signals can also be used to sense the environment. Among various types of sensing media, WiFi signals offer advantages such as widespread availability, low hardware cost, and strong robustness to environmental conditions like light, temperature, and humidity. By analyzing WiFi signals in the environment, it is possible to capture dynamic changes of the human body and accomplish sensing applications such as gesture recognition. However, many existing gesture sensing solutions perform well in-domain but lack cross-domain capabilities (i.e., recognition performance in untrained environments). To address this, we extract Doppler spectra from the channel state information (CSI) received by all receivers and concatenate each Doppler spectrum along the same time axis to generate fused images with multi-angle information as input features. Furthermore, inspired by the convolutional block attention module (CBAM), we propose a gesture recognition network that integrates a multi-semantic spatial attention mechanism with a self-attention-based channel mechanism. This network constructs attention maps to quantify the spatiotemporal features of gestures in images, enabling the extraction of key domain-independent features. Additionally, ResNet18 is employed as the backbone network to further capture deep-level features. To validate the network performance, we evaluate the proposed network on the public Widar3 dataset, and the results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%, significantly outperforming existing best solutions.
Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation
Due to the scarcity of annotated data and the substantial computational costs of models, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4% in the DICE score compared with SAM's zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation. Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at https://github.com/Emilychenlin/BA-TTA-SAM.
VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
EgoLCD: Egocentric Video Generation with Long Context Diffusion
Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance
Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimally in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods. Our code is available in the supplementary material.
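A deterministic, entropy-ranked token selection of the kind described above could look like the following sketch. It assumes each token is a patch of quantized pixel intensities and scores it by the Shannon entropy of its value histogram; the helper names, the scoring rule, and the `keep` budget are assumptions for illustration, not the paper's exact definition.

```python
import math
from collections import Counter


def token_entropy(values):
    """Shannon entropy (in bits) of the value histogram of one token/patch."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_high_entropy(tokens, keep):
    """Return the indices of the `keep` highest-entropy tokens, in index order.

    Ties are broken by the lower index, so the selection is deterministic.
    """
    scores = [token_entropy(t) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:keep])
```

Because the ranking depends only on the input, the same image always yields the same selection, in contrast to the random masking used in a standard MAE.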
UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers
Recent image diffusion transformers achieve high-fidelity generation at their training resolution, but struggle to generate images beyond that scale, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K×6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at https://thu-ml.github.io/ultraimage.github.io/.
comment: Project page: https://thu-ml.github.io/ultraimage.github.io
Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model
Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in the literature and compare their performance in terms of quality and diversity metrics. Second, we compare the training time under various configurations to shed light on how to speed up the training process of motion diffusion models. Finally, we also conduct evaluation analysis on a large motion dataset. The results of our experiments indicate clear performance differences across motion representations in diverse datasets. Our results also demonstrate the impacts of distinct configurations on model training and suggest the importance and effectiveness of these decisions on the outcomes of motion diffusion models.
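The abstract describes v only as a weighted sum of motion data and noise. As a hedged illustration, the sketch below uses the common v-parameterization with an angular schedule, where alpha = cos(phi) and sigma = sin(phi); these weights and the function names are assumptions and may differ from the paper's exact formulation.

```python
import math


def v_target(x0, eps, phi):
    """Return the noised sample x_t and the v target for clean data x0.

    Assumed schedule: alpha = cos(phi), sigma = sin(phi), so alpha^2 + sigma^2 = 1.
    x_t = alpha * x0 + sigma * eps
    v   = alpha * eps - sigma * x0
    """
    a, s = math.cos(phi), math.sin(phi)
    x_t = [a * x + s * e for x, e in zip(x0, eps)]
    v = [a * e - s * x for x, e in zip(x0, eps)]
    return x_t, v


def recover_x0(x_t, v, phi):
    """Invert the parameterization: x0 = alpha * x_t - sigma * v."""
    a, s = math.cos(phi), math.sin(phi)
    return [a * xt - s * vi for xt, vi in zip(x_t, v)]
```

Under this schedule a network that predicts v gives access to both the clean motion and the noise, since the noise is likewise recoverable as sigma * x_t + alpha * v.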
Shift-Window Meets Dual Attention: A Multi-Model Architecture for Specular Highlight Removal
Inevitable specular highlights in practical environments severely impair visual performance, thus degrading task effectiveness and efficiency. Although many methods focus on local information from convolutional neural network models or global information from transformer models, a single-type model faces a modeling dilemma between local fine-grained details and global long-range dependencies, and its performance deteriorates for specular highlights of different scales. Therefore, to accommodate specular highlights of all scales, we propose a multi-model architecture for specular highlight removal (MM-SHR) that effectively captures fine-grained features in highlight regions and models long-range dependencies between highlight and highlight-free areas. Specifically, we employ convolution operations to extract local details in the shallow layers of MM-SHR, and utilize the attention mechanism to capture global features in the deep layers, ensuring both operation efficiency and removal accuracy. To model long-range dependencies without compromising computational complexity, we adopt a coarse-to-fine strategy and propose the Omni-Directional Attention Integration Block (OAIBlock) and the Adaptive Region-Aware Hybrid-Domain Dual Attention Convolutional Network (HDDAConv), which leverage omni-directional pixel-shifting and window-dividing operations on the raw features to achieve specular highlight removal. Extensive experimental results on three benchmark tasks and six types of surface materials demonstrate that MM-SHR outperforms state-of-the-art methods in both accuracy and efficiency for specular highlight removal. The implementation will be made publicly available at https://github.com/Htcicv/MM-SHR.
Controllable Long-term Motion Generation with Extended Joint Targets WACV 2026
Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.
comment: WACV 2026
Not All Birds Look The Same: Identity-Preserving Generation For Birds
Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.
DeRA: Decoupled Representation Alignment for Video Tokenization
This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
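To make the idea of suppressing gradient components along conflicting directions concrete, here is a small sketch in the spirit of projection-based gradient surgery. It is not the paper's SACP module: the symmetric projection rule, the function names, and the use of plain lists for gradients are all assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g_a, g_b):
    """Symmetrically remove the conflicting component between two gradients.

    If the gradients conflict (negative inner product), each one is projected
    onto the plane orthogonal to the other ORIGINAL gradient; otherwise both
    are returned unchanged.
    """
    d = dot(g_a, g_b)
    if d >= 0:
        return list(g_a), list(g_b)
    # d < 0 implies both gradients are non-zero, so the norms are positive.
    na, nb = dot(g_a, g_a), dot(g_b, g_b)
    new_a = [x - (d / nb) * y for x, y in zip(g_a, g_b)]
    new_b = [y - (d / na) * x for x, y in zip(g_a, g_b)]
    return new_a, new_b
```

After projection, neither gradient has a component pointing against the other's original direction, so summing them no longer cancels progress on either objective.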
Feature Engineering vs. Deep Learning for Automated Coin Grading: A Comparative Study on Saint-Gaudens Double Eagles
We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data's thin and know-how matters more than raw compute.
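The two headline metrics, exact-match accuracy and accuracy within a grade tolerance, can be computed as in the sketch below. It assumes grades are encoded as consecutive integer steps so that a tolerance of 3 means three grade steps; the function name and the sample labels in the usage lines are made up for illustration.

```python
def accuracy_within(y_true, y_pred, tolerance=0):
    """Fraction of predictions within `tolerance` grade steps of the label.

    tolerance=0 gives exact-match accuracy. Grades are assumed to be integer
    step indices, which is an assumption about the encoding.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Illustrative (made-up) labels: a wide tolerance can hide poor exact accuracy.
labels = [60, 62, 63, 65]
preds = [60, 64, 63, 61]
exact = accuracy_within(labels, preds)        # two of four match exactly
lenient = accuracy_within(labels, preds, 3)   # three of four within 3 steps
```

Reporting both numbers side by side makes it visible when a model scores well on the lenient metric while rarely predicting the exact grade.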
UniTS: Unified Time Series Generative Model for Remote Sensing
One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning
The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.
GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis AAAI2026
Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to relax these prerequisites, we propose GuidNoise, a Single-Pair Guided Diffusion for generalized noise synthesis, which uses a single noisy/clean pair as the guidance; such a pair can often be obtained easily from within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model's backward process, making the model more adept at generating realistic noise distributions. GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at https://github.com/chjinny/GuidNoise.
comment: AAAI2026
StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios
As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving
End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.
Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: https://github.com/Dixin-Lab/SSMP.
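The generation loop described above, which commits high-confidence positions at each step and re-masks the rest, can be sketched as follows. The `predict` callback, the even commit schedule, and the use of `None` as the mask symbol are assumptions for illustration; the paper's model and schedule may differ.

```python
import math


def progressive_fill(num_slots, predict, steps):
    """Iteratively fill a masked sequence, keeping confident predictions.

    `predict(seq)` is a hypothetical model call that returns one
    (candidate, confidence) pair per slot, given the partially filled
    sequence in which masked slots are None.
    """
    seq = [None] * num_slots
    for step in range(steps):
        masked = [i for i, s in enumerate(seq) if s is None]
        if not masked:
            break
        preds = predict(seq)
        # Commit an even share of the remaining masked slots at this step;
        # the final step therefore commits everything that is left.
        k = max(1, math.ceil(len(masked) / (steps - step)))
        best = sorted(masked, key=lambda i: -preds[i][1])[:k]
        for i in best:
            seq[i] = preds[i][0]
        # All other masked slots stay None, i.e. they are re-masked and
        # predicted again with the newly committed context.
    return seq
```

Since low-confidence positions are predicted again after more context has been committed, early mistakes on uncertain slots are not locked in, which is the self-corrective behavior the abstract describes.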
Dataset creation for supervised deep learning-based analysis of microscopic images -- review of important considerations and recommendations
Supervised deep learning (DL) has attracted great interest for automated analysis of microscopic images, with an increasing body of literature supporting its potential. The development and validation of these DL models rely heavily on the availability of high-quality, large-scale datasets. However, creating such datasets is a complex and resource-intensive process, often hindered by challenges such as time constraints, domain variability, and risks of bias in image collection and label creation. This review provides a comprehensive guide to the critical steps in dataset creation, including: 1) image acquisition, 2) selection of annotation software, and 3) annotation creation. In addition to ensuring a sufficiently large number of images, it is crucial to address sources of image variability (domain shifts) - such as those related to slide preparation and digitization - that could lead to algorithmic errors if not adequately represented in the training data. Key quality criteria for annotations are the three "C"s: correctness, completeness, and consistency. This review explores methods to enhance annotation quality through the use of advanced techniques that mitigate the limitations of single annotators. To support dataset creators, a standard operating procedure (SOP) is provided as supplemental material, outlining best practices for dataset development. Furthermore, the article underscores the importance of open datasets in driving innovation and enhancing reproducibility of DL research. By addressing the challenges and offering practical recommendations, this review aims to advance the creation and availability of high-quality, large-scale datasets, ultimately contributing to the development of generalizable and robust DL models for pathology applications.
ARCAS: An Augmented Reality Collision Avoidance System with SLAM-Based Tracking for Enhancing VRU Safety
Vulnerable road users (VRUs) face high collision risks in mixed traffic, yet most existing safety systems prioritize driver or vehicle assistance over direct VRU support. This paper presents ARCAS, a real-time augmented reality collision avoidance system that provides personalized spatial alerts to VRUs via wearable AR headsets. By fusing roadside 360-degree 3D LiDAR with SLAM-based headset tracking and an automatic 3D calibration procedure, ARCAS accurately overlays world-locked 3D bounding boxes and directional arrows onto approaching hazards in the user's passthrough view. The system also enables multi-headset coordination through shared world anchoring. Evaluated in real-world pedestrian interactions with e-scooters and vehicles (180 trials), ARCAS nearly doubled pedestrians' time-to-collision and increased counterparts' reaction margins by up to 4x compared to unaided-eye conditions. Results validate the feasibility and effectiveness of LiDAR-driven AR guidance and highlight the potential of wearable AR as a promising next-generation safety tool for urban mobility.
comment: 8 pages, 3 figures, 1 table
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that covers 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.
Inferring Compositional 4D Scenes without Ever Seeing One
Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.
comment: Project page: https://github.com/insait-institute/COM4D
CARD: Correlation Aware Restoration with Diffusion
Denoising diffusion models have achieved state-of-the-art performance in image restoration by modeling the process as sequential denoising steps. However, most approaches assume independent and identically distributed (i.i.d.) Gaussian noise, while real-world sensors often exhibit spatially correlated noise due to readout mechanisms, limiting their practical effectiveness. We introduce Correlation Aware Restoration with Diffusion (CARD), a training-free extension of DDRM that explicitly handles correlated Gaussian noise. CARD first whitens the noisy observation, which converts the noise into an i.i.d. form. Then, the diffusion restoration steps are replaced with noise-whitened updates, which inherit DDRM's closed-form sampling efficiency while also handling correlated noise. To emphasize the importance of addressing correlated noise, we contribute CIN-D, a novel correlated noise dataset captured across diverse illumination conditions to evaluate restoration methods on real rolling-shutter sensor noise. This dataset fills a critical gap in the literature for experimental evaluation with real-world correlated noise. Experiments on standard benchmarks with synthetic correlated noise and on CIN-D demonstrate that CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks.
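As a rough illustration of the whitening step, the sketch below assumes a linear observation model y = Hx + n with n ~ N(0, Σ) and a known covariance Σ, and whitens with a Cholesky factor. The function names and the toy AR(1)-style covariance are assumptions made for this example and are not taken from CARD.

```python
import numpy as np


def whiten(y, H, cov):
    """Whiten y = H x + n with n ~ N(0, cov).

    With cov = L L^T (Cholesky), multiplying by L^{-1} gives noise with
    identity covariance, i.e. i.i.d. Gaussian noise.
    """
    L = np.linalg.cholesky(cov)
    y_w = np.linalg.solve(L, y)  # L^{-1} y
    H_w = np.linalg.solve(L, H)  # L^{-1} H
    return y_w, H_w


def correlated_cov(n, sigma=0.1, rho=0.8):
    """Toy covariance (assumed): correlation decays as rho^|i-j|."""
    idx = np.arange(n)
    return sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])
```

After whitening, a restoration method that expects i.i.d. noise can be applied to the pair (y_w, H_w) in place of (y, H).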
Age-Inclusive 3D Human Mesh Recovery for Action-Preserving Data Anonymization
While three-dimensional (3D) shape and pose estimation is a highly researched area that has yielded significant advances, the resulting methods, despite performing well for the adult population, generally fail to generalize effectively to children and infants. This paper addresses this challenge by introducing AionHMR, a comprehensive framework designed to bridge this domain gap. We propose an optimization-based method that extends a top-performing model by incorporating the SMPL-A body model, enabling the concurrent and accurate modeling of adults, children, and infants. Leveraging this approach, we generated pseudo-ground-truth annotations for publicly available child and infant image databases. Using these new training data, we then developed and trained a specialized transformer-based deep learning model capable of real-time 3D age-inclusive human reconstruction. Extensive experiments demonstrate that our methods significantly improve shape and pose estimation for children and infants without compromising accuracy on adults. Importantly, our reconstructed meshes serve as privacy-preserving substitutes for raw images, retaining essential action, pose, and geometry information while enabling the release of anonymized datasets. As a demonstration, we introduce the 3D-BabyRobot dataset, a collection of action-preserving 3D reconstructions of children interacting with robots. This work bridges a crucial domain gap and establishes a foundation for inclusive, privacy-aware, and age-diverse 3D human modeling.
IE2Video: Adapting Pretrained Diffusion Models for Event-Based Video Reconstruction
Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline -- reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.
DEAR: Dataset for Evaluating the Aesthetics of Rendering
Traditional Image Quality Assessment (IQA) focuses on quantifying technical degradations such as noise, blur, or compression artifacts, using both full-reference and no-reference objective metrics. However, evaluation of rendering aesthetics, a growing domain relevant to photographic editing, content creation, and AI-generated imagery, remains underexplored due to the lack of datasets that reflect the inherently subjective nature of style preference. In this work, a novel benchmark dataset designed to model human aesthetic judgments of image rendering styles is introduced: the Dataset for Evaluating the Aesthetics of Rendering (DEAR). Built upon the MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing, with each image pair evaluated by 25 distinct human evaluators and 13,648 evaluators participating in total. These annotations capture nuanced, context-sensitive aesthetic preferences, enabling the development and evaluation of models that go beyond traditional distortion-based IQA, focusing on a new task: Evaluation of Aesthetics of Rendering (EAR). The data collection pipeline is described, human voting patterns are analyzed, and multiple use cases are outlined, including style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling. To the best of the authors' knowledge, DEAR is the first dataset to systematically address the assessment of image rendering aesthetics grounded in subjective human preferences. A subset of 100 images with their annotations is published on HuggingFace (huggingface.co/datasets/vsevolodpl/DEAR).
Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models
Latent inpainting in diffusion models still relies almost universally on linearly interpolating VAE latents under a downsampled mask. We propose a key principle for compositing image latents: Pixel-Equivalent Latent Compositing (PELC). An equivalent latent compositor should be the same as compositing in pixel space. This principle enables full-resolution mask control and true soft-edge alpha compositing, even though VAEs compress images 8x spatially. Modern VAEs capture global context beyond patch-aligned local structure, so linear latent blending cannot be pixel-equivalent: it produces large artifacts at mask seams and global degradation and color shifts. We introduce DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and an off-manifold residual correction to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning and adds only 0.07% of FLUX.1-Dev's parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing, as we demonstrate on a complex color-correction task.
comment: 16 pages, 10 figures
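To make the contrast concrete, the sketch below places pixel-space alpha compositing next to the common baseline of linearly blending latents under a downsampled mask. It is a toy illustration on plain arrays: the 8x factor comes from the abstract, while the use of average pooling for the mask and all function names are assumptions, and no VAE or DecFormer component is modeled.

```python
import numpy as np


def pixel_composite(fg, bg, alpha):
    """Pixel-space alpha compositing; alpha has shape (H, W, 1) in [0, 1]."""
    return alpha * fg + (1.0 - alpha) * bg


def downsample_mask(alpha, factor=8):
    """Average-pool an (H, W) mask by `factor` (pooling choice is assumed)."""
    h, w = alpha.shape
    blocks = alpha.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def linear_latent_blend(z_fg, z_bg, alpha, factor=8):
    """Baseline: interpolate (h, w, C) latents under the downsampled mask."""
    m = downsample_mask(alpha, factor)[..., None]
    return m * z_fg + (1.0 - m) * z_bg
```

Under the pixel-equivalence principle stated in the abstract, decoding a fused latent should match `pixel_composite` applied to the decoded images at full mask resolution; the linear baseline has no mechanism to enforce that.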
CoCoIns: Consistent Subject Generation via Contrastive Instantiated Concepts
While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple generations limits their application to long-form content generation. Existing approaches require time-consuming fine-tuning, reference images for all subjects, or access to previously generated content. We introduce Contrastive Concept Instantiation (CoCoIns), a framework that effectively synthesizes consistent subjects across multiple independent generations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with specific concept instances. Users can generate consistent subjects by reusing the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to distinguish between different combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining greater flexibility. We also demonstrate the potential for extending CoCoIns to multiple subjects and other object categories.
comment: TMLR 2025. Project page: https://contrastive-concept-instantiation.github.io
MORPH: PDE Foundation Models with Arbitrary Data Modality
We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D--3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attention, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
Beyond I-Con: Exploring New Dimension of Distance Measures in Representation Learning
The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) supervised contrastive learning with Euclidean distance as the feature space metric is improved by replacing the standard loss function with Jensen-Shannon divergence (JSD); (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded $f$-divergence. Our results highlight the importance of considering divergence choices in representation learning optimization.
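For readers who want the three divergences side by side, here is a small sketch using their standard textbook definitions on discrete distributions. It is not code from the paper; the `eps` smoothing term and the function names are choices made for this example.

```python
import numpy as np


def kl(p, q, eps=1e-12):
    """KL(p || q): asymmetric and unbounded as q approaches 0 where p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def total_variation(p, q):
    """TV distance: symmetric and bounded in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.sum(np.abs(p - q)))


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log 2 (in nats)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

When one distribution assigns (near-)zero mass to an outcome that the other supports, the KL divergence in that direction becomes very large, while TV and JSD stay within their bounds. This is the asymmetry and unboundedness the abstract refers to.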
SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition
Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, sample organization during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED using only 4096D global descriptors. Code and models will be released upon acceptance.
comment: 23 pages
Efficient stereo matching on embedded GPUs with zero-means cross correlation
Mobile stereo-matching systems have become an important part of many applications, such as automated-driving vehicles and autonomous robots. Accurate stereo-matching methods usually lead to high computational complexity; however, mobile platforms have only limited hardware resources to keep their power consumption low; this makes it difficult to maintain both an acceptable processing speed and accuracy on mobile platforms. To resolve this trade-off, we herein propose a novel acceleration approach for the well-known zero-means normalized cross correlation (ZNCC) matching cost calculation algorithm on a Jetson Tx2 embedded GPU. In our method for accelerating ZNCC, target images are scanned in a zigzag fashion to efficiently reuse one pixel's computation for its neighboring pixels; this reduces the amount of data transmission and increases the utilization of on-chip registers, thus increasing the processing speed. As a result, our method is 2X faster than the traditional image scanning method, and 26% faster than the latest NCC method. By combining this technique with the domain transformation (DT) algorithm, our system shows a real-time processing speed of 32 fps on a Jetson Tx2 GPU for 1,280x384 pixel images with a maximum disparity of 128. Additionally, the evaluation results on the KITTI 2015 benchmark show that our combined system is more accurate than the same algorithm combined with census by 7.26%, while maintaining almost the same processing speed. Source Code: https://github.com/changqiong/Z2ZNCC.git
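For reference, the ZNCC matching cost itself can be written in a few lines. The sketch below is a plain, unoptimized CPU version with a brute-force disparity search; it does not reproduce the zigzag scanning or any GPU-specific optimization from the paper, and the function names, window size, and `eps` term are assumptions for this example.

```python
import numpy as np


def zncc(a, b, eps=1e-8):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    a0 = a - a.mean()
    b0 = b - b.mean()
    denom = np.sqrt(np.sum(a0**2) * np.sum(b0**2)) + eps
    return float(np.sum(a0 * b0) / denom)


def best_disparity(left, right, y, x, half=2, max_disp=16):
    """Brute-force search: pick the disparity with the highest ZNCC score."""
    ref = left[y - half:y + half + 1, x - half:x + half + 1]
    scores = []
    for d in range(max_disp):
        if x - d - half < 0:  # candidate window would leave the image
            break
        cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
        scores.append(zncc(ref, cand))
    return int(np.argmax(scores))
```

Because both patches are mean-centered and normalized, the score is unaffected by a positive gain and offset change between the two views, which is the usual reason for choosing ZNCC as a matching cost. Adjacent windows also share most of their pixels, which is the redundancy that reuse-oriented scanning schemes exploit.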
SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We plan to make the benchmark available to the community.
comment: v2 preprint. Fixed some typos, add a discussion about limitation, provide pseudo-codes for eval
SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation
Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, which results in poor performance when translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets with a BLEU-4 score of 22.1, a letter accuracy of 73.2%, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions WACV 2026
Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality. The MMHOI dataset is publicly available at https://zenodo.org/records/17711786.
comment: Accepted to WACV 2026
"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conducted a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find GPT-4o achieves the highest task success rate. We further conduct a user study to reveal concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.
comment: 17 pages
Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment AAAI 2026
Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3$\times$, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64$\times$64 and 256$\times$256. Our code is available at https://github.com/MCG-NJU/FlowBack.
comment: Accepted by AAAI 2026
Polygon Intersection-over-Union Loss for Viewpoint-Agnostic Monocular 3D Vehicle Detection
Monocular 3D object detection is a challenging task because depth information is difficult to obtain from 2D images. A subset of viewpoint-agnostic monocular 3D detection methods also does not explicitly leverage scene homography or geometry during training, meaning that a model trained this way can detect objects in images from arbitrary viewpoints. Such works predict the projections of the 3D bounding boxes on the image plane to estimate the location of the 3D boxes, but these projections are not rectangular, so the calculation of IoU between these projected polygons is not straightforward. This work proposes an efficient, fully differentiable algorithm for the calculation of IoU between two convex polygons, which can be utilized to compute the IoU between two 3D bounding box footprints viewed from an arbitrary angle. We test the performance of the proposed polygon IoU loss (PIoU loss) on three state-of-the-art viewpoint-agnostic 3D detection models. Experiments demonstrate that the proposed PIoU loss converges faster than L1 loss and that in 3D detection models, a combination of PIoU loss and L1 loss gives better results than L1 loss alone (+1.64% AP70 for MonoCon on cars, +0.18% AP70 for RTM3D on cars, and +0.83%/+2.46% AP50/AP25 for MonoRCNN on cyclists).
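As a rough illustration of the geometry involved, the IoU of two convex polygons can be sketched with Sutherland-Hodgman clipping plus the shoelace area formula. This is a plain-Python, non-differentiable sketch and not the paper's algorithm; the function names (`polygon_iou`, `clip_convex`, and so on) are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def signed_area(poly: List[Point]) -> float:
    """Shoelace formula; positive for counter-clockwise vertex order."""
    total = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def ensure_ccw(poly: List[Point]) -> List[Point]:
    """Return the polygon with counter-clockwise orientation."""
    return list(poly) if signed_area(poly) >= 0 else list(poly)[::-1]


def clip_convex(subject: List[Point], clip: List[Point]) -> List[Point]:
    """Sutherland-Hodgman: clip `subject` against each edge of a convex CCW `clip`."""
    output = list(subject)
    n = len(clip)
    for i in range(n):
        if not output:
            break  # nothing left to clip: the polygons do not overlap
        a, b = clip[i], clip[(i + 1) % n]

        def side(p: Point) -> float:
            # > 0 when p lies to the left of the directed edge a -> b (inside)
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        points, output = output, []
        for j in range(len(points)):
            p, q = points[j], points[(j + 1) % len(points)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # parameter of the crossing point on segment p -> q
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output


def polygon_iou(poly_a: List[Point], poly_b: List[Point]) -> float:
    """IoU of two convex polygons given as vertex lists in either orientation."""
    a, b = ensure_ccw(poly_a), ensure_ccw(poly_b)
    inter = abs(signed_area(clip_convex(a, b)))
    union = abs(signed_area(a)) + abs(signed_area(b)) - inter
    return inter / union if union > 0 else 0.0
```

For two 2x2 squares offset by one unit in each axis, the overlap is a unit square, so `polygon_iou` returns 1/7. A differentiable variant would need the same steps expressed in an autodiff framework, which this sketch does not attempt.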
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- when answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. Dataset: https://huggingface.co/datasets/gorjanradevski/dave Code: https://github.com/gorjanradevski/dave
comment: First two authors contributed equally
EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?
Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs. All code and pretrained models are available at https://github.com/pierreadorni/EoS-FM.
TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents AAAI 2026
Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectories across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10\% on multiple benchmarks, showing the effectiveness of the GUI-Net dataset and underscoring the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.
comment: AAAI 2026
ENTIRE: Learning-based Volume Rendering Time Prediction
We introduce ENTIRE, a novel deep learning-based approach for fast and accurate volume rendering time prediction. Predicting rendering time is inherently challenging due to its dependence on multiple factors, including volume data characteristics, image resolution, camera configuration, and transfer function settings. Our method addresses this by first extracting a feature vector that encodes structural volume properties relevant to rendering performance. This feature vector is then integrated with additional rendering parameters, such as image resolution, camera setup, and transfer function settings, to produce the final prediction. We evaluate ENTIRE across multiple rendering frameworks (CPU- and GPU-based) and configurations (with and without single-scattering) on diverse datasets. The results demonstrate that our model achieves high prediction accuracy with fast inference speed. Furthermore, we showcase ENTIRE's effectiveness in two case studies, where it enables dynamic parameter adaptation for stable frame rates and load balancing.
JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination, where text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; and (2) reward hacking, where dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity. Project page: https://jarvisevo.vercel.app/
comment: 31 pages, 18 figures
Integrating Skeleton Based Representations for Robust Yoga Pose Classification Using Deep Learning Models
Yoga is a popular form of exercise worldwide due to its spiritual and physical health benefits, but incorrect postures can lead to injuries. Automated yoga pose classification has therefore gained importance to reduce reliance on expert practitioners. While human pose keypoint extraction models have shown high potential in action recognition, systematic benchmarking for yoga pose recognition remains limited, as prior works often focus solely on raw images or a single pose extraction model. In this study, we introduce a curated dataset, 'Yoga-16', which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception), using three input modalities (direct images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 with MediaPipe Pose skeleton input. Additionally, we provide interpretability analysis using Grad-CAM, offering insights into model decision-making for yoga pose classification with cross-validation analysis.
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
comment: https://gigabrain0.github.io/
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to adaptively focus and zoom in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, enabling further refinement of the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.
comment: https://github.com/xtong-zhang/Chain-of-Focus
OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation ECCV2024
Category-level articulated object pose estimation focuses on the pose estimation of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to solve this task. Our model consistently generates reconstruction with a canonical pose and joint state for the entire input object, and it estimates object-level poses that reduce overall pose variance and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach significantly outperforms previous self-supervised methods and is comparable to the state-of-the-art supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a new real-world articulated object benchmark dataset.
comment: published in ECCV2024
Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields
Novel-view synthesis is an important problem in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent methods like 3D Gaussian Splatting (3DGS) have become the preferred method for this task, providing high-quality novel views in real time. However, training a 3DGS model is slow, often taking 30 minutes for a scene with 200 views. In contrast, our goal is to reduce the optimization time by training for fewer steps while maintaining high rendering quality. Specifically, we combine the guidance from both the position error and the appearance error to achieve a more effective densification. To balance the rate between adding new Gaussians and fitting old Gaussians, we develop a convergence-aware budget control mechanism. Moreover, to make the densification process more reliable, we selectively add new Gaussians from mostly visited regions. With these designs, we reduce the Gaussian optimization steps to one-third of the previous approach while achieving a comparable or even better novel view rendering quality. To further facilitate the rapid fitting of 4K resolution images, we introduce a dilation-based rendering technique. Our method, Turbo-GS, speeds up optimization for typical scenes and scales well to high-resolution (4K) scenarios on standard datasets. Through extensive experiments, we show that our method is significantly faster in optimization than other methods while retaining quality. Project page: https://ivl.cs.brown.edu/research/turbo-gs.
comment: Project page: https://ivl.cs.brown.edu/research/turbo-gs
Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
In this paper, we propose Robo-SGG, a plug-and-play module for robust scene graph generation (SGG). Unlike standard SGG, robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between the clean and corrupted images. Existing SGG methods suffer from degraded performance due to shifted visual features (e.g., corruption interference or occlusions). To obtain robust visual features, we leverage layout information, which represents the global structure of an image and is robust to domain shift, to enhance the robustness of SGG methods under corruption. Specifically, we employ Instance Normalization (IN) to alleviate the domain-specific variations and recover the robust structural features (i.e., the positional and semantic relationships among objects) by the proposed Layout-Oriented Restitution. Furthermore, under corrupted images, we introduce a Layout-Embedded Encoder (LEE) that adaptively fuses layout and visual features via a gating mechanism, enhancing the robustness of positional and semantic representations for objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating the state-of-the-art method into our proposed Robo-SGG, we achieve relative improvements of 6.3%, 11.1%, and 8.0% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C benchmark, respectively, and achieve new state-of-the-art performance on the corrupted scene graph generation benchmarks (VG-C and GQA-C). We will release our source code and model.
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships WACV 2026
Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift -- conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives. Our code is publicly available at https://github.com/CyberAgentAI/multimodal-adversarial-training.
comment: WACV 2026 Accepted. Code available at https://github.com/CyberAgentAI/multimodal-adversarial-training
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
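The two reward signals described above can be sketched as follows: each sampled answer is rewarded by its empirical frequency among the samples for one test input, and the whole group is penalized by the entropy of that empirical distribution. This is only an illustrative sketch; the exact reward formulation, the way the two terms are combined, and the `entropy_weight` parameter are assumptions and are not taken from the paper.

```python
import math
from collections import Counter
from typing import List, Sequence


def frequency_rewards(answers: Sequence[str]) -> List[float]:
    """Reward each sampled answer by how often it occurs among the samples."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


def empirical_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def combined_rewards(answers: Sequence[str], entropy_weight: float = 0.5) -> List[float]:
    """Frequency reward minus an entropy penalty shared by all samples.

    `entropy_weight` is a hypothetical knob introduced for this sketch.
    """
    penalty = entropy_weight * empirical_entropy(answers)
    return [r - penalty for r in frequency_rewards(answers)]
```

With the samples `["cat", "cat", "dog", "cat"]`, the majority answer receives a frequency reward of 0.75 and the minority answer 0.25; a fully consistent group has zero entropy and therefore no penalty.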
A lightweight detector for real-time detection of remote sensing images
Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.
comment: wrong results
Jina-VLM: Small Multilingual Vision Language Model
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .
comment: 18 pages, 1-7 main content, 13-18 appendix for tables and dataset
There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training
Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our framework achieves state-of-the-art (SOTA) performance on ImageNet. Specifically, our diffusion model reaches an FID of 1.58 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods and VAE-based counterparts by a large margin in both generation quality and training efficiency. In a direct comparison, our model significantly outperforms DiT while using only around 30\% of its training compute.
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Video subtitles play a crucial role in short videos and movies, as they not only help models better understand video content but also support applications such as video translation and content retrieval. Existing video subtitle extraction methods typically rely on multi-stage frameworks, where errors accumulate across stages and temporal dependencies are underutilized due to frame-wise processing. Moreover, although some Large Vision-Language Models (LVLMs) possess strong OCR capabilities, predicting accurate timestamps for subtitle texts remains challenging. To this end, we propose an End-to-end Video subtitle Extraction framework based on LVLMs, named EVE, which can output subtitles and their timestamps simultaneously. Specifically, we introduce a dual-branch Spatiotemporal Subtitle-Salient (S\textsuperscript{3}) Module that serves as an adapter for LVLMs, capable of representing subtitle-related content and considering inter-frame correlations using only a small number of tokens. Within this module, the Spatial Semantic Context Aggregate branch aggregates high-level global semantics to provide spatial visual contextual information, while the Temporal Subtitle Token Query branch explicitly queries subtitle-relevant tokens while considering temporal correlation across frames. The small number of tokens retained by the S\textsuperscript{3} module are fed to the language model, which then directly outputs the subtitle text along with its timestamps. Furthermore, we construct the first large-scale dataset dedicated to video subtitle extraction, ViSa, containing over 2.5M videos with timestamped and bilingual annotation, thereby providing the community with a well-organized training and evaluation benchmark.
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
ImageNot: A contrast with ImageNet preserves model rankings
We introduce ImageNot, a dataset constructed explicitly to be drastically different than ImageNet while matching its scale. ImageNot is designed to test the external validity of deep learning progress on ImageNet. We show that key model architectures developed for ImageNet over the years rank identically to how they rank on ImageNet when trained from scratch and evaluated on ImageNot. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models when trained and evaluated on an entirely different dataset. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset.
WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization
Visual geo-localization for drones faces critical degradation under weather perturbations, e.g., rain and fog, where existing methods struggle with two inherent limitations: 1) heavy reliance on limited weather categories that constrain generalization, and 2) suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. We present WeatherPrompt, a multi-modality learning paradigm that establishes weather-invariant representations by fusing the image embedding with the text context. Our framework introduces two key contributions. First, a Training-free Weather Reasoning mechanism employs off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It improves scalability to unseen or complex weather and can reflect different weather intensities. Second, to better disentangle the scene and weather features, we propose a multi-modality framework with a dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized by cross-modal objectives, including image-text contrastive learning and image-text matching, which map the same scene under different weather conditions closer together in the representation space. Extensive experiments validate that, under diverse weather conditions, our method achieves competitive recall rates compared to state-of-the-art drone geo-localization methods. Notably, it improves Recall@1 by 13.37\% under night conditions and by 18.69\% under fog and snow conditions.
Changes in Gaza: DINOv3-Powered Multi-Class Change Detection for Damage Assessment in Conflict Zones
Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. The multi-scale cross-attention mechanism allows for precise localization of subtle semantic changes, while the difference siamese structure enhances inter-class feature discrimination, enabling fine-grained semantic change detection. Furthermore, a simple yet powerful lightweight decoder is designed to generate clear detection maps while maintaining high efficiency. We also release a new Gaza-Change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. We evaluate our method on the Gaza-Change dataset and two classical datasets: SECOND and Landsat-SCD. Experimental results demonstrate that our proposed approach effectively addresses the multi-class change detection (MCD) task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.
Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs
Depth-wise pruning accelerates LLM inference in resource-constrained scenarios but suffers from performance degradation due to direct removal of entire Transformer layers. This paper reveals "Patch-like" redundancy across layers via correlation analysis of the outputs of different layers in reproducing kernel Hilbert space, demonstrating that consecutive layers exhibit high functional similarity. Building on this observation, this paper proposes Sliding-Window Merging (SWM), a dynamic compression method that selects consecutive layers from top to bottom using a pre-defined similarity threshold and compacts patch-redundant layers through parameter consolidation, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, in the experiment with 35% pruning on the Vicuna-7B model, our method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our code is available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
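The top-to-bottom window selection described above can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: cosine similarity stands in for the kernel-based correlation measure, `layer_outputs` is a hypothetical list holding one summary vector per layer, and the function only returns which consecutive layers would be grouped. The abstract does not specify the parameter consolidation step, so it is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (stand-in metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sliding_window_groups(layer_outputs, threshold):
    """Group consecutive layers, scanning from the top layer downward.

    layer_outputs[i] is a hypothetical feature vector summarising layer i.
    A window keeps growing downward while the next lower layer is at least
    `threshold`-similar to the window's top layer. Returns index groups,
    ordered from the top of the network to the bottom.
    """
    groups = []
    top = len(layer_outputs) - 1
    while top >= 0:
        low = top
        while low >= 1 and cosine(layer_outputs[top], layer_outputs[low - 1]) >= threshold:
            low -= 1
        groups.append(list(range(low, top + 1)))
        top = low - 1
    return groups


# Toy example: layers 1 to 3 produce nearly identical outputs, layer 0 differs.
outputs = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.0, 1.0]]
groups = sliding_window_groups(outputs, threshold=0.9)
```

In this toy run, layers 1 to 3 fall into one window and layer 0 stays on its own; a real system would then merge each multi-layer group into a single layer.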
VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing
Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective, and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA's generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
Multimodal Markup Document Models for Graphic Design Completion
We introduce MarkupDM, a multimodal markup document model that represents graphic design as an interleaved multimodal document consisting of both markup language and images. Unlike existing holistic approaches that rely on an element-by-attribute grid representation, our representation accommodates variable-length elements, type-dependent attributes, and text content. Inspired by fill-in-the-middle training in code generation, we train the model to complete the missing part of a design document from its surrounding context, allowing it to treat various design tasks in a unified manner. Our model also supports image generation by predicting discrete image tokens through a specialized tokenizer with support for image transparency. We evaluate MarkupDM on three tasks, attribute value, image, and text completion, and demonstrate that it can produce plausible designs consistent with the given context. To further illustrate the flexibility of our approach, we evaluate our approach on a new instruction-guided design completion task where our instruction-tuned MarkupDM compares favorably to state-of-the-art image editing models, especially in textual completion. These findings suggest that multimodal language models with our document representation can serve as a versatile foundation for broad design automation.
comment: Accepted by ACM Multimedia 2025, Project page: https://cyberagentailab.github.io/MarkupDM/
Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models
This study explores the use of Generative Adversarial Networks (GANs) to detect AI deepfakes and fraudulent activities in online payment systems. With the growing prevalence of deepfake technology, which can manipulate facial features in images and videos, the potential for fraud in online transactions has escalated. Traditional security systems struggle to identify these sophisticated forms of fraud. This research proposes a novel GAN-based model that enhances online payment security by identifying subtle manipulations in payment images. The model is trained on a dataset consisting of real-world online payment images and deepfake images generated using advanced GAN architectures, such as StyleGAN and DeepFake. The results demonstrate that the proposed model can accurately distinguish between legitimate transactions and deepfakes, achieving a high detection rate above 95%. This approach significantly improves the robustness of payment systems against AI-driven fraud. The paper contributes to the growing field of digital security, offering insights into the application of GANs for fraud detection in financial services. Keywords- Payment Security, Image Recognition, Generative Adversarial Networks, AI Deepfake, Fraudulent Activities
comment: The paper will be published and indexed by IEEE at 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE 2025)
ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs' ability to identify and correct these video reasoning errors. To bridge this gap, we propose ViRectify, a comprehensive benchmark to evaluate their fine-grained correction capability. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. In addition, we further propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction. It encourages the model to explicitly concentrate on error propagation and key timestamps for correction. Extensive evaluation across 16 advanced MLLMs demonstrates that our ViRectify serves as a challenging testbed, where GPT-5 achieves only 31.94% correction accuracy. Our framework enables a Qwen2.5-VL-7B to consistently outperform the variants of 72B on ViRectify, showing the effectiveness of our approach. Further analysis uncovers systematic asymmetries in error correction across models, and our dataset is also a valuable data resource to perform reflection learning. We believe ViRectify provides a new direction for comprehensively evaluating the advanced MLLMs in video reasoning.
comment: 22 pages, 11 figures
Collaborative Face Experts Fusion in Video Generation: Boosting Identity Consistency Across Large Face Poses
Current video generation models struggle with identity preservation under large face poses, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT architectures, and the lack of targeted coverage of large face poses in existing open-source video datasets. To address these, we present two key innovations. First, we propose Collaborative Face Experts Fusion (CoFE), which dynamically fuses complementary signals from three specialized experts within the DiT backbone: an identity expert that captures cross-pose invariant features, a semantic expert that encodes high-level visual context, and a detail expert that preserves pixel-level attributes such as skin texture and color gradients. Second, we introduce a data curation pipeline comprising three key components: Face Constraints to ensure diverse large-pose coverage, Identity Consistency to maintain stable identity across frames, and Speech Disambiguation to align textual captions with actual speaking behavior. This pipeline yields LaFID-180K, a large-scale dataset of pose-annotated video clips designed for identity-preserving video generation. Experimental results on several benchmarks demonstrate that our approach significantly outperforms state-of-the-art methods in face similarity, FID, and CLIP semantic alignment. Project page: https://rain152.github.io/CoFE/.
comment: Project page: https://rain152.github.io/CoFE/
MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning
Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.
comment: 21 pages, 13 figures, 9 tables
ViDiC: Video Difference Captioning
Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes, a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning AAAI
Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value cache scaling, due to their reliance on processing excessive redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and KV cache. Unlike most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatial-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features in a self-calibrating manner, guided by similarity to the original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of information bottleneck, offering a new insight into VideoLLMs' information flow. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is the first two-stage pruning framework that operates without requiring access to exposed attention scores, ensuring full compatibility with hardware acceleration techniques like Flash Attention.
comment: The 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26) Poster
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.
Downscaling climate projections to 1 km with single-image super resolution
High-resolution climate projections are essential for local decision-making. However, available climate projections have low spatial resolution (e.g. 12.5 km), which limits their usability. We address this limitation by leveraging single-image super-resolution models to statistically downscale climate projections to 1-km resolution. Since high-resolution climate projections are unavailable, we train models on a high-resolution observational gridded data set and apply them to low-resolution climate projections. We cannot evaluate downscaled climate projections with common metrics (e.g. pixel-wise root-mean-square error) because we lack ground-truth high-resolution climate projections. Therefore, we evaluate climate indicators computed at weather station locations. Experiments on daily mean temperature demonstrate that single-image super-resolution models can downscale climate projections without increasing the error of climate indicators compared to low-resolution climate projections.
Beyond the Ground Truth: Enhanced Supervision for Image Restoration
Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. Code is available at https://github.com/dhryougit/Beyond-the-Ground-Truth.
Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion
Multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent approaches to 4D generation still struggle with fundamental limitations, as they primarily rely on harnessing multiple video diffusion models with additional training or compute-intensive training of a full 4D diffusion model with limited real-world 4D data and large computational costs. To address these challenges, here we propose the first training-free 4D video generation method that leverages the off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) By designating the edge frames in the spatio-temporal sampling grid as key frames, we first synthesize them using a video diffusion model, leveraging a depth-based warping technique for guidance. This approach ensures structural consistency across the generated frames, preserving spatial and temporal coherence. (2) We then interpolate the remaining frames using a video diffusion model, constructing a fully populated and temporally coherent sampling grid while preserving spatial and temporal consistency. Through this approach, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
comment: project page: https://zero4dvid.github.io/
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Code: https://github.com/xuboshen/EgoDTM.
comment: Code: https://github.com/xuboshen/EgoDTM
BrainPath: A Biologically-Informed AI Framework for Individualized Aging Brain Generation
The global population is aging rapidly, and aging is a major risk factor for various diseases. It is an important task to predict how each individual's brain will age, as the brain supports many human functions. This capability can greatly facilitate healthcare automation by enabling personalized, proactive intervention and efficient healthcare resource allocation. However, this task is extremely challenging because of the brain's complex 3D anatomy. While there have been successes in natural image generation and brain MRI synthesis, existing methods fall short in generating individualized, anatomically faithful aging brain trajectories. To address these gaps, we propose BrainPath, a novel AI model that, given a single structural MRI of an individual, generates synthetic longitudinal MRIs that represent that individual's expected brain anatomy as they age. BrainPath introduces three architectural innovations: an age-aware encoder with biologically grounded supervision, a differential age conditioned decoder for anatomically faithful MRI synthesis, and a swap-learning strategy that implicitly separates stable subject-specific anatomy from aging effects. We further design biologically informed loss functions, including an age calibration loss and an age and structural perceptual loss, to complement the conventional reconstruction loss. This enables the model to capture subtle, temporally meaningful anatomical changes associated with aging. We apply BrainPath to two of the largest public aging datasets and conduct a comprehensive, multifaceted evaluation. Our results demonstrate BrainPath's superior performance in generation accuracy, anatomical fidelity, and cross-dataset generalizability, outperforming competing methods.
ChartQA-X: Generating Explanations for Visual Chart Reasoning WACV 2026
The ability to explain complex information from chart images is vital for effective data-driven decision-making. In this work, we address the challenge of generating detailed explanations alongside answering questions about charts. We present ChartQA-X, a comprehensive dataset comprising 30,799 chart samples across four chart types, each paired with contextually relevant questions, answers, and explanations. Explanations are generated and selected based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our human evaluation with 245 participants shows that model-generated explanations in ChartQA-X surpass human-written explanations in accuracy and logic and are comparable in terms of clarity and overall quality. Moreover, models fine-tuned on ChartQA-X show substantial improvements across various metrics, including absolute gains of up to 24.57 points in explanation quality, 18.96 percentage points in question-answering accuracy, and 14.75 percentage points on unseen benchmarks for the same task. By integrating explanatory narratives with answers, our approach enables agents to convey complex visual information more effectively, improving comprehension and fostering greater trust in the generated responses.
comment: WACV 2026. Project Page: https://teal-lab.github.io/chartqa-x
MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition
This paper employs a multimodal approach for continuous sign recognition by first using ML to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizing the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to take into account the convergence of sign properties and their dynamics that tend to cluster at sign boundaries. Another focus of this paper is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for detection of 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ a bit in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality and diversity of available data and limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaptation protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.
comment: Work done at the Hong Kong University of Science and Technology (Guangzhou). Project page: https://lotus-2.github.io/
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains
Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-object interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30 fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition module predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.
comment: 12 pages, 4 figures, In International Conference on Image Analysis and Processing
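The cascaded control flow described in this abstract, where a detector runs only on frames flagged as contact, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: `predicts_contact` and `detect_objects` are hypothetical callables standing in for the action recognition and object detection models, and frames are plain dictionaries.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def cascade(
    frames: Iterable[dict],
    predicts_contact: Callable[[dict], bool],
    detect_objects: Callable[[dict], List[str]],
) -> Iterator[Tuple[int, List[str]]]:
    """Run the (expensive) detector only on frames flagged as contact.

    Yields (frame_index, detected_labels) for each frame where the first
    stage reports a contact state; all other frames skip detection.
    """
    for index, frame in enumerate(frames):
        if predicts_contact(frame):
            yield index, detect_objects(frame)


# Toy stand-ins for the two models (purely illustrative).
stream = [
    {"contact": False, "objects": []},
    {"contact": True, "objects": ["screwdriver"]},
    {"contact": False, "objects": ["pliers"]},
]
results = list(
    cascade(
        stream,
        predicts_contact=lambda f: f["contact"],
        detect_objects=lambda f: f["objects"],
    )
)
```

Gating the second stage this way means the detector's cost is paid only for frames where an interaction is predicted, which is one plausible reason a cascade suits a real-time budget.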
AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging WACV 2026
While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.
comment: WACV 2026
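For readers unfamiliar with the segmentation metric quoted in the abstract above, the Dice score is twice the overlap between the predicted and reference masks divided by the sum of their sizes. The sketch below uses flat binary lists as toy masks; the mask values are invented for illustration and do not come from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two equal-length binary masks (lists of 0/1).

    Returns 1.0 when both masks are empty, a common convention that treats
    "nothing predicted, nothing present" as perfect agreement.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 2.0 * overlap / total if total else 1.0


# Toy masks: 3 predicted voxels, 2 reference voxels, 2 shared.
predicted = [1, 1, 1, 0]
reference = [1, 1, 0, 0]
score = dice_score(predicted, reference)  # 2 * 2 / (3 + 2)
```

Because the denominator counts both masks, small structures are penalized heavily by a few misplaced voxels.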
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models
Reasoning about motion and space is a fundamental cognitive capability that is required by multiple real-world applications. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only focus on static spatial relationships, and not dynamic awareness of motion and space, i.e., reasoning about the effect of egocentric and object motions on spatial relationships. Manually annotating such object and camera movements is expensive. Hence, we introduce SAT, a simulated spatial aptitude training dataset utilizing 3D simulators, comprising both static and dynamic spatial reasoning across 175K question-answer (QA) pairs and 20K scenes. Complementing this, we also construct a small (150 image-QAs) yet challenging dynamic spatial test set using real-world images. Leveraging our SAT datasets and 6 existing static spatial benchmarks, we systematically investigate what improves both static and dynamic spatial awareness. Our results reveal that simulations are surprisingly effective at imparting spatial aptitude to MLMs that translates to real images. We show that perfect annotations in simulation are more effective than existing approaches of pseudo-annotating real images. For instance, SAT training improves a LLaVA-13B model by an average of 11% and a LLaVA-Video-7B model by an average of 8% on multiple spatial benchmarks, including our real-image dynamic test set and spatial reasoning on long videos, even outperforming some large proprietary models. While reasoning over static relationships improves with synthetic training data, there is still considerable room for improvement for dynamic reasoning questions.
comment: Accepted to COLM 2025. Project webpage: https://arijitray.com/SAT/
SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
comment: Project page: https://nvlabs.github.io/SONIC/
Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression
Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., "How would you rate this image?"), assuming this mimics human rating behavior. Our analysis reveals that these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting by simply increasing the number of bins, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts substantially improves our already state-of-the-art image-only baseline. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information, surpassing mere statistical biases. We validate RvTC across two different MLLM architectures, demonstrating consistent improvements and method generalizability.
comment: WACV 2026
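The bin-based idea in the abstract above can be illustrated with a small sketch of regression as classification: a continuous score is mapped to one of `num_bins` equal-width classes and decoded back to the bin centre. The function names, the equal-width binning, and the centre decoding are assumptions made for illustration; the abstract does not specify how RvTC defines or decodes its bins.

```python
def score_to_bin(score, low, high, num_bins):
    """Map a continuous score in [low, high] to a class index in [0, num_bins - 1]."""
    width = (high - low) / num_bins
    index = int((score - low) / width)
    # Clamp so that score == high falls into the last bin.
    return min(max(index, 0), num_bins - 1)

def bin_to_score(index, low, high, num_bins):
    """Decode a class index back to the centre of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width
```

Under these assumptions the worst-case discretization error is half a bin width, so it shrinks as the number of bins grows. This is one way to read the abstract's remark that increasing the bin count addresses discretization error.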
Image and Video Processing
Deep infant brain segmentation from multi-contrast MRI
Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to development and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile for more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.
comment: 8 pages, 8 figures, 1 table, website at https://w3id.org/babyseg, presented at the 2025 IEEE Asilomar Conference on Signals, Systems, and Computers
Multi Task Denoiser Training for Solving Linear Inverse Problems
Plug-and-Play Priors (PnP) and Regularisation by Denoising (RED) have established that image denoisers can effectively replace traditional regularisers in linear inverse problem solvers for tasks like super-resolution, demosaicing, and inpainting. It is now well established in the literature that a denoiser's residual links to the gradient of the image log prior (Miyasawa and Tweedie), enabling iterative, gradient ascent-based image generation (e.g., diffusion models), as well as new methods for solving inverse problems. Building on this, we propose enhancing Kadkhodaie and Simoncelli's gradient-based inverse solvers by fine-tuning the denoiser within the iterative solving process itself. Training the denoiser end-to-end across the solver framework and simultaneously across multiple tasks yields a single, versatile denoiser optimised for inverse problems. We demonstrate that even a simple baseline model fine-tuned this way achieves an average PSNR improvement of +1.34 dB across six diverse inverse problems while reducing the required iterations. Furthermore, we analyse the fine-tuned denoiser's properties, finding that its optimisation objective implicitly shifts from minimising standard denoising error (MMSE) towards approximating an ideal prior gradient specifically tailored for guiding inverse recovery.
comment: 9 pages, incl. 1 page references. Published at CVMP 2025
Structure-Aware Adaptive Kernel MPPCA Denoising for Diffusion MRI
Diffusion-weighted MRI (DWI) at high b-values often suffers from low signal-to-noise ratio (SNR), making image quality poor. Marchenko-Pastur PCA (MPPCA) is a popular method to reduce noise, but it uses a fixed patch size across the whole image, which doesn't work well in regions with different structures. To address this, we propose an adaptive kernel MPPCA (ak-MPPCA) that selects the best patch size for each voxel based on its local neighborhood. This improves denoising performance by better handling structural variations.
ARCAS: An Augmented Reality Collision Avoidance System with SLAM-Based Tracking for Enhancing VRU Safety
Vulnerable road users (VRUs) face high collision risks in mixed traffic, yet most existing safety systems prioritize driver or vehicle assistance over direct VRU support. This paper presents ARCAS, a real-time augmented reality collision avoidance system that provides personalized spatial alerts to VRUs via wearable AR headsets. By fusing roadside 360-degree 3D LiDAR with SLAM-based headset tracking and an automatic 3D calibration procedure, ARCAS accurately overlays world-locked 3D bounding boxes and directional arrows onto approaching hazards in the user's passthrough view. The system also enables multi-headset coordination through shared world anchoring. Evaluated in real-world pedestrian interactions with e-scooters and vehicles (180 trials), ARCAS nearly doubled pedestrians' time-to-collision and increased counterparts' reaction margins by up to 4x compared to unaided-eye conditions. Results validate the feasibility and effectiveness of LiDAR-driven AR guidance and highlight the potential of wearable AR as a promising next-generation safety tool for urban mobility.
comment: 8 pages, 3 figures, 1 table
Two-Stage Camera Calibration Method for Multi-Camera Systems Using Scene Geometry
Calibration of multi-camera systems is a key task for accurate object tracking. However, it remains a challenging problem in real-world conditions, where traditional methods are not applicable due to the lack of accurate floor plans, physical access to place calibration patterns, or synchronized video streams. This paper presents a novel two-stage calibration method that overcomes these limitations. In the first stage, partial calibration of individual cameras is performed based on an operator's annotation of natural geometric primitives (parallel, perpendicular, and vertical lines, or line segments of equal length). This allows estimating key parameters (roll, pitch, focal length) and projecting the camera's Effective Field of View (EFOV) onto the horizontal plane in a base 3D coordinate system. In the second stage, precise system calibration is achieved through interactive manipulation of the projected EFOV polygons. The operator adjusts their position, scale, and rotation to align them with the floor plan or, in its absence, using virtual calibration elements projected onto all cameras in the system. This determines the remaining extrinsic parameters (camera position and yaw). Calibration requires only a static image from each camera, eliminating the need for physical access or synchronized video. The method is implemented as a practical web service. Comparative analysis and demonstration videos confirm the method's applicability, accuracy, and flexibility, enabling the deployment of precise multi-camera tracking systems in scenarios previously considered infeasible.
Towards Modality- and Sampling-Universal Learning Strategies for Accelerating Cardiovascular Imaging: Summary of the CMRxRecon2024 Challenge
Cardiovascular health is vital to human well-being, and cardiac magnetic resonance (CMR) imaging is considered the clinical reference standard for diagnosing cardiovascular disease. However, its adoption is hindered by long scan times, complex contrasts, and inconsistent quality. While deep learning methods perform well on specific CMR imaging sequences, they often fail to generalize across modalities and sampling schemes. The lack of benchmarks for high-quality, fast CMR image reconstruction further limits technology comparison and adoption. The CMRxRecon2024 challenge, attracting over 200 teams from 18 countries, addressed these issues with two tasks: generalization to unseen modalities and robustness to diverse undersampling patterns. We introduced the largest public multi-modality CMR raw dataset, an open benchmarking platform, and shared code. Analysis of the best-performing solutions revealed that prompt-based adaptation and enhanced physics-driven consistency enabled strong cross-scenario performance. These findings establish principles for generalizable reconstruction models and advance clinically translatable AI in cardiovascular imaging.
comment: 15 pages
BrainPath: A Biologically-Informed AI Framework for Individualized Aging Brain Generation
The global population is aging rapidly, and aging is a major risk factor for various diseases. It is an important task to predict how each individual's brain will age, as the brain supports many human functions. This capability can greatly facilitate healthcare automation by enabling personalized, proactive intervention and efficient healthcare resource allocation. However, this task is extremely challenging because of the brain's complex 3D anatomy. While there have been successes in natural image generation and brain MRI synthesis, existing methods fall short in generating individualized, anatomically faithful aging brain trajectories. To address these gaps, we propose BrainPath, a novel AI model that, given a single structural MRI of an individual, generates synthetic longitudinal MRIs that represent that individual's expected brain anatomy as they age. BrainPath introduces three architectural innovations: an age-aware encoder with biologically grounded supervision, a differential age conditioned decoder for anatomically faithful MRI synthesis, and a swap-learning strategy that implicitly separates stable subject-specific anatomy from aging effects. We further design biologically informed loss functions, including an age calibration loss and an age and structural perceptual loss, to complement the conventional reconstruction loss. This enables the model to capture subtle, temporally meaningful anatomical changes associated with aging. We apply BrainPath to two of the largest public aging datasets and conduct a comprehensive, multifaceted evaluation. Our results demonstrate BrainPath's superior performance in generation accuracy, anatomical fidelity, and cross-dataset generalizability, outperforming competing methods.
Deep Feature-specific Imaging
Modern photon-counting sensors are increasingly dominated by Poisson noise, yet conventional Feature-Specific Imaging (FSI) is optimized for additive Gaussian noise, leading to suboptimal performance and a loss of its advantages under Poisson noise. To address this, we introduce DeepFSI, a novel end-to-end optical-electronic framework. DeepFSI "unfreezes" traditional FSI masks, enabling a deep neural network to learn globally optimal measurement masks by computing gradients directly under realistic Poisson and additive noise conditions. Our simulations demonstrate DeepFSI's superior feature fidelity and task performance compared to conventional FSI with predefined masks, especially in Poisson-Noise-dominant environments. DeepFSI also exhibits enhanced robustness to design choices and performs well under additive Gaussian noise, representing a significant advance for noise-robust computational imaging in photon-limited applications.
TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
A multi-dynamic low-rank deep image prior (ML-DIP) for 3D real-time cardiovascular MRI
Purpose: To develop a reconstruction framework for 3D real-time cine cardiovascular magnetic resonance (CMR) from highly undersampled data without requiring fully sampled training datasets. Methods: We developed a multi-dynamic low-rank deep image prior (ML-DIP) framework that models spatial image content and deformation fields using separate neural networks. These sub-networks are jointly trained per scan to reconstruct the dynamic image series directly from undersampled k-space data. ML-DIP was evaluated on (i) a 3D cine digital phantom with simulated premature ventricular contractions (PVCs), (ii) ten healthy subjects (including two scanned during both rest and exercise), and (iii) 12 patients with a history of PVCs. Phantom results were assessed using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). In vivo performance was evaluated by comparing left-ventricular function quantification (against 2D real-time cine) and image quality (against 2D real-time cine and binning-based 5D-Cine). Results: In the phantom study, ML-DIP achieved PSNR > 29 dB and SSIM > 0.90 for scan times as short as two minutes, while recovering cardiac motion, respiratory motion, and PVC events. In healthy subjects, ML-DIP yielded functional measurements comparable to 2D cine and higher image quality than 5D-Cine, including during exercise with high heart rates and bulk motion. In PVC patients, ML-DIP preserved beat-to-beat variability and reconstructed irregular beats, whereas 5D-Cine showed motion artifacts and information loss due to binning. Conclusion: ML-DIP enables high-quality 3D real-time CMR with acceleration factors exceeding 1,000 by learning low-rank spatial and motion representations from undersampled data, without relying on external fully sampled training datasets.
Generative MR Multitasking with complex-harmonic cardiac encoding: Bridging the gap between gated imaging and real-time imaging
Purpose: To develop a unified image reconstruction framework that bridges real-time and gated cardiac MRI, including quantitative MRI. Methods: We introduce Generative Multitasking, which learns an implicit neural temporal basis from sequence timings and an interpretable latent space for cardiac and respiratory motion. Cardiac motion is modeled as a complex harmonic, with phase encoding timing and a latent amplitude capturing beat-to-beat functional variability, linking cardiac phase-resolved ("gated-like") and time-resolved ("real-time-like") views. We implemented the framework using a conditional variational autoencoder (CVAE) and evaluated it for free-breathing, non-ECG-gated radial GRE in three settings: steady-state cine imaging, multicontrast T2prep/IR imaging, and dual-flip-angle T1/T2 mapping, compared with conventional Multitasking. Results: Generative Multitasking provided flexible cardiac motion representation, enabling reconstruction of archetypal cardiac phase-resolved cines (like gating) as well as time-resolved series that reveal beat-to-beat variability (like real-time imaging). Conditioning on the previous k-space angle and modifying this term at inference removed eddy-current artifacts without globally smoothing high temporal frequencies. For quantitative mapping, Generative Multitasking reduced intraseptal T1 and T2 coefficients of variation compared with conventional Multitasking (T1: 0.13 vs. 0.31; T2: 0.12 vs. 0.32; p<0.001), indicating higher SNR. Conclusion: Generative Multitasking uses a CVAE with complex harmonic cardiac coordinates to unify gated and real-time CMR within a single free-breathing, non-ECG-gated acquisition. It allows flexible cardiac motion representation, suppresses trajectory-dependent artifacts, and improves T1 and T2 mapping, suggesting a path toward cine, multicontrast, and quantitative imaging without separate gated and real-time scans.
comment: Submitted to Magnetic Resonance in Medicine; 21 pages, 7 figures
Multimedia
What is Beyond Presence? Dimensionality, Control, and Information Spaces
What is after presence? Spatial presence, the sense of "being there", is becoming less of a primary objective and more of a baseline expectation of virtual reality. More than six decades after its invention, VR is shifting from a technical system into a cultural, social, and phenomenological medium, offering experiences that function as distinct modes of reality. Existing theories that focus primarily on perceptual illusions are no longer sufficient to account for these emerging forms of experience. A new framework is needed to guide the design and evaluation of immersive environments by identifying the key technical and abstract dimensions afforded by virtual worlds. These dimensions include spatial, placeness, temporal, social, cultural, cognitive, and psychological parameters. The central argument is that immersive environments must move beyond the technical dimension to leverage richer information channels that shape user experience. This shift from presence to experience orchestration invites creators across disciplines to contribute to the design and assessment of meaningful immersive worlds.
comment: 38 pages, accepted for Presence: Virtual and Augmented Reality 2026(37)
DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation
Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video-synchronized audio, including Kling-foley, HunyuanVideo-foley and Thinksound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among video, audio, and text modalities. Our approach features a dual-visual encoder module that effectively captures both audio-aligned and text-aligned visual features. Additionally, we employ a Residual Vector Quantization audio tokenizer with a delay-pattern generation scheme to balance the trade-off between training efficiency and audio quality. Moreover, we introduce the classifier-free guidance strategy into VLMs to bootstrap generated audio quality. Furthermore, we establish an efficient data production pipeline to scale audio-video-text triple collection. Finally, extensive experiments are conducted to validate the effectiveness of our model, achieving promising performance across popular benchmarks. We hope that the findings in this study provide a strong foundation for future video-to-audio generation research. We also release the previously missing audio-visual textual descriptions from the public benchmark, aiming to facilitate subsequent researchers in conducting more convenient and effective evaluations and comparisons.
comment: 10 pages; Bytedance
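The abstract above mentions applying classifier-free guidance (CFG) to an autoregressive model. A minimal sketch of the commonly used CFG combination rule on next-token logits follows. The function name and the choice to apply the rule directly to logits are assumptions for illustration; the abstract does not describe DreamFoley's exact formulation.

```python
def cfg_logits(conditional, unconditional, guidance_scale):
    """Combine conditional and unconditional logits as u + s * (c - u).

    guidance_scale = 0 returns the unconditional logits, 1 returns the
    conditional logits, and values above 1 extrapolate past the conditional ones.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(conditional, unconditional)]
```

In practice this requires two forward passes per decoding step, one with the conditioning (here, video and text) and one with it dropped or masked.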
"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conduct a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find that GPT-4o achieves the highest task success rate. We further conduct a user study to reveal concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.
comment: 17 pages
Multimodal Markup Document Models for Graphic Design Completion
We introduce MarkupDM, a multimodal markup document model that represents graphic design as an interleaved multimodal document consisting of both markup language and images. Unlike existing holistic approaches that rely on an element-by-attribute grid representation, our representation accommodates variable-length elements, type-dependent attributes, and text content. Inspired by fill-in-the-middle training in code generation, we train the model to complete the missing part of a design document from its surrounding context, allowing it to treat various design tasks in a unified manner. Our model also supports image generation by predicting discrete image tokens through a specialized tokenizer with support for image transparency. We evaluate MarkupDM on three tasks, attribute value, image, and text completion, and demonstrate that it can produce plausible designs consistent with the given context. To further illustrate the flexibility of our approach, we evaluate our approach on a new instruction-guided design completion task where our instruction-tuned MarkupDM compares favorably to state-of-the-art image editing models, especially in textual completion. These findings suggest that multimodal language models with our document representation can serve as a versatile foundation for broad design automation.
comment: Accepted by ACM Multimedia 2025, Project page: https://cyberagentailab.github.io/MarkupDM/
Computation and Language
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
comment: Project Page: https://github.com/CaraJ7/DraCo
Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems, such as math and programming. However, RLVR is limited by several bottlenecks, such as the lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct response and the most common incorrect response are selected and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which the student model tries to match in the training phase from the bare question alone. In our experiments, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show accuracy improvements of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
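The logit-matching step described above can be sketched as a per-token distillation loss. This sketch assumes the student matches the teacher through a KL divergence between softmax distributions, averaged over the sequence. The abstract does not state the exact loss, and the function names and the `temperature` parameter are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sequence_distillation_loss(teacher_seq, student_seq, temperature=1.0):
    """Mean per-token KL over a sequence of logit vectors."""
    per_token = [kl_divergence(t, s, temperature)
                 for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)
```

In this reading, the teacher logits come from the model prompted with the correct and incorrect rollouts in context, while the student logits come from the same model given only the bare question.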
Structured Document Translation via Format Reinforcement Learning
Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees, and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics, and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
comment: IJCNLP-AACL 2025 Main (Oral)
Multi-LLM Collaboration for Medication Recommendation
As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability in medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.
comment: 8 pages, 5 figures, 1 table
Arbitrage: Efficient Reasoning via Advantage-Aware Speculation
Modern Large Language Models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
comment: 22 pages
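The routing idea in the Arbitrage abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the callable interfaces (`draft_step`, `target_step`, `router`), and the `threshold` parameter are all hypothetical, and the router is assumed to return a score that is higher when the target model is expected to produce a meaningfully better step.

```python
from typing import Callable, List, Tuple


def arbitrage_generate(
    prompt: str,
    draft_step: Callable[[str], str],      # hypothetical: cheap draft model, returns one reasoning step
    target_step: Callable[[str], str],     # hypothetical: expensive target model, returns one reasoning step
    router: Callable[[str, str], float],   # hypothetical: predicted advantage of target over the draft step
    num_steps: int,
    threshold: float = 0.5,                # assumed decision threshold on the router score
) -> Tuple[List[str], int]:
    """Generate `num_steps` reasoning steps, calling the target model only when
    the router predicts it is likely to beat the draft's proposed step."""
    steps: List[str] = []
    target_calls = 0
    for _ in range(num_steps):
        context = prompt + "".join(steps)
        proposal = draft_step(context)
        # Escalate to the target model only when the predicted advantage is large enough.
        if router(context, proposal) > threshold:
            proposal = target_step(context)
            target_calls += 1
        steps.append(proposal)
    return steps, target_calls
```

In this sketch, the number of target calls is the quantity a well-trained router would keep low; a fixed acceptance threshold on step quality would instead trigger a target call for every rejected step, whether or not the regenerated step is better.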
Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking
This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
comment: This work was presented as a poster at the Applied Social Media Lab during the 2025 Synthesizer & Open Showcase at the Berkman Klein Center for Internet & Society at Harvard University
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environments for grounded trajectory synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.
LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics
Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language-related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.
CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent
Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action contributes equally, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training by providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model updates. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
comment: 10 pages, 4 figures
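The idea of excluding low-criticality actions from the update, as described in the CARL abstract, can be sketched as a masked policy-gradient loss. This is an illustrative assumption, not the authors' algorithm: the abstract does not say how criticality is scored or thresholded, so `criticality` and `threshold` below are hypothetical inputs, and the loss is a plain REINFORCE-style surrogate.

```python
from typing import Sequence


def masked_policy_loss(
    log_probs: Sequence[float],     # log-probability of each taken action
    advantages: Sequence[float],    # action-level advantage estimates (assumed given)
    criticality: Sequence[float],   # hypothetical per-action criticality scores
    threshold: float,               # hypothetical cutoff separating high from low criticality
) -> float:
    """Average the surrogate loss -log_prob * advantage over high-criticality
    actions only; low-criticality actions contribute nothing to the update."""
    kept = [
        (lp, adv)
        for lp, adv, crit in zip(log_probs, advantages, criticality)
        if crit >= threshold
    ]
    if not kept:
        return 0.0  # no critical action in this trajectory, so no update signal
    return -sum(lp * adv for lp, adv in kept) / len(kept)
```

For example, with log-probabilities `[-1.0, -2.0, -0.5]`, advantages `[1.0, 2.0, -1.0]`, criticality `[0.9, 0.1, 0.8]`, and a threshold of `0.5`, only the first and third actions enter the loss.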
Algorithmic Thinking Theory
Large language models (LLMs) have proven to be highly effective for solving complex reasoning tasks. Surprisingly, their capabilities can often be improved by iterating on previously generated solutions. In this context, a reasoning plan for generating and combining a set of solutions can be thought of as an algorithm for reasoning using a probabilistic oracle. We introduce a theoretical framework for analyzing such reasoning algorithms. This framework formalizes the principles underlying popular techniques for iterative improvement and answer aggregation, providing a foundation for designing a new generation of more powerful reasoning methods. Unlike approaches for understanding models that rely on architectural specifics, our model is grounded in experimental evidence. As a result, it offers a general perspective that may extend to a wide range of current and future reasoning oracles.
The AI Consumer Index (ACE)
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions
Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance information for raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix-prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
comment: This work has been submitted to the IEEE for possible publication
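The STELLA abstract describes decoupling each input series into trend, seasonality, and residual components. The sketch below shows one classical way to do such a split, an additive decomposition with a centered moving average. It is a stand-in under stated assumptions, not STELLA's dynamic semantic abstraction mechanism; the function name and the `period` parameter are hypothetical, and the series is assumed to contain at least `period` points.

```python
from typing import List, Sequence, Tuple


def decompose(series: Sequence[float], period: int) -> Tuple[List[float], List[float], List[float]]:
    """Additive decomposition: series[i] = trend[i] + seasonal[i] + residual[i]."""
    n = len(series)
    half = period // 2

    # Trend: centered moving average, with the window truncated at the edges.
    trend = []
    for i in range(n):
        window = series[max(0, i - half): min(n, i + half + 1)]
        trend.append(sum(window) / len(window))

    detrended = [x - t for x, t in zip(series, trend)]

    # Seasonality: mean of the detrended values at each phase of the cycle.
    phase_means = []
    for phase in range(period):
        values = detrended[phase::period]
        phase_means.append(sum(values) / len(values))
    seasonal = [phase_means[i % period] for i in range(n)]

    # Residual: whatever trend and seasonality do not explain.
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual
```

Summary statistics of these components (for example, the slope of the trend or the amplitude of the seasonal cycle) are the kind of behavioral features that could then be rendered as text for a prompt.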
SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs
Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, a large language model (LLM) extracts a minimal S-expression core that captures the essential semantics of the input query. This core is then refined by an agentic calibration module, which corrects syntactic inconsistencies and aligns entities and relations precisely with the underlying knowledge graph. The second stage employs template-based completion, guided by question-type prediction and placeholder instantiation, to construct a fully executable S-expression. This decomposition not only simplifies logical form generation but also significantly enhances structural fidelity and linking efficiency. Crucially, SEAL incorporates a self-evolving mechanism that integrates local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. The results validate notable gains in both structural accuracy and computational efficiency, underscoring the framework's capacity for robust and scalable conversational reasoning.
Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.
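The column-wise freezing step in the SSU abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' procedure: the abstract does not specify the importance score or the fraction of columns frozen, so the per-parameter `importance` matrix, the column aggregation by summation, and `freeze_ratio` are all hypothetical.

```python
from typing import List, Sequence


def column_trainable_mask(importance: Sequence[Sequence[float]], freeze_ratio: float) -> List[bool]:
    """Return one flag per column of a weight matrix: False for the columns
    with the highest total importance (frozen), True for the rest (trainable)."""
    rows, cols = len(importance), len(importance[0])
    # Aggregate per-parameter importance into one score per column (assumed: sum).
    col_scores = [sum(importance[r][c] for r in range(rows)) for c in range(cols)]
    num_frozen = int(freeze_ratio * cols)
    ranked = sorted(range(cols), key=lambda c: col_scores[c], reverse=True)
    frozen = set(ranked[:num_frozen])
    return [c not in frozen for c in range(cols)]


def shield_gradient(grad: Sequence[Sequence[float]], trainable: Sequence[bool]) -> List[List[float]]:
    """Zero the gradient of frozen columns so they are untouched by the update."""
    return [[g if trainable[c] else 0.0 for c, g in enumerate(row)] for row in grad]
```

Computing the mask once from source data, before any target-language update, matches the abstract's description of protecting parameters before adaptation rather than correcting drift afterwards.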
DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution
In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is, identifying transition points in text where authorship shifts from human to AI or vice versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask, for mixed-authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset, Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA) overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.
comment: 18 pages, 10 Figures
Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case
Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors like healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this in the past, they often fall short due to the complexity and variability of clinical language and the high inner semantics of free clinical text. Recently, LLMs have become a powerful tool for better understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand EHRs (Electronic Health Records) in Italian and help extract information from them in real time. Our detailed experimental campaign on comorbidity extraction from EHRs reveals that some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.
DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors
We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, which makes it easier to distinguish well-performing models from low-performing ones.
AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages
Large language models and multilingual machine translation (MT) systems increasingly drive access to information, yet many languages of the tribal communities remain effectively invisible in these technologies. This invisibility exacerbates existing structural inequities in education, governance, and digital participation. We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages: Bhili, Mundari, Gondi, and Santali. This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models. In addition to reporting technical findings, we articulate how AdiBhashaa illustrates a possible model for more equitable AI research: it centers local expertise, builds capacity among early-career researchers from marginalized communities, and foregrounds human validation in the development of language technologies.
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) into memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for a specific memory operation: knowledge extraction, memory update, or memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10$\times$ larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60$\times$ larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
Challenging the Abilities of Large Language Models in Italian: a Community Initiative
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
EtCon: Edit-then-Consolidate for Reliable Knowledge Editing
Knowledge editing aims to update specific facts in large language models (LLMs) without full retraining. Prior efforts sought to tune the knowledge layers of LLMs, proving effective for making selective edits. However, a significant gap exists between their performance in controlled, teacher-forcing evaluations and their real-world effectiveness in lifelong learning scenarios, which greatly limits their practical applicability. This work's empirical analysis reveals two recurring issues associated with this gap: (1) Most traditional methods lead the edited model to overfit to the new fact, thereby degrading pre-trained capabilities; (2) There is a critical absence of a knowledge consolidation stage, leaving new facts insufficiently integrated into LLMs' inference-time behavior under autoregressive generation, thereby leading to a mismatch between parametric knowledge and actual generation behavior. To this end, we propose Edit-then-Consolidate, a novel knowledge editing paradigm that aims to bridge the gap between theoretical knowledge editing methods and their real-world applicability. Specifically, (1) our framework mitigates overfitting via Targeted Proximal Supervised Fine-Tuning (TPSFT) that localizes the edit via a trust-region objective to limit policy drift; (2) Then, a consolidation stage using Group Relative Policy Optimization (GRPO) aligns the edited knowledge with CoT-based inference policy by optimizing trajectory-level behavior under comprehensive reward signals. Extensive experiments demonstrate our framework consistently improves editing reliability and generalization under real-world evaluations, while better preserving locality and pre-trained capabilities.
Model Whisper: Steering Vectors Unlock Large Language Models' Potential in Test-time
It is a critical challenge to efficiently unlock the powerful reasoning potential of Large Language Models (LLMs) for specific tasks or new distributions. Existing test-time adaptation methods often require tuning model parameters, which is not only computationally expensive but also risks degrading the model's pre-existing abilities. To address this, we introduce a lightweight component, Test-Time Steering Vectors (TTSV), which is prepended to the input while keeping the LLM's parameters entirely frozen. By optimizing the TTSV on test data to minimize the model's output entropy, we steer the model towards an internal state of higher confidence, activating its inherent abilities most relevant to the current task. TTSV is both lightweight and highly efficient to optimize, making it a true plug-and-play enhancement. Extensive experiments validate our approach's effectiveness on both base models and reasoning-enhanced models. For instance, on the MATH500 task, TTSV achieves a 45.88% relative performance gain on the Qwen2.5-Math-7B model and a 16.22% relative gain on the Qwen3-4B model. Furthermore, our approach exhibits robust generalization, with its steering vectors proving highly transferable across diverse tasks.
comment: Accepted to AAAI 2026
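The quantity that the Model Whisper abstract says TTSV minimizes, the model's output entropy, can be made concrete with a short sketch. It only computes the entropy of a next-token distribution from raw logits; the optimization of the steering vector itself is not shown, and the function name is hypothetical.

```python
import math
from typing import Sequence


def output_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits)."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform distribution over four tokens gives the maximum value, log 4, while a sharply peaked distribution gives a value near zero, which is the "higher confidence" state that entropy minimization pushes towards.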
SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models ICDE
Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.
comment: 42nd IEEE International Conference on Data Engineering (ICDE)
Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective AAAI'26
Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.
comment: Accepted to LaMAS 2026@AAAI'26 (https://sites.google.com/view/lamas2026)
Geschlechtsübergreifende Maskulina im Sprachgebrauch: Eine korpusbasierte Untersuchung zu lexemspezifischen Unterschieden (Generic Masculines in Language Use: A Corpus-Based Study of Lexeme-Specific Differences)
This study examines the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts. The use of masculine personal nouns to refer to mixed-gender groups or unspecified individuals has been widely debated in academia and the public, with conflicting perspectives on its gender-neutrality. While psycholinguistic studies suggest that GM is more readily associated with male referents, corpus-based analyses of its actual use remain scarce. We investigate GM in a large corpus of press texts, focusing on lexeme-specific differences across different types of personal nouns. We conducted manual annotations of the whole inflectional paradigm of 21 personal nouns, resulting in 6,195 annotated tokens. Our findings reveal considerable differences between lexical items, especially between passive role nouns and prestige-related personal nouns. On a grammatical level, we find that GM occurs predominantly in the plural and in indefinite noun phrases. Furthermore, our data shows that GM is not primarily used to denote entire classes of people, as has been previously claimed. By providing an empirical insight into the use of GM in authentic written language, we contribute to a more nuanced understanding of its forms and manifestations. These findings provide a solid basis for aligning linguistic stimuli in psycholinguistic studies more closely with real-world language use.
comment: 32 pages, 8 figures
Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs ACL
Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground-truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring), varying agent counts $n\in\{4,5,6\}$, attacker-target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker-target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker-target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology-aware access controls.
comment: Under review at ACL Rolling Review
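The leakage metric and the role of attacker-target distance described in the MAMA abstract can be sketched in a few lines. This is an illustration under assumptions rather than the framework's code: exact matching is approximated here by case-sensitive substring search, only three of the six topologies are built, and all function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Set


def leakage_rate(ground_truth_pii: Sequence[str], attacker_outputs: Sequence[str]) -> float:
    """Fraction of ground-truth PII entities that appear verbatim in attacker outputs."""
    if not ground_truth_pii:
        return 0.0
    text = "\n".join(attacker_outputs)
    recovered = sum(1 for entity in ground_truth_pii if entity in text)
    return recovered / len(ground_truth_pii)


def build_topology(name: str, n: int) -> Dict[int, Set[int]]:
    """Undirected adjacency sets for a chain, ring, or fully connected graph on n agents."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(n)}
    if name == "fully_connected":
        edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    elif name == "chain":
        edges = [(i, i + 1) for i in range(n - 1)]
    elif name == "ring":
        edges = [(i, (i + 1) % n) for i in range(n)]
    else:
        raise ValueError(f"unsupported topology: {name}")
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def graph_distance(adj: Dict[int, Set[int]], source: int, target: int) -> int:
    """Shortest-path length between two agents via breadth-first search (-1 if unreachable)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return -1
```

With five agents, an attacker at node 0 and a target at node 4 are four hops apart in a chain but adjacent in a ring or a fully connected graph, which is the kind of structural difference the abstract links to leakage.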
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
Limit cycles for speech
Rhythmic fluctuations in acoustic energy and accompanying neuronal excitations in cortical oscillations are characteristic of human speech, yet whether a corresponding rhythmicity inheres in the articulatory movements that generate speech remains unclear. The received understanding of speech movements as discrete, goal-oriented actions struggles to make contact with the rhythmicity findings. In this work, we demonstrate that an unintuitive -- but no less principled than the conventional -- representation for discrete movements reveals a pervasive limit cycle organization and unlocks the recovery of previously inaccessible rhythmic structure underlying the motor activity of speech. These results help resolve a time-honored tension between the ubiquity of biological rhythmicity and discreteness in speech, the quintessential human higher function, by revealing a rhythmic organization at the most fundamental level of individual articulatory actions.
Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space
Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
comment: 22 pages, 4 figures
LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence
Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use recent legal cases and exam questions to create multiple-choice questions, combining manual and LLM reviews to reduce data leakage risks and ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.
ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning
We propose ADAPT, a meta-learning algorithm that learns task sampling proportions under an explicit token budget for multi-task instruction tuning. Instead of fixing task weights by hand, ADAPT maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, inducing an adaptive curriculum that allocates more tokens to useful tasks while avoiding collapse. We instantiate ADAPT on three $\sim$1B-parameter open-weight LLMs (Gemma-3-1B, LLaMA-3.2-1B, Qwen-0.6B), training on 20 Natural Instructions task types under budgets of $1\%$, $5\%$, and $10\%$ of the available supervised tokens, and compare against strong supervised fine-tuning baselines with uniform and size-proportional mixing. Evaluating on 11 out-of-domain benchmarks spanning reasoning, reading comprehension, code generation, and instruction following, we find that ADAPT matches or slightly improves average downstream performance relative to the best static mixture, while using fewer effective training tokens and reallocating budget toward harder, benchmark-aligned tasks.
comment: Under Review
AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees
The quadratic complexity of self-attention constrains Large Language Models (LLMs) in processing long contexts, a capability essential for many advanced applications. Context compression aims to alleviate this computational bottleneck while retaining critical semantic information. However, existing approaches often fall short: explicit methods may compromise local detail, whereas implicit methods can suffer from positional biases, information degradation, or an inability to capture long-range semantic dependencies. We propose AdmTree, a novel framework for adaptive, hierarchical context compression with a central focus on preserving high semantic fidelity while maintaining efficiency. AdmTree dynamically segments input based on information density, utilizing gist tokens to summarize variable-length segments as the leaves of a semantic binary tree. This structure, together with a lightweight aggregation mechanism and a frozen backbone LLM (thereby minimizing new trainable parameters), enables efficient hierarchical abstraction of the context. By preserving fine-grained details alongside global semantic coherence, mitigating positional bias, and dynamically adapting to content, AdmTree robustly retains the semantic information of long contexts.
comment: NeurIPS 2025
EvoEdit: Lifelong Free-Text Knowledge Editing through Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion
Adjusting the outdated knowledge of large language models (LLMs) after deployment remains a major challenge. This difficulty has spurred the development of knowledge editing, which seeks to accurately and efficiently modify a model's internal (parametric) knowledge without retraining it from scratch. However, existing methods suffer from two limitations. First, they depend on structured triplets that are misaligned with the free-text nature of LLM pretraining and fail to capture the nuanced relationships among facts. Second, they typically support one-time knowledge updates, with relatively limited research on the problem of sequential or lifelong editing. To address these gaps, we propose a new task, Lifelong Free-text Knowledge Editing (LF-Edit), which enables models to incorporate updates expressed in natural language and supports continual editing over time. Despite its promise, LF-Edit faces the dual challenge of integrating new knowledge while mitigating the forgetting of prior information. To foster research on this new task, we construct a large-scale benchmark, Multi-Rank Lifelong Free-text Editing Benchmark (MRLF-Bench), containing 16,835 free-text edit requests. We further design a cognitively inspired multi-rank evaluation framework encompassing four levels: memorization, understanding, constrained comprehension, and reasoning. To tackle the challenges inherent in LF-Edit, we introduce a novel approach named EvoEdit that enhances knowledge injection through Latent Perturbation Augmentation and preserves prior information via Knowledge-driven Parameter Fusion. Experimental results demonstrate that EvoEdit substantially outperforms existing knowledge editing methods on the proposed LF-Edit task.
UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction
The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. The methods differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performance on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts at this task as well as for the design of similar tasks.
comment: To be published in Proceedings of the 7th Clinical Natural Language Processing Workshop
MSME: A Multi-Stage Multi-Expert Framework for Zero-Shot Stance Detection
LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author's actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules: a Knowledge Expert that distills salient facts and reasons from a knowledge perspective, a Label Expert that refines stance labels and reasons accordingly, and a Pragmatic Expert that detects rhetorical cues such as irony to infer intent from a pragmatic angle; (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.
RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning
Removing specific data influence from large language models (LLMs) remains challenging, as retraining is costly and existing approximate unlearning methods are often unstable. The challenge is exacerbated when the forget set is small or imbalanced. We introduce RapidUn, an influence-driven and parameter-efficient unlearning framework. It first estimates per-sample influence through a fast estimation module, then maps these scores into adaptive update weights that guide selective parameter updates -- forgetting harmful behavior while retaining general knowledge. On Mistral-7B and Llama-3-8B across Dolly-15k and Alpaca-57k, RapidUn achieves up to 100 times higher efficiency than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting. These results establish influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.
comment: Code available at: https://github.com/eyerf/RapidUn
Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering
Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform best, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.
comment: 11 pages, 2 figures, includes full Python code. Classical machine learning baseline for sarcasm detection on the SARC 2.0 dataset
MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation
Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model's internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE's superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.
LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving
Our work presents a novel reinforcement learning (RL) based framework to optimize heuristic selection within the conflict-driven clause learning (CDCL) process, improving the efficiency of Boolean satisfiability (SAT) solving. The proposed system, LangSAT, bridges the gap between natural language inputs and propositional logic by converting English descriptions into Conjunctive Normal Form (CNF) expressions and solving them using an RL-enhanced CDCL SAT solver. Unlike existing SAT-solving platforms that require CNF as input, LangSAT enables users to input standard English descriptions, making SAT-solving more accessible. The framework comprises two key components: Lang2Logic, which translates English sentences into CNF expressions, and SmartSAT, an RL-based SAT solver. SmartSAT encodes clause-variable relationships as structured graph representations and extracts global features specific to the SAT problem. This implementation provides the RL agent with deeper contextual information, enabling SAT problems to be solved more efficiently. Lang2Logic was evaluated on diverse natural language inputs, processing descriptions up to 450 words. The generated CNFs were solved by SmartSAT, which demonstrated comparable performance to traditional CDCL heuristics with respect to solving time. The combined LangSAT framework offers a more accessible and scalable solution for SAT-solving tasks across reasoning, formal verification, and debugging.
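The clause-variable graph encoding and global features mentioned in the LangSAT abstract can be pictured with a small sketch. It assumes a CNF formula stored as a list of clauses, each a list of signed integers in the DIMACS convention; the edge format and the particular feature set are hypothetical choices for this example, not SmartSAT's actual representation.

```python
def clause_variable_graph(cnf):
    """Return bipartite edges (clause_index, variable, is_positive) for a CNF formula."""
    edges = []
    for clause_index, clause in enumerate(cnf):
        for literal in clause:
            edges.append((clause_index, abs(literal), literal > 0))
    return edges


def global_features(cnf):
    """Compute a few formula-level statistics an RL agent could condition on."""
    variables = {abs(literal) for clause in cnf for literal in clause}
    num_clauses = len(cnf)
    num_variables = len(variables)
    total_literals = sum(len(clause) for clause in cnf)
    return {
        "num_variables": num_variables,
        "num_clauses": num_clauses,
        "clause_variable_ratio": num_clauses / num_variables if num_variables else 0.0,
        "mean_clause_length": total_literals / num_clauses if num_clauses else 0.0,
    }


if __name__ == "__main__":
    # "A or B" and "not A", with A = 1 and B = 2
    formula = [[1, 2], [-1]]
    print(clause_variable_graph(formula))
    print(global_features(formula))
```

Each edge links a clause node to a variable node and records the literal's polarity, so the same structure could be handed to a graph-based feature extractor without re-parsing the formula.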
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.
comment: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. Project page: https://kpc0810.github.io/santa/
ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation
Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.
To Think or Not to Think: The Hidden Cost of Meta-Training with Excessive CoT Examples
Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy.
comment: 26 pages, 45 figures, 3 tables
Enhancing Clinical Note Generation with ICD-10, Clinical Ontology Knowledge Graphs, and Chain-of-Thought Prompting Using GPT-4
The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients' assessments, diagnoses, and treatments are captured in these EHRs in free-form text by physicians, who spend a considerable amount of time entering and editing them. Manually writing clinical notes takes a considerable amount of a doctor's valuable time, increasing the patient's waiting time and possibly delaying diagnoses. Large language models (LLMs) possess the ability to generate news articles that closely resemble human-written ones. We investigate the use of Chain-of-Thought (CoT) prompt engineering to improve the LLM's response in clinical note generation. Our prompts take International Classification of Diseases (ICD) codes and basic patient information as input. We investigate a strategy that combines the traditional CoT with semantic search results to improve the quality of generated clinical notes. Additionally, we infuse a knowledge graph (KG) built from clinical ontology to further enrich the domain-specific knowledge of generated clinical notes. We test our prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4, and our results show that it outperforms the clinical notes generated by standard one-shot prompts.
Decoding the Black Box: Discerning AI Rhetorics About and Through Poetic Prompting
Prompt engineering has emerged as a useful way of studying the algorithmic tendencies and biases of large language models. Meanwhile, creatives and academics have leveraged LLMs to develop creative works and explore the boundaries of their writing capabilities through text generation and code. This study suggests that creative text prompting, specifically Poetry Prompt Patterns, may be a useful addition to the toolbox of the prompt engineer, and outlines the process by which this approach may be taken. The paper then uses poetic prompts to assess three models' descriptions and evaluations of a renowned poet and to test the consequences of the models' willingness to adapt or rewrite original creative works for presumed audiences.
comment: Late-Breaking Paper accepted to IEEE SSCI 2025 NLP & Social Media Track as extended abstract and presented in Trondheim, Norway 17-20 March 2025 as Poster Presentation
Unveiling Affective Polarization Trends in Parliamentary Proceedings
Recent years have seen an increase in polarized discourse worldwide, on various platforms. We propose a novel method for quantifying polarization, based on the emotional style of the discourse rather than on differences in ideological stands. Using measures of Valence, Arousal and Dominance, we detect signals of emotional discourse and use them to operationalize the concept of affective polarization. Applying this method to a recently released corpus of proceedings of the Knesset, the Israeli parliament (in Hebrew), we find that the emotional style of members of government differs from that of opposition members; and that the level of affective polarization, as reflected by this style, is significantly increasing with time.
comment: pre-MIT Press publication version
On the Computability of Artificial General Intelligence
In recent years we have observed rapid and significant advancements in artificial intelligence (A.I.), so much so that many wonder how close humanity is to developing an A.I. model that can achieve human-level intelligence, also known as artificial general intelligence (A.G.I.). In this work we look at this question and attempt to define the upper bounds, not just of A.I., but of any machine-computable process (a.k.a. an algorithm). To answer this question, however, one must first precisely define A.G.I. We borrow a definition of A.G.I. from prior work [1] that best describes the sentiment of the term, as used by the leading developers of A.I.: the ability to be creative and innovate in some field of study in a way that unlocks new and previously unknown functional capabilities in that field. Based on this definition we draw new bounds on the limits of computation. We formally prove that no algorithm can demonstrate new functional capabilities that were not already present in the initial algorithm itself. Therefore, no algorithm (and thus no A.I. model) can be truly creative in any field of study, whether that is science, engineering, art, sports, etc. In contrast, A.I. models can demonstrate existing functional capabilities, as well as combinations and permutations of existing functional capabilities. We conclude this work by discussing the implications of this proof, both for the future of A.I. development and for what it means for the origins of human intelligence.
Fine-Tuning BERT for Domain-Specific Question Answering: Toward Educational NLP Resources at University Scale
Prior work on scientific question answering has largely emphasized chatbot-style systems, with limited exploration of fine-tuning foundation models for domain-specific reasoning. In this study, we developed a chatbot for the University of Limerick's Department of Electronic and Computer Engineering to provide course information to students. A custom dataset of 1,203 question-answer pairs in SQuAD format was constructed using the university book of modules, supplemented with manually and synthetically generated entries. We fine-tuned BERT (Devlin et al., 2019) using PyTorch and evaluated performance with Exact Match and F1 scores. Results show that even modest fine-tuning improves hypothesis framing and knowledge extraction, demonstrating the feasibility of adapting foundation models to educational domains. While domain-specific BERT variants such as BioBERT and SciBERT exist for biomedical and scientific literature, no foundation model has yet been tailored to university course materials. Our work addresses this gap by showing that fine-tuning BERT with academic QA pairs yields effective results, highlighting the potential to scale towards the first domain-specific QA model for universities and enabling autonomous educational knowledge systems.
comment: 4 pages, 2 figures
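For readers unfamiliar with the Exact Match and F1 scores used to evaluate SQuAD-format question answering, the following sketch shows how they are commonly computed. It follows the usual SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the example strings are hypothetical and not drawn from the paper's dataset.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation, drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Module.", "module"))
    print(f1_score("five credits", "five ECTS credits"))
```

Exact Match rewards only fully correct spans, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.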
Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations
Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context $C$ into answer $A$ via prompt $Q$. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from $C$ to $Q$ and $A$ are modeled as transition matrices ${\bf Q}$ and ${\bf A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval $[0,1]$, where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.
comment: 23 pages, 6 figures
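The final scoring step of a divergence-based faithfulness metric can be sketched as follows. This is only an illustration under stated assumptions: the two row-stochastic matrices are taken as given (the convex optimization that infers them is not shown), the divergence is computed row by row as KL(A || Q) weighted by a context topic distribution, and the map onto $[0,1]$ is taken to be $\exp(-D)$. None of these specific choices is confirmed by the abstract.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def matrix_kl(answer_matrix, query_matrix, context_weights):
    """Context-weighted sum of row-wise KL divergences between two transition matrices."""
    return sum(
        weight * kl_divergence(a_row, q_row)
        for weight, a_row, q_row in zip(context_weights, answer_matrix, query_matrix)
    )


def faithfulness_score(divergence):
    """Map a non-negative divergence onto (0, 1]; the exponential form is an assumption."""
    return math.exp(-divergence)


if __name__ == "__main__":
    context = [0.6, 0.4]              # hypothetical topic distribution of the context
    query = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical transition matrix for the query goal
    answer = [[0.7, 0.3], [0.2, 0.8]] # hypothetical transition matrix for the actual answer
    divergence = matrix_kl(answer, query, context)
    print(divergence, faithfulness_score(divergence))
```

Identical matrices give zero divergence and a score of 1, and the score decreases monotonically as the answer's topic transitions drift from those implied by the query.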
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets new state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 F1-score and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
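The consistency criterion described in the Athena-PRM abstract can be pictured as a simple filter: a step-level label is kept only when the label derived from a weak completer agrees with the one derived from a strong completer. The sketch below assumes the two label lists already exist as booleans (for example, from separate rollout-based estimates); the function name and data layout are hypothetical.

```python
def select_consistent_labels(steps, weak_labels, strong_labels):
    """Keep (step, label) pairs where the weak and strong completers agree."""
    if not (len(steps) == len(weak_labels) == len(strong_labels)):
        raise ValueError("steps and both label lists must have the same length")
    return [
        (step, weak)
        for step, weak, strong in zip(steps, weak_labels, strong_labels)
        if weak == strong
    ]


if __name__ == "__main__":
    steps = ["parse the figure", "set up the equation", "solve for x"]
    weak = [True, False, True]     # hypothetical labels from a weak completer
    strong = [True, True, True]    # hypothetical labels from a strong completer
    print(select_consistent_labels(steps, weak, strong))
```

Steps on which the two completers disagree are discarded rather than relabeled, which trades dataset size for label reliability.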
TreeRare: Syntax Tree-Guided Retrieval and Reasoning for Knowledge-Intensive Question Answering
In real practice, questions are typically complex and knowledge-intensive, requiring Large Language Models (LLMs) to recognize the multifaceted nature of the question and reason across multiple information sources. Iterative and adaptive retrieval, where LLMs decide when and what to retrieve based on their reasoning, has been shown to be a promising approach to resolve complex, knowledge-intensive questions. However, the performance of such retrieval frameworks is limited by the accumulation of reasoning errors and misaligned retrieval results. To overcome these limitations, we propose TreeRare (Syntax Tree-Guided Retrieval and Reasoning), a framework that utilizes syntax trees to guide information retrieval and reasoning for question answering. Following the principle of compositionality, TreeRare traverses the syntax tree in a bottom-up fashion, and in each node, it generates subcomponent-based queries and retrieves relevant passages to resolve localized uncertainty. A subcomponent question answering module then synthesizes these passages into concise, context-aware evidence. Finally, TreeRare aggregates the evidence across the tree to form a final answer. Experiments across five question answering datasets involving ambiguous or multi-hop reasoning demonstrate that TreeRare achieves substantial improvements over existing state-of-the-art methods.
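The bottom-up traversal described in the TreeRare abstract amounts to a post-order walk over the question's syntax tree, issuing a retrieval query at each node. The sketch below assumes a hand-built tree and a caller-supplied `retrieve` function; both are hypothetical stand-ins, and the subcomponent answering and final aggregation stages are reduced to collecting (query, passage) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class SyntaxNode:
    """A node in a simplified syntax tree; `text` is the phrase the node spans."""
    text: str
    children: list = field(default_factory=list)


def collect_evidence(node, retrieve):
    """Visit children before their parent and gather (query, retrieved passage) pairs."""
    evidence = []
    for child in node.children:
        evidence.extend(collect_evidence(child, retrieve))
    # The node's own phrase is used as the query (a simplifying assumption).
    evidence.append((node.text, retrieve(node.text)))
    return evidence


if __name__ == "__main__":
    tree = SyntaxNode(
        "who directed the film that won best picture in 1998",
        [
            SyntaxNode("the film that won best picture in 1998"),
            SyntaxNode("who directed"),
        ],
    )

    def fake_retrieve(query):
        # Stand-in for a real retriever; returns a placeholder passage.
        return f"passage about: {query}"

    for query, passage in collect_evidence(tree, fake_retrieve):
        print(query, "->", passage)
```

Because smaller phrases are resolved first, evidence for a sub-question is available by the time the enclosing question is processed, which reflects the compositionality principle the abstract refers to.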
SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to largely improve the model's structured output capability. We plan to make the benchmark available to the community.
comment: v2 preprint. Fixed some typos, added a discussion of limitations, and provided pseudocode for evaluation
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data
This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models -- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens -- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.
comment: EMMA-500 Gen 2; refer to Gen 1 in arXiv:2409.17892
QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA EMNLP 2025
Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting transparency enhances LLM safety.
comment: Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20619-20642, Suzhou, China
PUCP-Metrix: An Open-source and Comprehensive Toolkit for Linguistic Analysis of Spanish Texts EACL
Linguistic features remain essential for interpretability and tasks that involve style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source and comprehensive toolkit for linguistic analysis of Spanish texts. PUCP-Metrix includes 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. It enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive and extensible resource for Spanish, supporting diverse NLP applications.
comment: 1 figure, Submitted to EACL Demo track (under review)
LORE: A Large Generative Model for Search Relevance
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability. We release the MaLA corpus, EMMA-500 model weights, scripts, and model generations.
SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation
Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, which results in poor performance when translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets, with a BLEU-4 score of 22.1, 73.2% letter accuracy, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks ICLR 2026
Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, though with substantial variation: models reach high performance on some datasets while struggling notably on low-resource languages. Our human annotations also reveal multiple dataset issues. We additionally benchmark nine LLMs as annotators on reranking, classification, and STS tasks, finding that they fall short of human performance (76.1% vs. 81.2%) despite offering scalability advantages. We provide human performance baselines, insights into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of results and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
comment: Submitted to ICLR 2026
Large language models can learn and generalize steganographic chain-of-thought under process supervision NeurIPS 2025
Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. However, recent works have shown that banning the mention of a specific example of reward hacking causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior, threatening the reliability of CoT monitoring. We provide an extension to these results with regard to the ability of models to learn a specific type of obfuscated reasoning: steganography. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning. We further demonstrate that models can generalize an encoding scheme. When the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training, but also develops a general encoding scheme for all members of the class which it can apply to held-out testing strings.
comment: 10 pages main text, 3 figures main text, 17 pages supplementary material, 1 figure supplementary material, accepted at NeurIPS 2025
On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding
Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL, a novel reinforcement learning (RL) training framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. Complementing this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using groups of intermediate representations (IRs). LLMs can be guided to discern IRs equivalent to the source code from inequivalent ones, while also utilizing signals about the mutual equivalence between IRs within the group. This process allows LLMs to capture nuanced aspects of code functionality. By employing OORL for training with code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that training LLMs with OORL on code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.
Human Mobility Datasets Enriched With Contextual and Social Dimensions
In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.
comment: 5 pages, 3 figures, 1 table
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named ER Velocity and ER Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that in the semantic space, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
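As a rough illustration of the quantities named above, the sketch below computes effective rank using its common definition: the exponential of the Shannon entropy of the normalized singular values. It then approximates first- and second-order changes with simple finite differences. The abstract does not give the paper's exact formulas, so the differencing scheme and all function names here are assumptions.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def finite_differences(values):
    """Consecutive differences; applied twice gives a second-order change."""
    return [b - a for a, b in zip(values, values[1:])]


# Equal singular values give the full rank; one dominant value gives rank 1.
er_uniform = effective_rank([1.0, 1.0, 1.0, 1.0])
er_peaked = effective_rank([1.0, 0.0, 0.0, 0.0])

# Hypothetical ER readings at three consecutive training steps.
er_series = [2.0, 2.5, 3.5]
velocity = finite_differences(er_series)      # first-order change
acceleration = finite_differences(velocity)   # second-order change
print(er_uniform, er_peaked, velocity, acceleration)
```

The singular values would in practice come from a decomposition of a hidden-state matrix; they are passed in directly here so the sketch needs only the standard library.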
MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications
Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines
This paper presents the development and evaluation of a Retrieval-Augmented Generation (RAG) system for querying the United Kingdom's National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs). The extensive length and volume of these guidelines can impede their utilisation within a time-constrained healthcare system, a challenge this project addresses through the creation of a system capable of providing users with precisely matched information in response to natural language queries. The system's retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a corpus of 10,195 text chunks derived from three hundred guidelines. It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7901 queries. The most significant impact of the RAG system was observed during the generation phase. When evaluated on a manually curated dataset of seventy question-answer pairs, RAG-enhanced models showed substantial gains in performance. Faithfulness, the measure of whether an answer is supported by the source text, was increased by 64.7 percentage points to 99.5% for the RAG-enhanced O4-Mini model and significantly outperformed the medical-focused Meditron3-8B LLM, which scored 43%. Clinical evaluation by seven Subject Matter Experts (SMEs) further validated these findings, with GPT-4.1 achieving 98.7% accuracy while reducing unsafe responses by 67% compared to O4-Mini (from 3.0 to 1.0 per evaluator). This study thus establishes RAG as an effective, reliable, and scalable approach for applying generative AI in healthcare, enabling cost-effective access to medical guidelines.
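For readers unfamiliar with the retrieval metrics quoted above, the following sketch shows how Mean Reciprocal Rank and Recall@k are conventionally computed. It assumes exactly one relevant chunk per query, which the abstract does not state, and the chunk identifiers are toy values rather than data from the paper.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked_ids, relevant_id) pairs."""
    return sum(reciprocal_rank(r, g) for r, g in results) / len(results)


def recall_at_k(results, k):
    """Fraction of queries whose relevant chunk appears in the top k."""
    hits = sum(1 for ranked, gold in results if gold in ranked[:k])
    return hits / len(results)


# Four toy queries; the relevant chunk is "a" in every case.
results = [
    (["a", "b", "c"], "a"),  # found at rank 1
    (["b", "a", "c"], "a"),  # found at rank 2
    (["c", "b", "a"], "a"),  # found at rank 3
    (["b", "c", "d"], "a"),  # not retrieved
]
mrr = mean_reciprocal_rank(results)
r_at_1 = recall_at_k(results, 1)
r_at_3 = recall_at_k(results, 3)
print(round(mrr, 4), r_at_1, r_at_3)
```

Under the single-relevant-chunk assumption, Recall@1 equals the share of queries answered by the first retrieved chunk, which is how the "Recall of 81% at the first chunk" figure can be read.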
Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation
The rapid development of parameter-efficient fine-tuning methods has noticeably improved the efficiency of adapting large language models. Among these, LoRA has gained widespread popularity due to its strong balance of effectiveness and parameter efficiency. However, LoRA relies on initializing two low-rank matrices whose product is zero, which limits its ability to effectively activate and leverage the original model weights, creating a potential bottleneck for optimal performance. To address this limitation, we propose IniLoRA, a novel initialization strategy that initializes the low-rank matrices to closely approximate the original model weights. Experimental results indicate that IniLoRA achieves better performance than LoRA across a range of models and tasks. Additionally, we introduce two variants, IniLoRA-$α$ and IniLoRA-$β$, both leveraging distinct initialization methods to enhance performance further.
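The zero-product property of standard LoRA initialization mentioned above can be checked with a small pure-Python sketch: one factor is drawn at random, the other is all zeros, so the update B·A starts at exactly zero. The dimensions, seed, and standard deviation are illustrative assumptions. IniLoRA's own initialization is not shown, since the abstract does not specify how the approximation to the original weights is obtained.

```python
import random


def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_init(d_out, d_in, rank, seed=0):
    """Standard LoRA-style init: A is random Gaussian, B is all zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return B, A


B, A = lora_init(d_out=4, d_in=6, rank=2)
delta_W = matmul(B, A)  # low-rank update added to the frozen weight matrix
is_zero = all(v == 0.0 for row in delta_W for v in row)
print(len(delta_W), len(delta_W[0]), is_zero)
```

Starting from a zero update means the adapted model initially behaves exactly like the pretrained one, which is the design choice the abstract identifies as a possible bottleneck.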
Jina-VLM: Small Multilingual Vision Language Model
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .
comment: 18 pages, 1-7 main content, 13-18 appendix for tables and dataset
SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding "hallucinating" falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. SeSE operates in a zero-resource manner and is applicable to both open- and closed-source LLMs, making it an "off-the-shelf" solution for new models and tasks. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation, we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines.
comment: 14 pages of main text and 10 pages of appendices; submitted to IEEE TKDE
Grounding LLM Reasoning with Knowledge Graphs
Large Language Models (LLMs) excel at generating natural language answers, yet their outputs often remain unverifiable and difficult to trace. Knowledge Graphs (KGs) offer a complementary strength by representing entities and their relationships in structured form, providing a foundation for more reliable reasoning. We propose a novel framework that integrates LLM reasoning with KGs by linking each step of the reasoning process to graph-structured data. This grounding turns intermediate "thoughts" into interpretable traces that remain consistent with external knowledge. Our approach incorporates multiple reasoning strategies, Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT), and is evaluated on GRBench, a benchmark for domain-specific graph reasoning. Our experiments show state-of-the-art (SOTA) performance, with at least 26.5% improvement over CoT baselines. Beyond accuracy, we analyze how step depth, branching structure, and model size influence reasoning quality, offering insights into the conditions that support effective reasoning. Together, these contributions highlight how grounding LLMs in structured knowledge enables both higher accuracy and greater interpretability in complex reasoning tasks.
Control Illusion: The Failure of Instruction Hierarchies in Large Language Models AAAI-26
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. Interestingly, we also find that societal hierarchy framings (e.g., authority, expertise, consensus) show stronger influence on model behavior than system/user roles, suggesting that pretraining-derived social structures function as latent behavioral priors with potentially greater impact than post-training guardrails.
comment: Accepted to AAAI-26 Main Technical Track Proceedings
Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers
Suicide remains a leading cause of death in Western countries. As social media becomes central to daily life, digital footprints offer valuable insight into suicidal behavior. Focusing on individuals who attempted suicide while uploading videos to their channels, we investigate: How do linguistic patterns on YouTube reflect suicidal behavior, and how do these patterns align with or differ from expert knowledge? We examined linguistic changes around suicide attempts and compared individuals who attempted suicide while actively uploading to their channel with three control groups: those with prior attempts, those experiencing major life events, and matched individuals from the broader cohort. Applying complementary bottom-up, hybrid, and expert-driven approaches, we analyzed a novel longitudinal dataset of 181 suicide-attempt channels and 134 controls. In the bottom-up analysis, LLM-based topic modeling identified 166 topics; five were linked to suicide attempts, of which two also showed attempt-related temporal changes (Mental Health Struggles, $OR = 1.74$; YouTube Engagement, $OR = 1.67$; $p < .01$). In the hybrid approach, clinical experts reviewed LLM-derived topics and flagged 19 as suicide-related. However, none showed significant effects beyond those identified bottom-up. YouTube Engagement, a platform-specific indicator, was not flagged, underscoring the value of bottom-up discovery. A top-down psychological assessment of suicide narratives revealed differing motivations: individuals describing prior attempts aimed to help others ($β=-1.69$, $p<.01$), whereas those who attempted during the uploading period emphasized personal recovery ($β=1.08$, $p<.01$). By integrating these approaches, we offer a nuanced understanding of suicidality, bridging digital behavior and clinical insights.
ChatGPT for President! Presupposed content in politicians versus GPT-generated texts
This study examines ChatGPT-4's capability to replicate linguistic strategies used in political discourse, focusing on its potential for manipulative language generation. As large language models become increasingly popular for text generation, concerns have grown regarding their role in spreading fake news and propaganda. This research compares real political speeches with those generated by ChatGPT, emphasizing presuppositions (a rhetorical device that subtly influences audiences by packaging some content as already known at the moment of utterance, thus swaying opinions without explicit argumentation). Using a corpus-based pragmatic analysis, this study assesses how well ChatGPT can mimic these persuasive strategies. The findings reveal that although ChatGPT-generated texts contain many manipulative presuppositions, key differences emerge in their frequency, form, and function compared with those of politicians. For instance, ChatGPT often relies on change-of-state verbs used in fixed phrases, whereas politicians use presupposition triggers in more varied and creative ways. Such differences, however, are challenging to detect with the naked eye, underscoring the potential risks posed by large language models in political and public discourse.
comment: 36 pages, 6 figures
Which Type of Students can LLMs Act? Investigating Authentic Simulation with Graph-based Human-AI Collaborative System
While rapid advances in large language models (LLMs) are reshaping data-driven intelligent education, accurately simulating students remains an important but challenging bottleneck for scalable educational data collection, evaluation, and intervention design. However, current works are limited by scarce real interaction data, costly expert evaluation for realism, and a lack of large-scale, systematic analyses of LLMs' ability to simulate students. We address this gap by presenting a three-stage LLM-human collaborative pipeline to automatically generate and filter high-quality student agents. We leverage two-round automated scoring validated by human experts and deploy a score propagation module to obtain more consistent scores across the student similarity graph. Experiments show that combining automated scoring, expert calibration, and graph-based propagation yields simulated students that more closely track human judgments of authenticity. We then analyze which profiles and behaviors are simulated more faithfully, supporting subsequent studies on personalized learning and educational assessment.
comment: This work has been submitted to AI Open for possible publication
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai AACL 2025
We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.
comment: Accepted at IJCNLP-AACL 2025 (Main). This version includes the corrected Table 2 and an updated conclusion regarding the deletion count of the Gemma model
TaoSR1: The Thinking Model for E-commerce Relevance Search
Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition NeurIPS'25
Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
comment: NeurIPS'25 Spotlight
Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs
Retrieval of information from graph-structured knowledge bases represents a promising direction for improving the factuality of LLMs. While various solutions have been proposed, a comparison of methods is difficult due to the lack of challenging QA datasets with ground-truth targets for graph retrieval. We present SynthKGQA, an LLM-powered framework for generating high-quality Knowledge Graph Question Answering datasets from any Knowledge Graph, providing the full set of ground-truth facts in the KG to reason over questions. We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset designed to test zero-shot generalization abilities of KG retrievers with respect to unseen graph structures and relation types, and benchmark popular solutions for KG-augmented LLMs on it.
An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary linguistically and parametrically. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals but drops by 4.7 percentage points on surface-renaming variants, and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
comment: 34 pages, 9 figures
Scaling Towards the Information Boundary of Instruction Sets: The Infinity Instruct Subject Technical Report
Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both ``coverage'' (coverage of task types and knowledge areas) and ``depth'' (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical tagging system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct Infinity Instruct Subject, a high-quality dataset containing $\sim$1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that Infinity Instruct Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
In-Context Representation Hijacking
We introduce $\textbf{Doublespeak}$, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling AAAI 2026
Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short video dataset with 4-level relevance annotations, filling a critical resource void. Further, we propose a semi-supervised synthetic data pipeline where two collaboratively trained models generate domain-adaptive short video data with controllable relevance labels. Our method enhances relevance-level diversity by synthesizing samples for underrepresented intermediate relevance labels, resulting in a more balanced and semantically rich training data set. Extensive offline experiments show that the embedding model trained on our synthesized data outperforms those using data generated based on prompting or vanilla supervised fine-tuning (SFT). Moreover, we demonstrate that incorporating more diverse fine-grained relevance levels in training data enhances the model's sensitivity to subtle semantic distinctions, highlighting the value of fine-grained relevance supervision in embedding learning. In the search enhanced recommendation pipeline of Douyin's dual-column scenario, through online A/B testing, the proposed model increased click-through rate (CTR) by 1.45%, raised the Strong Relevance Ratio (SRR) by 4.9%, and improved the Image User Penetration Rate (IUPR) by 0.1054%.
comment: Submitted to AAAI 2026
Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing
Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and the Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model's true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from "test-like" to "deploy-like" and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten "deploy-like" prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router
Chain-of-thought has been proven essential for enhancing the complex reasoning abilities of Large Language Models (LLMs), but it also leads to high computational costs. Recent work has explored routing queries among multiple models and shown it to be a promising approach. However, previous works operate directly at the task level, i.e., assigning user queries to suitable LLMs, which does not allow hybrid LLMs to truly collaborate on finer-grained sub-tasks. Collaboration at the level of intermediate reasoning steps (thoughts) could enable more efficient coordination, but it also poses significant challenges for router scheduling, placing immense demands on the quality of task decomposition and the precision of the router. To address this, we propose R2-Reasoner, a novel framework centered around a Reinforced Model Router designed to efficiently scale LLM reasoning. This router orchestrates collaboration across nine heterogeneous models, whose parameter scales range from less than 1B to hundreds of billions, by first breaking down a complex query into subtasks with a decomposer, and then assigning each subtask to the optimal model with a subtask allocator, balancing performance with cost. Training this router involves a two-stage alternating process for the decomposer and the allocator, integrating supervised fine-tuning with reinforcement learning to enable effective self-supervised refinement. Extensive experiments across six challenging reasoning benchmarks demonstrate that R2-Reasoner reduces API costs by 84.46% compared with state-of-the-art baselines while maintaining competitive reasoning accuracy. Our framework paves the way for the development of more scalable and efficient reasoning systems. Our code is open-source at https://anonymous.4open.science/r/R2_Reasoner.
Nexus: Higher-Order Attention Mechanisms in Transformers
Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose Nexus, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Nexus dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations \textit{prior} to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Nexus outperforms standard Transformers on multiple benchmarks.
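The idea of producing Queries and Keys through inner attention before the final attention step can be illustrated with a small NumPy sketch. This is only an assumption-laden toy, not the Nexus architecture: the function names, the choice to refine Q and K against each other, and the `depth` parameter are all hypothetical. Reusing the same projections at every recursive step stands in for the weight sharing the abstract mentions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (single head, no masking).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def nested_attention(x, wq, wk, wv, depth=1):
    # Hypothetical sketch: start from the usual linear projections...
    q, k, v = x @ wq, x @ wk, x @ wv
    # ...then refine Q and K with inner attention passes. No new
    # parameters are introduced per pass (assumed form of weight sharing).
    for _ in range(depth):
        q, k = attention(q, k, q), attention(k, q, k)
    # The final attention uses the refined Q and K.
    return attention(q, k, v)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # 5 tokens, width 8
wq, wk, wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out_standard = nested_attention(x, wq, wk, wv, depth=0)
out_nested = nested_attention(x, wq, wk, wv, depth=2)
```

With `depth=0` the function reduces to ordinary first-order attention, which makes the inner loop the only difference between the two outputs.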
FusionBench: A Unified Library and Comprehensive Benchmark for Deep Model Fusion
Deep model fusion is an emerging technique that unifies the predictions or parameters of several deep neural networks into a single better-performing model in a cost-effective and data-efficient manner. Although a variety of deep model fusion techniques have been introduced, their evaluations tend to be inconsistent and often inadequate to validate their effectiveness and robustness. We present FusionBench, the first benchmark and a unified library designed specifically for deep model fusion. Our benchmark consists of multiple tasks, each with different settings of models and datasets. This variety allows us to compare fusion methods across different scenarios and model scales. Additionally, FusionBench serves as a unified library for easy implementation and testing of new fusion techniques. FusionBench is open source and actively maintained, with community contributions encouraged. Homepage https://github.com/tanganke/fusion_bench
comment: Project homepage: https://github.com/tanganke/fusion_bench Online documentation: https://tanganke.github.io/fusion_bench
Reversing Large Language Models for Efficient Training and Fine-Tuning
Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, LLMs are often fine-tuned for a specific task, starting from the weights of a pre-trained foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Unlike standard baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, removing the need to store activations. This property allows for a drastic reduction in memory consumption, allowing for the processing of larger batch sizes for the same available memory, thereby offering improved throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.
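To make the memory argument concrete, the sketch below shows how a reversible layer lets inputs be recomputed from outputs instead of being stored. It uses the well-known additive coupling form (as in RevNet-style blocks), swapped in here for illustration; the abstract's architectures are derived from symmetric and symplectic differential equations and may differ. The sub-function `f` and all weights are hypothetical.

```python
import numpy as np

def f(x, w):
    # Hypothetical residual sub-function; any deterministic map works.
    return np.tanh(x @ w)

def forward(x1, x2, wf, wg):
    # Additive coupling: each half is updated using only the other half.
    y1 = x1 + f(x2, wf)
    y2 = x2 + f(y1, wg)
    return y1, y2

def inverse(y1, y2, wf, wg):
    # Undo the updates in reverse order; no stored activations needed.
    x2 = y2 - f(y1, wg)
    x1 = y1 - f(x2, wf)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
wf, wg = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

y1, y2 = forward(x1, x2, wf, wg)
r1, r2 = inverse(y1, y2, wf, wg)
# Reconstruction error is at floating-point round-off level.
max_err = max(np.abs(r1 - x1).max(), np.abs(r2 - x2).max())
```

During backpropagation, a framework built on this idea would call the inverse layer by layer to regenerate hidden states, trading extra compute for activation memory.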
Dual-branch Prompting for Multimodal Machine Translation
Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
comment: This manuscript is currently under review at the ACM Transactions on Multimedia Computing, Communications, and Applications
REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation AAAI 2026
Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
comment: Accepted to AAAI 2026 (Oral Track)
Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings
Selective state-space models excel at long-sequence modeling, but their capacity for language representation -- in complex hierarchical reasoning -- remains underexplored. Most large language models rely on \textit{flat} Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this, we propose {\it Hierarchical Mamba (HiM)}, integrating efficient Mamba2 with hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincaré ball or Lorentzian manifold with ``learnable'' curvature, optimized with a hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. Experimental results show both HiM variants effectively capture hierarchical relationships across four linguistic and medical datasets, surpassing Euclidean baselines, with HiM-Poincaré providing fine-grained distinctions with higher h-norms, while HiM-Lorentz offers more stable, compact, and hierarchy-preserving embeddings, favoring robustness. The source code is publicly available at https://github.com/BerryByte/HiM.
comment: 10 pages, 3 figures
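As a rough illustration of projecting Euclidean vectors onto the Poincaré ball, the sketch below uses the standard exponential map at the origin and the standard Poincaré distance for a ball of curvature $-c$. It is not taken from the HiM code: the function names are hypothetical, and `c` is a fixed constant here, whereas the abstract describes the curvature as learnable.

```python
import numpy as np

def exp_map_origin(v, c=1.0):
    # Map a Euclidean (tangent) vector at the origin into the ball
    # of curvature -c: tanh(sqrt(c)*|v|) * v / (sqrt(c)*|v|).
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    sc = np.sqrt(c)
    return np.tanh(sc * norm) * v / (sc * norm)

def poincare_distance(x, y, c=1.0):
    # Geodesic distance between two points inside the ball.
    diff2 = np.sum((x - y) ** 2)
    denom = (1 - c * np.sum(x ** 2)) * (1 - c * np.sum(y ** 2))
    return np.arccosh(1 + 2 * c * diff2 / denom) / np.sqrt(c)

origin = np.zeros(2)
parent = exp_map_origin(np.array([0.3, 0.0]))   # hypothetical shallow node
child = exp_map_origin(np.array([1.5, 0.2]))    # hypothetical deeper node

d_parent = poincare_distance(origin, parent)
d_child = poincare_distance(origin, child)
```

Points mapped from larger Euclidean norms land closer to the boundary and farther from the origin in hyperbolic distance, which is the property that makes the ball convenient for encoding depth in a hierarchy.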
LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025
Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off--unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level--and makes powerful LLMs practical for resource-constrained environments. Our code can be found at https://github.com/SamsungLabs/LittleBit.
comment: Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed equally
LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models
Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit their accessibility for many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs ($\sim$3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.
A Survey on Diffusion Language Models
Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
Human-Computer Interaction
Perceptually-Minimal Color Optimization for Web Accessibility: A Multi-Phase Constrained Approach
Web accessibility guidelines require sufficient color contrast between text and backgrounds; yet, manually adjusting colors often necessitates significant visual deviation, compromising vital brand aesthetics. We present a novel, multi-phase optimization approach for automatically generating WCAG-compliant colors while minimizing perceptual change to original design choices. Our method treats this as a constrained, non-linear optimization problem, utilizing the modern perceptually uniform OKLCH color space. Crucially, the optimization is constrained to preserve the original hue ($\text{H}$) of the color, ensuring that modifications are strictly limited to necessary adjustments in lightness ($\text{L}$) and chroma ($\text{C}$). This is achieved through a three-phase sequence: binary search, gradient descent, and progressive constraint relaxation. Evaluation on a dataset of 10,000 procedurally generated color pairs demonstrates that the algorithm successfully resolves accessibility violations in $77.22\%$ of cases, with $88.51\%$ of successful corrections exhibiting imperceptible color difference ($ΔE_{2000} < 2.0$) as defined by standard perceptibility thresholds. The median perceptual change for successful adjustments is only $0.76\ ΔE_{2000}$, and the algorithm achieves this with a median processing time of $0.876\text{ms}$ per color pair. The approach demonstrates that accessibility compliance and visual design integrity can be achieved simultaneously through a computationally efficient, perceptually-aware optimization that respects brand identity. The algorithm is publicly implemented in the open-source cm-colors Python library.
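The binary-search phase can be sketched with the WCAG 2.x relative-luminance and contrast-ratio formulas. This is a simplified stand-in, not the cm-colors API: it darkens the text color by uniformly scaling its sRGB channels rather than adjusting lightness and chroma in OKLCH, so it does not strictly preserve OKLCH hue. The function names and the 4.5:1 default target (the WCAG AA threshold for normal text) are assumptions of this sketch.

```python
def relative_luminance(rgb):
    # WCAG 2.x relative luminance of an 8-bit sRGB color.
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    # WCAG contrast ratio: lighter luminance over darker, each offset by 0.05.
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    return (max(l1, l2) + 0.05) / (min(l1, l2) + 0.05)

def darken_until_compliant(fg, bg, target=4.5, iters=30):
    # Hypothetical helper: find the smallest uniform darkening of `fg`
    # that reaches `target` contrast against a light background `bg`.
    if contrast_ratio(fg, bg) >= target:
        return fg                      # already compliant, change nothing
    if contrast_ratio((0, 0, 0), bg) < target:
        return None                    # even black text fails on this background
    lo, hi = 0.0, 1.0                  # scale lo is compliant, scale hi is not
    for _ in range(iters):
        mid = (lo + hi) / 2
        candidate = tuple(round(c * mid) for c in fg)
        if contrast_ratio(candidate, bg) >= target:
            lo = mid                   # compliant: try a smaller change
        else:
            hi = mid                   # not compliant: darken further
    return tuple(round(c * lo) for c in fg)

white = (255, 255, 255)
original = (150, 150, 150)             # mid gray, fails 4.5:1 on white
fixed = darken_until_compliant(original, white)
```

A hue-preserving version would run the same search over the lightness coordinate after converting to OKLCH, and would fall back to the later phases (gradient descent, constraint relaxation) when the search alone cannot satisfy the constraint.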
The AI Consumer Index (ACE)
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
From Symptoms to Systems: An Expert-Guided Approach to Understanding Risks of Generative AI for Eating Disorders
Generative AI systems may pose serious risks to individuals vulnerable to eating disorders. Existing safeguards tend to overlook subtle but clinically significant cues, leaving many risks unaddressed. To better understand the nature of these risks, we conducted semi-structured interviews with 15 clinicians, researchers, and advocates with expertise in eating disorders. Using abductive qualitative analysis, we developed an expert-guided taxonomy of generative AI risks across seven categories: (1) providing generalized health advice; (2) encouraging disordered behaviors; (3) supporting symptom concealment; (4) creating thinspiration; (5) reinforcing negative self-beliefs; (6) promoting excessive focus on the body; and (7) perpetuating narrow views about eating disorders. Our results demonstrate how certain user interactions with generative AI systems intersect with clinical features of eating disorders in ways that may intensify risk. We discuss implications of our work, including approaches for risk assessment, safeguard design, and participatory evaluation practices with domain experts.
Interactive Communication - cross-disciplinary perspectives from psychology, acoustics, and media technology
Interactive communication (IC), i.e., the reciprocal exchange of information between two or more interactive partners, is a fundamental part of human nature. As such, it has been studied across multiple scientific disciplines with different goals and methods. This article provides a cross-disciplinary primer on contemporary IC that integrates psychological mechanisms with acoustic and media-technological constraints across theory, measurement, and applications. First, we outline theoretical frameworks that account for verbal, nonverbal and multimodal aspects of IC, including distinctions between face-to-face and computer-mediated communication. Second, we summarize key methodological approaches, including behavioral, cognitive, and experiential measures of communicative synchrony and acoustic signal quality. Third, we discuss selected applications, i.e., assistive listening technologies and conversational agents, alongside ethical considerations. Taken together, this review highlights how human capacities and technical systems jointly shape IC, consolidating concepts, findings, and challenges that have often been discussed in separate lines of research.
Generative AI for Self-Adaptive Systems: State of the Art and Research Roadmap
Self-adaptive systems (SASs) are designed to handle changes and uncertainties through a feedback loop with four core functionalities: monitoring, analyzing, planning, and execution. Recently, generative artificial intelligence (GenAI), especially the area of large language models, has shown impressive performance in data comprehension and logical reasoning. These capabilities are highly aligned with the functionalities required in SASs, suggesting a strong potential to employ GenAI to enhance SASs. However, the specific benefits and challenges of employing GenAI in SASs remain unclear. Providing a comprehensive understanding of these benefits and challenges is complex for several reasons: limited publications in the SAS field, the technological and application diversity within SASs, and the rapid evolution of GenAI technologies. To that end, this paper aims to provide researchers and practitioners with a comprehensive snapshot that outlines the potential benefits and challenges of employing GenAI within SASs. Specifically, we gather, filter, and analyze literature from four distinct research fields and organize it into two main categories of potential benefits: (i) enhancements to the autonomy of SASs centered around the specific functions of the MAPE-K feedback loop, and (ii) improvements in the interaction between humans and SASs within human-on-the-loop settings. From our study, we outline a research roadmap that highlights the challenges of integrating GenAI into SASs. The roadmap starts with outlining key research challenges that need to be tackled to exploit the potential for applying GenAI in the field of SAS. The roadmap concludes with a practical reflection, elaborating on current shortcomings of GenAI and proposing possible mitigation strategies.
comment: Accepted by ACM Transactions on Autonomous and Adaptive Systems
A Modular Cognitive Architecture for Assisted Reasoning: The Nemosine Framework
This paper presents the Nemosine Framework, a modular cognitive architecture designed to support assisted reasoning, structured thinking, and systematic analysis. The model operates through functional cognitive modules ("personas") that organize tasks such as planning, evaluation, cross-checking, and narrative synthesis. The framework combines principles from metacognition, distributed cognition, and modular cognitive systems to offer an operational structure for assisted problem-solving and decision support. The architecture is documented through formal specification, internal consistency criteria, and reproducible structural components. The goal is to provide a clear conceptual basis for future computational implementations and to contribute to the study of symbolic-modular architectures for reasoning.
comment: 6 pages, 1 figure. First version
Persona-based Multi-Agent Collaboration for Brainstorming
We demonstrate the importance of persona-based multi-agent brainstorming for both diverse topics and subject matter ideation. Prior work has shown that generalized multi-agent collaboration often provides better reasoning than a single agent alone. In this paper, we propose and develop a framework for persona-based agent selection, showing how persona domain curation can improve brainstorming outcomes. Using multiple experimental setups, we evaluate brainstorming outputs across different persona pairings (e.g., Doctor vs VR Engineer) and A2A (agent-to-agent) dynamics (separate, together, separate-then-together). Our results show that (1) persona choice shapes idea domains, (2) collaboration mode shifts diversity of idea generation, and (3) multi-agent persona-driven brainstorming produces idea depth and cross-domain coverage.
comment: 12 pages, 8 figures
What is Beyond Presence? Dimensionality, Control, and Information Spaces
What is after presence? Spatial presence, the sense of "being there", is becoming less of a primary objective and more of a baseline expectation of virtual reality. More than six decades after its invention, VR is shifting from a technical system into a cultural, social, and phenomenological medium, offering experiences that function as distinct modes of reality. Existing theories that focus primarily on perceptual illusions are no longer sufficient to account for these emerging forms of experience. A new framework is needed to guide the design and evaluation of immersive environments by identifying the key technical and abstract dimensions afforded by virtual worlds. These dimensions include spatial, placeness, temporal, social, cultural, cognitive, and psychological parameters. The central argument is that immersive environments must move beyond the technical dimension to leverage richer information channels that shape user experience. This shift from presence to experience orchestration invites creators across disciplines to contribute to the design and assessment of meaningful immersive worlds.
comment: 38 pages, accepted for Presence: Virtual and Augmented Reality 2026(37)
SmartAlert: Implementing Machine Learning-Driven Clinical Decision Support for Inpatient Lab Utilization Reduction
Repetitive laboratory testing unlikely to yield clinically useful information is a common practice that burdens patients and increases healthcare costs. Education and feedback interventions have limited success, while general test ordering restrictions and electronic alerts impede appropriate clinical care. We introduce and evaluate SmartAlert, a machine learning (ML)-driven clinical decision support (CDS) system integrated into the electronic health record that predicts stable laboratory results to reduce unnecessary repeat testing. This case study describes the implementation process, challenges, and lessons learned from deploying SmartAlert targeting complete blood count (CBC) utilization in a randomized controlled pilot across 9270 admissions in eight acute care units across two hospitals between August 15, 2024, and March 15, 2025. Results show a significant decrease in the number of CBC results within 52 hours of SmartAlert display (1.54 vs 1.82, p < 0.01) without adverse effect on secondary safety outcomes, representing a 15% relative reduction in repetitive testing. Implementation lessons learned include interpretation of probabilistic model predictions in clinical contexts, stakeholder engagement to define acceptable model behavior, governance processes for deploying a complex model in a clinical environment, user interface design considerations, alignment with clinical operational priorities, and the value of qualitative feedback from end users. In conclusion, a machine learning-driven CDS system backed by a deliberate implementation and governance process can provide precision guidance on inpatient laboratory testing to safely reduce unnecessary repetitive testing.
comment: 22 pages, 5 figures
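As a quick check of the figures reported in the SmartAlert abstract, and assuming the 15% is computed relative to the comparison-arm mean of 1.82 CBC results (the abstract does not state the denominator), the arithmetic works out as:

```latex
\[
\frac{1.82 - 1.54}{1.82} = \frac{0.28}{1.82} \approx 0.154 \approx 15\%
\]
```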
Systematically Evaluating Equivalent Purpose for Digital Maps
Digital geographic maps remain largely inaccessible to blind and low-vision individuals (BLVIs), despite global legislation adopting the Web Content Accessibility Guidelines (WCAG). A critical gap exists in defining "equivalent purpose" for maps under WCAG Success Criterion 1.1.1, which requires that non-text content provide a text alternative that serves the "equivalent purpose". This paper proposes a systematic framework for evaluating map accessibility, called the Map Equivalent-Purpose Framework (MEP Framework), defining purpose through three items (Generalized, Spatial Information, and Spatial Relationships), and establishing 15 measurable criteria for equivalent information communication. Eight text map representations were evaluated against visual map baselines using the proposed MEP Framework. Results show that legacy methods such as tables and turn-by-turn directions fail to meet the MEP Framework criteria, while Audiom Maps, Multi User Domain (MUD) Maps, and Audio Descriptions meet the criteria. The evaluation highlights the necessity of holistic, systematic approaches to ensure non-visual maps convey all generalized spatial information and relationships present in visual maps. The MEP Framework provides a replicable methodology for comprehensively assessing digital map accessibility, clarifying WCAG's "equivalent purpose", and guiding compliant and usable map creation. Compliant maps will support BLVIs' participation in map-dependent professions and civic engagement.
comment: In press at Journal on Technology and Persons with Disabilities, volume 14
XR-DT: Extended Reality-Enhanced Digital Twin for Agentic Mobile Robots
As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial effort has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots' inferences, impeding deployment in safety-critical and socially embedded environments. This paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for agentic mobile robots that bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates virtual-, augmented-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable AR devices. Within this framework, we design an agentic mobile robot system with a unified diffusion policy for context-aware task adaptation. We further propose a chain-of-thought prompting mechanism that allows multimodal large language models to reason over human instructions and environmental context, while leveraging an AutoGen-based multi-agent coordination layer to enhance robustness and collaboration in dynamic tasks. Initial experimental results demonstrate accurate human and robot trajectory prediction, validating the XR-DT framework's effectiveness in HRI tasks. By embedding human intention, environmental dynamics, and robot cognition into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.
comment: 10 pages, 5 figures
Towards A Cultural Intelligence and Values Inferences Quality Benchmark for Community Values and Common Knowledge
Large language models (LLMs) have emerged as a powerful technology, and thus, we have seen widespread adoption and use on software engineering teams. Most often, LLMs are designed as "general purpose" technologies meant to represent the general population. Unfortunately, this often means alignment with predominantly Western Caucasian narratives and misalignment with other cultures and populations that engage in collaborative innovation. In response to this misalignment, there have been recent efforts centered on the development of "culturally-informed" LLMs, such as ChatBlackGPT, that are capable of better aligning with historically marginalized experiences and perspectives. Despite this progress, there has been little effort aimed at supporting our ability to develop and evaluate culturally-informed LLMs. A recent effort proposed an approach for developing a national alignment benchmark that emphasizes alignment with national social values and common knowledge. However, given the range of cultural identities present in the United States (U.S.), a national alignment benchmark is an ineffective goal for broader representation. To help fill this gap in the U.S. context, we propose a replication study that translates the process used to develop KorNAT, a Korean National LLM alignment benchmark, to develop CIVIQ, a Cultural Intelligence and Values Inference Quality benchmark centered on alignment with community social values and common knowledge. Our work provides a critical foundation for research and development aimed at cultural alignment of AI technologies in practice.
comment: Under review
Head, posture, and full-body gestures in dyadic conversations
When face-to-face communication becomes effortful due to background noise and interfering talkers, the role of visual cues becomes increasingly important for communication success. While previous research has selectively investigated head or hand movements, here we explore the combination of movements of head, hand and the whole body in acoustically adverse conditions. We hypothesize that with increasing background noise level, the frequency of typical conversational movements of hand, head, trunk, and legs increases to support the speaker's role, while the listeners support their role by increased use of confirmative head gestures and head and trunk movements to increase the signal-to-noise ratio. We conducted a dyadic conversation experiment in which normal-hearing participants (n = 8) stood freely in an audiovisual virtual environment. The conversational movements were described by a newly developed labeling system for typical conversational movements, and the frequency of individual types was analyzed. Increased levels of background noise led to increased hand-gesture complexity and modulation of head movements without a clear pattern. People leaned forward slightly more and used fewer head movements during listening than during speaking. Additional analysis of hand-speech synchrony, with hypothesized loss of synchrony due to the background noise, showed a modest decrease of synchrony in terms of increased standard deviation at moderate sound levels. The results support previous findings in terms of the gesturing frequency, and we found limited support for the changes in speech-gesture synchrony. The work reveals communication patterns of the whole body and exemplifies interactive communication in the context of multimodal adaptation to communication needs.
comment: 7 figures, 10 tables, 29 pages
"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conducted a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find GPT-4o achieves the highest task success rate. We further conduct a user study to reveal concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.
comment: 17 pages
Objective Measurement of AI Literacy: Development and Validation of the AI Competency Objective Scale (AICOS)
As Artificial Intelligence (AI) becomes more pervasive in various aspects of life, AI literacy is becoming a fundamental competency that enables individuals to move safely and competently in an AI-pervaded world. There is a growing need to measure this competency, e.g., to develop targeted educational interventions. Although several measurement tools already exist, many have limitations regarding subjective data collection methods, target group differentiation, validity, and integration of current developments such as Generative AI Literacy. This study develops and validates the AI Competency Objective Scale (AICOS) for measuring AI literacy objectively. The presented scale addresses weaknesses and offers a robust measurement approach that considers established competency and measurement models, captures central sub-competencies of AI literacy, and integrates the dimension of Generative AI Literacy. The AICOS provides a sound and comprehensive measure of AI literacy, and initial analyses show potential for a modular structure. Furthermore, a first edition of a short version of the AICOS is developed. Due to its methodological foundation, extensive validation, and integration of recent developments, the test represents a valuable resource for scientific research and practice in educational institutions and professional contexts. The AICOS significantly contributes to the development of standardized measurement instruments and enables the targeted assessment and development of AI skills in different target groups.
AI summaries in online search influence users' attitudes
This study examined how AI-generated summaries, which have become visually prominent in online search results, affect how users think about different issues. In a preregistered randomized controlled experiment, participants (N = 2,004) viewed mock search result pages varying in the presence (vs. absence), placement (top vs. middle), and stance (benefit-framed vs. harm-framed) of AI-generated summaries across four publicly debated topics. Compared to a no-summary control group, participants exposed to AI-generated summaries reported issue attitudes, behavioral intentions, and policy support that aligned more closely with the AI summary stance. The summaries placed at the top of the page produced stronger shifts in users' issue attitudes (but not behavioral intentions or policy support) than those placed in the middle of the page. We also observed moderating effects from issue familiarity and general trust toward AI. In addition, users perceived the AI summaries as more useful when they emphasized health harms versus benefits. These findings suggest that AI-generated search summaries can significantly shape public perceptions, raising important implications for the design and regulation of AI-integrated information ecosystems.
Expertise elevates AI usage: experimental evidence comparing laypeople and professional artists
Generative AI's novel capacities raise questions about the future role of human expertise: does AI level the playing field between professional artists and laypeople, or does expertise enhance AI use? Do the cognitive skills experts make use of in analyzing and drawing visual art also transfer to using these new tools? This pre-registered study conducts experimental comparisons between 50 professional artists and a demographically matched sample of laypeople. Our interdisciplinary team developed two tasks involving image replication and creative image creation, assessing participants' copying accuracy and divergent thinking. We implemented a bespoke platform for the experiment, powered by a modern text-to-image AI. Results reveal that artists produced more accurate copies and more divergent ideas than lay participants, highlighting a skill transfer of professional expertise, even to the confined space of generative AI. We also explored how well an exemplary vision-capable large language model (GPT-4o) would fare: on par in copying and slightly better on average than artists in the creative task, although never above the best humans. These findings highlight the importance of integrating artistic skills with AI, suggesting a potential for collaborative synergy that could reshape creative industries and arts education.
comment: Eisenmann and Karjus contributed equally to this work and share first authorship
When Ads Become Profiles: Uncovering the Invisible Risk of Web Advertising at Scale with LLMs
Regulatory limits on explicit targeting have not eliminated algorithmic profiling on the Web, as optimisation systems still adapt ad delivery to users' private attributes. The widespread availability of powerful zero-shot multimodal Large Language Models (LLMs) has dramatically lowered the barrier for exploiting these latent signals for adversarial inference. We investigate this emerging societal risk, specifically how adversaries can now exploit these signals to reverse-engineer private attributes from ad exposure alone. We introduce a novel pipeline that leverages LLMs as adversarial inference engines to perform natural language profiling. Applying this method to a longitudinal dataset comprising over 435,000 ad impressions collected from 891 users, we conducted a large-scale study to assess the feasibility and precision of inferring private attributes from passive online ad observations. Our results demonstrate that off-the-shelf LLMs can accurately reconstruct complex user private attributes, including party preference, employment status, and education level, consistently outperforming strong census-based priors and matching or exceeding human social perception, while operating at only a fraction of the cost (223$\times$ lower) and time (52$\times$ faster) required by humans. Critically, actionable profiling is feasible even within short observation windows, indicating that prolonged tracking is not a prerequisite for a successful attack. These findings provide the first empirical evidence that ad streams serve as a high-fidelity digital footprint, enabling off-platform profiling that inherently bypasses current platform safeguards, highlighting a systemic vulnerability in the ad ecosystem and the urgent need for responsible web AI governance in the generative AI era. The code is available at https://github.com/Breezelled/when-ads-become-profiles.
Preference-Aligned Options from Generative AI Compensates for Age-Related Cognitive Decline in Decision Making
Older adults often experience increased difficulty in decision making due to age-related declines, particularly in contexts that require information search or the generation of alternatives from memory. This study examined whether using generative AI for information search enhances choice satisfaction and reduces choice difficulty among older adults. A total of 130 participants (younger, n = 56; older, n = 74) completed a music-selection task under AI-use and AI-nonuse conditions across two contexts: previously experienced (road trip) and not previously experienced (space travel). In the AI-nonuse condition, participants generated candidate options from memory; in the AI-use condition, GPT-4o presented options tailored to individual preferences. Cognitive functions, including working memory, processing speed, verbal comprehension, and perceptual reasoning, were assessed. Results showed that AI use significantly reduced perceived choice difficulty across age groups, with larger benefits in unfamiliar contexts. Regarding cognitive function, among older adults, lower cognitive function was associated with fewer recalled options, higher choice difficulty, and lower satisfaction in the AI-nonuse condition; these associations were substantially attenuated when AI was used. These results demonstrate that generative AI can mitigate age-related cognitive constraints by reducing the cognitive load associated with information search during decision making. While the use of AI reduced perceived difficulty, choice satisfaction remained unchanged, suggesting that autonomy in decision making was preserved. These findings indicate that generative AI can support everyday decision making by compensating for the constraints in information search that older adults face due to cognitive decline.
Designing and Evaluating AI Margin Notes in Document Reader Software
AI capabilities for document reader software are usually presented in separate chat interfaces. We explore integrating AI into document comments, a concept we formalize as AI margin notes. Three design parameters characterize this approach: margin notes are integrated with the text while chat interfaces are not; selecting text for a margin note can be automated through AI or manual; and the generation of a margin note can involve AI to various degrees. Two experiments investigate integration and selection automation, with results showing participants prefer integrated AI margin notes and manual selection. A third experiment explores human and AI involvement through six alternative techniques. Techniques with less AI involvement resulted in more psychological ownership, but faster and less effortful designs are generally preferred. Surprisingly, the degree of AI involvement had no measurable effect on reading comprehension. Our work shows that AI margin notes are desirable and contributes implications for their design.
Computer Vision and Pattern Recognition
Unique Lives, Shared World: Learning from Single-Life Videos
We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.
SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows
Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior works usually suffer from two limitations. First, they add random noise to training samples or VAE latents as data augmentation, introducing complex pipelines including extra noising and denoising steps. Second, they use a pretrained and frozen VAE encoder, resulting in suboptimal reconstruction and generation quality. In this paper, we find that the two issues can be solved in a very simple way: just fixing the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5). On the one hand, this method allows the encoder to output a broader distribution of tokens and the decoder to learn to reconstruct clean images from the augmented token distribution, avoiding additional noise or denoising design. On the other hand, fixed variance simplifies the VAE evidence lower bound, making it stable to train an NF with a VAE jointly. On the ImageNet $256 \times 256$ generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). Moreover, SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
comment: Project Page: https://qinyu-allen-zhao.github.io/SimFlow/
PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design
Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.
comment: Project page: https://postercopilot.github.io/
Radiance Meshes for Volumetric Reconstruction
We introduce radiance meshes, a technique for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization. Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieves faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms. Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes. Our rendering method exactly evaluates the volume rendering equation and enables high quality, real-time view synthesis on standard consumer hardware. Our tetrahedral meshes also lend themselves to a variety of exciting applications including fisheye lens distortion, physics-based simulation, editing, and mesh extraction.
comment: Website: half-potato.gitlab.io/rm
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
Stable Signer: Hierarchical Sign Language Generative Model
Sign Language Production (SLP) is the process of converting complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, and Pose2Vid stages, and some concentrated on the Prompt2Gloss and Text2Avatar stages. However, the field has progressed slowly because inaccuracies in text conversion, pose generation, and the rendering of poses into real human videos accumulate across these stages. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines SLP as a hierarchical end-to-end generation task that includes only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid. Text understanding is performed by our proposed Sign Language Understanding Linker (SLUL), and hand gestures are generated by an expert block named SLP-MoE, which renders hand gestures to produce high-quality, multi-style sign language videos end to end. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.
comment: 12 pages, 7 figures. More Demo at https://stablesigner.github.io
RELIC: Interactive Video World Model with Long-Horizon Memory
A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
comment: 22 pages
Fast & Efficient Normalizing Flows and Applications of Image Generative Models
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) a more efficient Quad-coupling layer; 3) a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) Inverse-Flow, which uses the inverse of convolution for the forward pass and is trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) the use of Stable Diffusion-based image inpainting to replace detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
comment: PhD Thesis
Jina-VLM: Small Multilingual Vision Language Model
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
comment: 18 pages, 1-7 main content
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
comment: Tech report
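The multi-level pooling idea in the PSA abstract above can be sketched in a few lines: more important KV blocks keep a finer resolution, less important ones are pooled more coarsely. The rank-based allocation rule, the power-of-two window sizes, and the names `assign_pooling_levels` and `pool_block` are all assumptions for illustration; the abstract does not specify how levels are chosen or how pooling is implemented.

```python
def assign_pooling_levels(scores, num_levels=3):
    """Map each KV block's importance score to a pooling level.

    Level 0 keeps the block at full resolution; higher levels pool more
    aggressively. Ranking by score and splitting into equal-size groups is an
    assumption made for illustration, not the allocation rule from the paper.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    levels = [0] * len(scores)
    for rank, block in enumerate(order):
        levels[block] = min(rank * num_levels // len(scores), num_levels - 1)
    return levels


def pool_block(keys, level):
    """Average-pool a block of scalar keys with window size 2**level."""
    window = 2 ** level
    return [sum(keys[i:i + window]) / len(keys[i:i + window])
            for i in range(0, len(keys), window)]
```

Compared with a binary keep-or-drop mask, every block still contributes some information here; only its resolution varies, which is the interpolation between full retention and pruning that the abstract describes.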
C3G: Learning Compact 3D Representations with 2K Gaussians
Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.
comment: Project Page: https://cvlab-kaist.github.io/C3G/
Ultra-lightweight Neural Video Representation Compression
Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrated multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.
Learning Group Actions In Disentangled Latent Image Representations
Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works applying group-theoretic priors or modeling transformations typically operate in the high-dimensional data space, where group actions apply uniformly across the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manual partitioning of latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and apply group actions within the representation space. To address this, we introduce a novel end-to-end framework that for the first time learns group actions on latent image manifolds, automatically discovering transformation-relevant structures without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings. The framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions in diverse data, while downstream classification tasks confirm the effectiveness of the learned representations. Our code is publicly available at https://github.com/farhanaswarnali/Learning-Group-Actions-In-Disentangled-Latent-Image-Representations.
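To make the "learnable binary masks with straight-through estimation" in the abstract above concrete, here is a minimal hand-written sketch. The forward pass applies a hard 0/1 threshold; the backward pass ignores the threshold and uses the sigmoid's derivative as a surrogate, which is one common straight-through variant and an assumption here, since the abstract does not say which surrogate the authors use. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mask_forward(logits):
    """Forward pass: hard 0/1 mask obtained by thresholding sigmoid(logit) at 0.5."""
    return [1.0 if sigmoid(l) > 0.5 else 0.0 for l in logits]


def mask_backward(grad_mask, logits):
    """Backward pass: treat the threshold as identity and use the sigmoid's
    derivative, so gradients still reach the logits."""
    return [g * sigmoid(l) * (1.0 - sigmoid(l)) for g, l in zip(grad_mask, logits)]


def partition(latent, mask):
    """Split a latent vector into transformation-sensitive and invariant parts."""
    varying = [z * m for z, m in zip(latent, mask)]
    invariant = [z * (1.0 - m) for z, m in zip(latent, mask)]
    return varying, invariant
```

In an autograd framework the same effect is usually obtained by detaching the difference between the hard and soft masks; the manual version above only spells out what the forward and backward passes compute.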
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that existing feed-forward reconstruction models, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
comment: Project page: https://cvlab-kaist.github.io/RobustVGGT/
On the Temporality for Sketch Representation Learning
Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection methods, which often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
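The routing described in the DIG abstract above can be sketched as a simple dispatcher: global queries get evenly spaced frames, localized queries get the frames with the highest query relevance. How the query type is decided and how relevance scores are computed are not stated in the abstract, so `query_type` and `relevance` are hypothetical inputs, and the top-k rule is a stand-in for the authors' specialized pipeline.

```python
def uniform_sample(num_frames, k):
    """Pick k frame indices spread evenly across the video."""
    k = min(k, num_frames)
    return [(2 * i + 1) * num_frames // (2 * k) for i in range(k)]


def top_k_relevant(relevance, k):
    """Pick the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])


def select_frames(query_type, num_frames, k, relevance=None):
    """Route global queries to uniform sampling and localized ones to top-k."""
    if query_type == "global" or relevance is None:
        return uniform_sample(num_frames, k)
    return top_k_relevant(relevance, k)
```

The point of the split is cost: the uniform branch needs no per-frame scoring at all, so the expensive relevance computation is only paid for queries that benefit from it.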
Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation
Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new form of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these forms of randomness and their impact on generation, and find that the two exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; and (2) adapting the perturbation intensity selectively based on their frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at https://github.com/xuhang07/TEP-Diffusion.
Artificial Microsaccade Compensation: Stable Vision for an Ornithopter
Animals with foveated vision, including humans, experience microsaccades, small, rapid eye movements that they are not aware of. Inspired by this phenomenon, we develop a method for "Artificial Microsaccade Compensation". It can stabilize video captured by a tailless ornithopter that has resisted attempts to use camera-based sensing because it shakes at 12-20 Hz. Our approach minimizes changes in image intensity by optimizing over 3D rotation represented in SO(3). This results in a stabilized video, computed in real time, suitable for human viewing, and free from distortion. When adapted to hold a fixed viewing orientation, up to occasional saccades, it can dramatically reduce inter-frame motion while also benefiting from an efficient recursive update. When compared to Adobe Premiere Pro's Warp Stabilizer, which is widely regarded as the best commercial video stabilization software available, our method achieves higher quality results while also running in real time.
comment: 29 pages, 5 figures, 2 tables, under review
DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask- and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model's inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation. Project Page: https://frakw.github.io/DirectDrag/. Code is available at: https://github.com/frakw/DirectDrag.
BlurDM: A Blur Diffusion Model for Image Deblurring NeurIPS 2025
Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address this, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at https://github.com/Jin-Ting-He/BlurDM.
comment: NeurIPS 2025
Training for Identity, Inference for Controllability: A Unified Approach to Tuning-Free Face Personalization
Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at https://github.com/lyuPang/UniID
comment: 17 pages, 13 figures
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
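The TempR1 abstract above mentions localization rewards defined between predicted intervals and ground-truth instances without giving their form. A common building block for such rewards is temporal intersection over union, sketched below; treating it as the reward is purely an assumption for illustration, and the function name `temporal_iou` is hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    # Guard against degenerate zero-length intervals.
    return inter / union if union > 0 else 0.0
```

A one-to-one task could use this value directly as a scalar reward, while one-to-many or many-to-many settings would need some matching or aggregation step on top, which is presumably where the task-specific reward designs differ.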
Tada-DIP: Input-adaptive Deep Image Prior for One-shot 3D Image Reconstruction
Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.
comment: 6 pages, 8 figures, 2025 Asilomar Conference on Signals, Systems, and Computers. Code is available at github.com/evanbell02/Tada-DIP/
MUT3R: Motion-aware Updating Transformer for Dynamic 3D Reconstruction
Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image features. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.
Beyond the Ground Truth: Enhanced Supervision for Image Restoration
Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. Code is available at https://github.com/dhryougit/Beyond-the-Ground-Truth.
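The frequency-domain mixup described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 1D signal instead of an image, a single super-resolved variant, a hand-set mask in place of the learned conditional mask generator, and a naive DFT. The names `dft` and `frequency_mixup` are hypothetical.

```python
import cmath

def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustrative only)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out

def frequency_mixup(gt, sr, mask):
    """Blend two equal-length real signals per frequency bin.

    mask[k] = 1 keeps the original ground truth at bin k,
    mask[k] = 0 takes the super-resolved variant instead.
    Assumption: the mask is symmetric (mask[k] == mask[n - k]) so the
    fused signal stays real; only the real part is returned.
    """
    gt_spec, sr_spec = dft(gt), dft(sr)
    fused = [m * g + (1 - m) * s for m, g, s in zip(mask, gt_spec, sr_spec)]
    return [value.real for value in dft(fused, inverse=True)]

gt = [1.0, 2.0, 3.0, 4.0]    # toy "original ground truth"
sr = [1.5, 2.5, 2.0, 4.5]    # toy "super-resolved variant"
# Keep only the DC bin from gt and take every other bin from sr.
enhanced = frequency_mixup(gt, sr, mask=[1, 0, 0, 0])
```

With a mask of all ones the output reduces to the original ground truth, and with all zeros it reduces to the super-resolved variant, so the mask interpolates between the two per frequency.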
UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework
We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the ability of LLMs to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, demonstrating the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.
comment: https://carlyx.github.io/UniMo/
Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence SC
The remarkable success in text-to-image diffusion models has motivated extensive investigation of their potential for video applications. Zero-shot techniques aim to adapt image diffusion models for videos without requiring further model training. Recent methods largely emphasize integrating inter-frame correspondence into attention mechanisms. However, the soft constraint applied to identify the valid features to attend is insufficient, which could lead to temporal inconsistency. In this paper, we present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. This enhancement ensures a consistent transformation of semantically similar content between frames. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video, significantly enhancing the visual coherence of manipulated videos. We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing. Comprehensive experiments demonstrate the effectiveness of our framework in generating high-quality, coherent videos, highlighting a significant advance over current zero-shot methods.
comment: Code: https://github.com/Sunnycookies/FRESCO-v2, Project: https://williamyang1991.github.io/projects/FRESCOv2/
Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy
Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objective and accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain-agnostic features and enhance robustness to imaging variations. Dual cross-attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA and Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76\% $\pm$ 0.04), sensitivity (90.07\% $\pm$ 0.08), and specificity (72.86\% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning.
comment: 6 pages, 5 figures, 1 table, submitted to ISBI conference
An Automated Framework for Large-Scale Graph-Based Cerebrovascular Analysis
We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.
comment: Submitted to ISBI 2026. 6 pages, 6 figures
Diminishing Returns in Self-Supervised Learning
While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast number of parameters and a large amount of training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that pre-training and fine-tuning always help our model but have diminishing returns, whereas intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.
Prostate biopsy whole slide image dataset from an underrepresented Middle Eastern population
Artificial intelligence (AI) is increasingly used in digital pathology. Publicly available histopathology datasets remain scarce, and those that do exist predominantly represent Western populations. Consequently, the generalizability of AI models to populations from less digitized regions, such as the Middle East, is largely unknown. This motivates the public release of our dataset to support the development and validation of pathology AI models across globally diverse populations. We present 339 whole-slide images of prostate core needle biopsies from a consecutive series of 185 patients collected in Erbil, Iraq. The slides are associated with Gleason scores and International Society of Urological Pathology grades assigned independently by three pathologists. Scanning was performed using two high-throughput scanners (Leica and Hamamatsu) and one compact scanner (Grundium). All slides were de-identified and are provided in their native formats without further conversion. The dataset enables grading concordance analyses, color normalization, and cross-scanner robustness evaluations. Data will be deposited in the Bioimage Archive (BIA) under accession code: to be announced (TBA), and released under a CC BY 4.0 license.
comment: 13 pages, 2 figures and 1 table
Traffic Image Restoration under Adverse Weather via Frequency-Aware Mamba
Traffic image restoration under adverse weather conditions remains a critical challenge for intelligent transportation systems. Existing methods primarily focus on spatial-domain modeling but neglect frequency-domain priors. Although the emerging Mamba architecture excels at long-range dependency modeling through patch-wise correlation analysis, its potential for frequency-domain feature extraction remains unexplored. To address this, we propose Frequency-Aware Mamba (FAMamba), a novel framework that integrates frequency guidance with sequence modeling for efficient image restoration. Our architecture consists of two key components: (1) a Dual-Branch Feature Extraction Block (DFEB) that enhances local-global interaction via bidirectional 2D frequency-adaptive scanning, dynamically adjusting traversal paths based on sub-band texture distributions; and (2) a Prior-Guided Block (PGB) that refines texture details through wavelet-based high-frequency residual learning, enabling high-quality image reconstruction with precise details. Meanwhile, we design a novel Adaptive Frequency Scanning Mechanism (AFSM) for the Mamba architecture, which enables the Mamba to achieve frequency-domain scanning across distinct subgraphs, thereby fully leveraging the texture distribution characteristics inherent in subgraph structures. Extensive experiments demonstrate the efficiency and effectiveness of FAMamba.
comment: 12 pages, 13 figures, 5 tables
PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation
Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region-overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output, allowing the model to progress from pixels to structures and finally to clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation-style cardiac analysis framework.
CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation
Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA
comment: 34 pages, 24 figures
Heatmap Pooling Network for Action Recognition from RGB Videos
Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise, and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust, and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate clear performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks, namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human, consistently verify the effectiveness of our HP-Net, which outperforms existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.
comment: Final Version of IEEE Transactions on Pattern Analysis and Machine Intelligence
Lean Unet: A Compact Model for Image Segmentation
Unet and its variations have been standard in semantic image segmentation, especially for computer assisted radiology. Current Unet architectures iteratively downsample spatial resolution while increasing channel dimensions to preserve information content. Such a structure demands a large memory footprint, limiting training batch sizes and increasing inference latency. Channel pruning compresses Unet architecture without accuracy loss, but requires lengthy optimization and may not generalize across tasks and datasets. By investigating Unet pruning, we hypothesize that the final structure is the crucial factor, not the channel selection strategy of pruning. Based on our observations, we propose a lean Unet architecture (LUnet) with a compact, flat hierarchy where channels are not doubled as resolution is halved. We evaluate on a public MRI dataset allowing comparable reporting, as well as on two internal CT datasets. We show that a state-of-the-art pruning solution (STAMP) mainly prunes from the layers with the highest number of channels. Comparatively, simply eliminating a random channel at the pruning-identified layer or at the largest layer achieves similar or better performance. Our proposed LUnet with fixed architectures and over 30 times fewer parameters achieves performance comparable to both conventional Unet counterparts and data-adaptively pruned networks. The proposed lean Unet with constant channel count across layers requires far fewer parameters while achieving performance superior to standard Unet for the same total number of parameters. Skip connections allow Unet bottleneck channels to be largely reduced, unlike standard encoder-decoder architectures requiring increased bottleneck channels for information propagation.
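The effect of not doubling channels as resolution is halved can be illustrated with a rough parameter count. The sketch below is not the LUnet architecture: the channel widths, the five levels, and the two 3x3 convolutions per level are assumptions chosen for illustration, and `conv3x3_params` and `encoder_params` are hypothetical helper names.

```python
def conv3x3_params(c_in, c_out):
    """Weights plus biases of one 3x3 convolution."""
    return 3 * 3 * c_in * c_out + c_out

def encoder_params(channels, in_channels=1):
    """Parameter count of an encoder with two 3x3 convolutions per level."""
    total, prev = 0, in_channels
    for c in channels:
        total += conv3x3_params(prev, c) + conv3x3_params(c, c)
        prev = c
    return total

doubling = [64, 128, 256, 512, 1024]  # channels double as resolution halves
flat = [64] * 5                       # constant channel count at every level

standard = encoder_params(doubling)
lean = encoder_params(flat)
print(f"doubling: {standard:,} parameters")
print(f"flat:     {lean:,} parameters")
print(f"ratio:    {standard / lean:.1f}x")
```

Because the parameter count of a convolution grows with the product of its input and output channels, the deepest levels dominate the total in the doubling layout, which is consistent with the observation that pruning mainly removes channels from the widest layers.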
A Robust Camera-based Method for Breath Rate Measurement
The proliferation of cheap and accessible cameras makes it possible to measure a subject's breath rate from video footage alone. Recent works on this topic have proposed a variety of approaches for accurately measuring human breath rate; however, they are either tested in near-ideal conditions or produce results that are not sufficiently accurate. The present study proposes a more robust method that measures breath rate in humans with minimal hardware requirements using a combination of mathematical transforms, with a relative deviation from the ground truth of less than 5%. The method was tested on videos taken from 14 volunteers with a total duration of over 2 hours 30 minutes. The obtained results were compared to reference data, and the average mean absolute error was found to be 0.57 respirations per minute, which is noticeably better than the results from previous works. The proposed breath rate measurement method is more resistant to distortions caused by subject movement and thus allows one to remotely measure the subject's breath rate without any significant limitations on the subject's behavior.
comment: 9 pages, 4 figures, 2 tables
HieroGlyphTranslator: Automatic Recognition and Translation of Egyptian Hieroglyphs to English
Egyptian hieroglyphs, the ancient Egyptian writing system, are composed entirely of drawings. Translating these glyphs into English poses various challenges, including the fact that a single glyph can have multiple meanings. Deep learning translation applications are evolving rapidly, producing remarkable results that significantly impact our lives. In this research, we propose a method for the automatic recognition and translation of ancient Egyptian hieroglyphs from images to English. This study utilized two datasets for classification and translation: the Morris Franken dataset and the EgyptianTranslation dataset. Our approach is divided into three stages: segmentation (using Contour and Detectron2), mapping symbols to Gardiner codes, and translation (using the CNN model). The model achieved a BLEU score of 42.2, a significant result compared to previous research.
LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling
The Visual Autoregressive (VAR) modeling approach to image generation performs autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while accelerating synthesis. However, parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To mitigate this, we propose Latent Scale Rejection Sampling (LSRS), a method that progressively refines token maps in the latent scale during inference to enhance VAR models. Our method uses a lightweight scoring model to evaluate multiple candidate token maps sampled at each scale, selecting the highest-quality map to guide subsequent scale generation. By prioritizing early scales critical for structural coherence, LSRS effectively mitigates autoregressive error accumulation while maintaining computational efficiency. Experiments demonstrate that LSRS significantly improves VAR's generation quality with minimal additional computational overhead. For the VAR-d30 model, LSRS increases the inference time by merely 1% while reducing its FID score from 1.95 to 1.78. When the inference time is increased by 15%, the FID score can be further reduced to 1.66. LSRS offers an efficient test-time scaling solution for enhancing VAR-based generation.
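The per-scale candidate selection described above can be sketched as a best-of-N loop. This is an illustrative toy, not the LSRS implementation: `sample_token_map` stands in for the VAR model, `score` stands in for the lightweight scoring model, and the per-scale candidate budget (more candidates at early scales) is an assumed way to express the prioritization of early scales.

```python
import random

def sample_token_map(prefix, size, rng):
    """Stand-in for the VAR model: draws `size` token ids in parallel."""
    return [rng.randrange(16) for _ in range(size)]

def score(prefix, token_map):
    """Stand-in for the scoring model (higher is better).

    Toy criterion: prefer maps whose neighbouring tokens are similar.
    """
    return -sum(abs(a - b) for a, b in zip(token_map, token_map[1:]))

def generate(scale_sizes, candidates_per_scale, seed=0):
    """Coarse-to-fine generation that keeps the best candidate per scale."""
    rng = random.Random(seed)
    prefix = []  # selected token maps, one per scale
    for size, n_candidates in zip(scale_sizes, candidates_per_scale):
        candidates = [
            sample_token_map(prefix, size, rng) for _ in range(n_candidates)
        ]
        best = max(candidates, key=lambda tm: score(prefix, tm))
        prefix.append(best)  # later scales are conditioned on this choice
    return prefix

# Spend more candidates on early, structure-defining scales (assumed budget).
token_maps = generate(scale_sizes=[1, 4, 9], candidates_per_scale=[4, 2, 1])
```

Setting every entry of `candidates_per_scale` to 1 recovers plain sampling, so the candidate budget is the knob that trades extra inference time for quality.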
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
comment: 15 pages, 9 figures
Research on Brain Tumor Classification Method Based on Improved ResNet34 Network
Previously, image interpretation in radiology relied heavily on manual methods. However, manual classification of brain tumor medical images is time-consuming and labor-intensive. Even with shallow convolutional neural network models, the accuracy is not ideal. To improve the efficiency and accuracy of brain tumor image classification, this paper proposes a brain tumor classification model based on an improved ResNet34 network. This model uses the ResNet34 residual network as the backbone network and incorporates multi-scale feature extraction. It uses a multi-scale input module as the first layer of the ResNet34 network and an Inception v2 module as the residual downsampling layer. Furthermore, a channel attention mechanism module assigns different weights to different channels of the image from a channel domain perspective, obtaining more important feature information. The results of a five-fold cross-validation experiment show that the average classification accuracy of the improved network model is approximately 98.8%, which is 1% higher than that of ResNet34, while the model has only 80% of the parameters of the original model. Therefore, the improved network model both improves accuracy and reduces model size, achieving better classification with fewer parameters.
Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models WACV 2026
Text-to-image (T2I) diffusion models have achieved widespread success due to their ability to generate high-resolution, photorealistic images. These models are trained on large-scale datasets, like LAION-5B, often scraped from the internet. However, since this data contains numerous biases, the models inherently learn and reproduce them, resulting in stereotypical outputs. We introduce SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. SelfDebias identifies semantic clusters in an image encoder's embedding space and uses these clusters to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. Unlike supervised approaches, SelfDebias does not require human-annotated datasets or external classifiers trained for each generated concept. Instead, it is designed to automatically identify semantic modes. Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including both conditional and unconditional models. It effectively debiases images not only along key demographic dimensions, while maintaining the visual fidelity of the generated images, but also along more abstract concepts for which identifying biases is challenging.
comment: Accepted at WACV 2026
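The debiasing objective in the abstract above, a KL divergence between the output distribution over semantic clusters and the uniform distribution, can be made concrete with a small sketch. It is only an illustration of the quantity, not the SelfDebias guidance procedure: it assumes hard cluster assignments summarized as counts and the direction KL(p || uniform), neither of which is specified in the abstract, and `kl_to_uniform` is a hypothetical name.

```python
import math

def kl_to_uniform(counts):
    """KL(p || u) where p is the empirical cluster distribution, u is uniform.

    KL(p || u) = sum_k p_k * log(p_k / (1 / K)) = sum_k p_k * log(p_k * K).
    Clusters with zero count contribute nothing (0 * log 0 := 0).
    """
    total = sum(counts)
    k = len(counts)
    kl = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            kl += p * math.log(p * k)
    return kl

balanced = kl_to_uniform([25, 25, 25, 25])  # evenly spread outputs
skewed = kl_to_uniform([90, 5, 3, 2])       # outputs collapse onto one mode
print(balanced, skewed)
```

The value is zero exactly when generated samples are spread evenly across clusters and reaches its maximum, log K, when all samples fall into one cluster, which is why driving it down pushes generation toward balanced modes.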
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification
The two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then performs cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned in the single-modality training naturally propagate into the subsequent cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced and leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL can enable modality-invariant feature learning and a more generalizable model.
Out-of-the-box: Black-box Causal Attacks on Object Detectors
Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. BlackCAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of different sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less perceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original image, and hence imperceptible, demonstrating the power of causal pixels.
PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
DINO-RotateMatch: A Rotation-Aware Deep Framework for Robust Image Matching in Large-Scale 3D Reconstruction
This paper presents DINO-RotateMatch, a deep-learning framework designed to address the challenges of image matching in large-scale 3D reconstruction from unstructured Internet images. The method integrates a dataset-adaptive image pairing strategy with rotation-aware keypoint extraction and matching. DINO is employed to retrieve semantically relevant image pairs in large collections, while rotation-based augmentation captures orientation-dependent local features using ALIKED and LightGlue. Experiments on the Kaggle Image Matching Challenge 2025 demonstrate consistent improvements in mean Average Accuracy (mAA), achieving a Silver Award (47th of 943 teams). The results confirm that combining self-supervised global descriptors with rotation-enhanced local matching offers a robust and scalable solution for large-scale 3D reconstruction.
comment: 9 pages, 5 figures, 1 table
Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images
Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS); it models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
Active Visual Perception: Opportunities and Challenges
Active visual perception refers to the ability of a system to dynamically engage with its environment through sensing and action, allowing it to modify its behavior in response to specific goals or uncertainties. Unlike passive systems that rely solely on visual data, active visual perception systems can direct attention, move sensors, or interact with objects to acquire more informative data. This approach is particularly powerful in complex environments where static sensing methods may not provide sufficient information. Active visual perception plays a critical role in numerous applications, including robotics, autonomous vehicles, human-computer interaction, and surveillance systems. However, despite its significant promise, there are several challenges that need to be addressed, including real-time processing of complex visual data, decision-making in dynamic environments, and integrating multimodal sensory inputs. This paper explores both the opportunities and challenges inherent in active visual perception, providing a comprehensive overview of its potential, current research, and the obstacles that must be overcome for broader adoption.
GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces
3D stylization is central to game development, virtual reality, and digital arts, where the demand for diverse assets calls for scalable methods that support fast, high-fidelity manipulation. Existing text-to-3D stylization methods typically distill from 2D image editors, requiring time-intensive per-asset optimization and exhibiting multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production. In this paper, we introduce GaussianBlender, a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference. Our method learns structured, disentangled latent spaces with controlled information sharing for geometry and appearance from spatially-grouped 3D Gaussians. A latent diffusion model then applies text-conditioned edits on these learned representations. Comprehensive evaluations show that GaussianBlender not only delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization, but also surpasses methods that require per-instance test-time optimization - unlocking practical, democratized 3D stylization at scale.
ConvRot: Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers
Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26$\times$ speedup and 4.05$\times$ memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
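The outlier-smoothing idea behind rotation-based quantization can be illustrated with a minimal sketch. It is not the ConvRot implementation: it assumes a single group of eight values, a Sylvester-constructed Hadamard matrix scaled to be orthonormal, and a symmetric 4-bit quantizer whose step is set by the largest magnitude in the group. The names `hadamard`, `rotate`, and `int4_step` are hypothetical.

```python
import math

def hadamard(n):
    """Sylvester Hadamard matrix of size n (a power of two), scaled to be orthonormal."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Block form [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]

def rotate(vec, h):
    """Multiply the matrix h by the vector vec."""
    return [sum(a * v for a, v in zip(row, vec)) for row in h]

def int4_step(vec):
    """Step size of an assumed symmetric 4-bit quantizer scaled by the largest magnitude."""
    return max(abs(v) for v in vec) / 7.0

# One group with a single large outlier (illustrative values, not from the paper).
group = [8.0, 0.1, -0.2, 0.3, 0.1, -0.1, 0.2, -0.3]
h = hadamard(len(group))
rotated = rotate(group, h)
# The scaled Sylvester matrix is symmetric and orthogonal, so it is its own inverse.
restored = rotate(rotated, h)
```

In this sketch the rotation spreads the outlier's energy across the whole group, so the largest magnitude, and with it the quantization step, shrinks while the vector norm is preserved and the rotation can be undone exactly (up to floating-point error).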
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy - evolving from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.
comment: Technical report
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.
comment: 26 pages
Multi-Scale Visual Prompting for Lightweight Small-Image Classification
Visual prompting has recently emerged as an efficient strategy to adapt vision models using lightweight, learnable parameters injected into the input space. However, prior work mainly targets large Vision Transformers and high-resolution datasets such as ImageNet. In contrast, small-image benchmarks like MNIST, Fashion-MNIST, and CIFAR-10 remain widely used in education, prototyping, and research, yet have received little attention in the context of prompting. In this paper, we introduce Multi-Scale Visual Prompting (MSVP), a simple and generic module that learns a set of global, mid-scale, and local prompt maps fused with the input image via a lightweight $1 \times 1$ convolution. MSVP is backbone-agnostic, adds less than $0.02\%$ parameters, and significantly improves performance across CNN and Vision Transformer backbones. We provide a unified benchmark on MNIST, Fashion-MNIST, and CIFAR-10 using a simple CNN, ResNet-18, and a small Vision Transformer. Our method yields consistent improvements with negligible computational overhead. We further include ablations on prompt scales, fusion strategies, and backbone architectures, along with qualitative analyses using prompt visualizations and Grad-CAM. Our results demonstrate that multi-scale prompting provides an effective inductive bias even on low-resolution images.
Cyclical Temporal Encoding and Hybrid Deep Ensembles for Multistep Energy Forecasting
Accurate electricity consumption forecasting is essential for demand management and smart grid operations. This paper introduces a unified deep learning framework that integrates cyclical temporal encoding with hybrid LSTM-CNN architectures to enhance multistep energy forecasting. We systematically transform calendar-based attributes using sine-cosine encodings to preserve periodic structure and evaluate their predictive relevance through correlation analysis. To exploit both long-term seasonal effects and short-term local patterns, we employ an ensemble model composed of an LSTM, a CNN, and a meta-learner of MLP regressors specialized for each forecast horizon. Using a one-year national consumption dataset, we conduct an extensive experimental study including ablation analyses with and without cyclical encodings and calendar features and comparisons with established baselines from the literature. Results demonstrate consistent improvements across all seven forecast horizons, with our hybrid model achieving lower RMSE and MAE than individual architectures and prior methods. These findings confirm the benefit of combining cyclical temporal representations with complementary deep learning structures. To our knowledge, this is the first work to jointly evaluate temporal encodings, calendar-based features, and hybrid ensemble architectures within a unified short-term energy forecasting framework.
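As an illustration of the sine-cosine transform of calendar attributes mentioned above, the following minimal sketch maps a periodic value onto the unit circle. The function names and the hour-of-day example are assumptions for illustration; the abstract does not state which attributes or periods the authors encode.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic calendar value to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def encoded_distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hours 23 and 0 are adjacent on the clock; as raw integers they are 23 apart.
h23 = cyclical_encode(23, 24)
h0 = cyclical_encode(0, 24)
h12 = cyclical_encode(12, 24)
```

With this encoding, hour 23 ends up close to hour 0 while hour 12 sits on the opposite side of the circle, which is the periodic structure that a raw integer feature would lose.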
Optical Context Compression Is Just (Bad) Autoencoding
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
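To make the parameter-free mean-pooling baseline concrete, here is a minimal sketch, assuming token embeddings are plain lists of floats and that compression averages each run of `ratio` consecutive vectors. The function name and the handling of a shorter final window are assumptions, not details taken from the paper or its repository.

```python
def mean_pool(embeddings, ratio):
    """Average each run of `ratio` consecutive vectors into a single vector."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    pooled = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        # A shorter final window is averaged over the vectors it actually holds.
        pooled.append([sum(v[i] for v in window) / len(window) for i in range(dim)])
    return pooled

# Three 2-dimensional token embeddings compressed at a ratio of 2.
compressed = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 2)
```

The operation has no learned parameters, which is what makes it a useful reference point when asking whether a vision encoder adds anything at a matched compression ratio.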
MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms
Deep convolutional neural networks (DCNNs) have substantially advanced object detection capabilities, particularly in remote sensing imagery. However, challenges persist, especially in detecting small objects where the high resolution of these images and the small size of target objects often result in a loss of critical information in the deeper layers of conventional CNNs. Additionally, the extensive spatial redundancy and intricate background details typical in remote-sensing images tend to obscure these small targets. To address these challenges, we introduce Multi-Kernel Selection Network (MKSNet), a network architecture featuring a novel Multi-Kernel Selection mechanism. The MKS mechanism utilizes large convolutional kernels to effectively capture an extensive range of contextual information. This design allows for adaptive kernel size selection, significantly enhancing the network's ability to dynamically process and emphasize crucial spatial details for small object detection. Furthermore, MKSNet incorporates a dual attention mechanism, merging spatial and channel attention modules. The spatial attention module adaptively fine-tunes the spatial weights of feature maps, focusing more intensively on relevant regions while mitigating background noise. Simultaneously, the channel attention module optimizes channel information selection, improving feature representation and detection accuracy. Empirical evaluations on the DOTA-v1.0 and HRSC2016 benchmarks demonstrate that MKSNet substantially surpasses existing state-of-the-art models in detecting small objects in remote sensing images. These results highlight MKSNet's superior ability to manage the complexities associated with multi-scale and high-resolution image data, confirming its effectiveness and innovation in remote sensing object detection.
FeatureLens: A Highly Generalizable and Interpretable Framework for Detecting Adversarial Examples Based on Image Features
Despite the remarkable performance of deep neural networks (DNNs) in image classification, their vulnerability to adversarial attacks remains a critical challenge. Most existing detection methods rely on complex, opaque architectures, which compromise interpretability and generalization. To address this, we propose FeatureLens, a lightweight framework that acts as a lens to scrutinize anomalies in image features. Comprising an Image Feature Extractor (IFE) and shallow classifiers (e.g., SVM, MLP, or XGBoost) with model sizes ranging from 1,000 to 30,000 parameters, FeatureLens achieves high detection accuracy ranging from 97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation across FGSM, PGD, CW, and DAmageNet attacks, using only 51-dimensional features. By combining strong detection performance with excellent generalization, interpretability, and computational efficiency, FeatureLens offers a practical pathway toward transparent and effective adversarial defense.
ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation
We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.
comment: Project page: https://recamdriving.github.io/
LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications.
Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding NeurIPS 2025
Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d-web/.
comment: Accepted to NeurIPS 2025
Memory-Guided Point Cloud Completion for Dental Reconstruction
Partial dental point clouds often suffer from large missing regions caused by occlusion and limited scanning views, which bias encoder-only global features and force decoders to hallucinate structures. We propose a retrieval-augmented framework for tooth completion that integrates a prototype memory into standard encoder-decoder pipelines. After encoding a partial input into a global descriptor, the model retrieves the nearest manifold prototype from a learnable memory and fuses it with the query feature through confidence-gated weighting before decoding. The memory is optimized end-to-end and self-organizes into reusable tooth-shape prototypes without requiring tooth-position labels, thereby providing structural priors that stabilize missing-region inference and free decoder capacity for detail recovery. The module is plug-and-play and compatible with common completion backbones, while keeping the same training losses. Experiments on a self-processed Teeth3DS benchmark demonstrate consistent improvements in Chamfer Distance, with visualizations showing sharper cusps, ridges, and interproximal transitions. Our approach provides a simple yet effective way to exploit cross-sample regularities for more accurate and faithful dental point-cloud completion.
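The retrieve-then-fuse step can be sketched roughly as follows. This is an illustrative guess rather than the authors' module: it assumes cosine similarity for nearest-prototype lookup and uses the clamped similarity itself as the confidence gate, whereas the paper's memory and gate are learned end-to-end. All names here are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_and_fuse(query, memory):
    """Pick the most similar prototype and blend it with the query feature."""
    sims = [cosine(query, proto) for proto in memory]
    best = max(range(len(memory)), key=sims.__getitem__)
    # Assumed gate: similarity clamped to [0, 1]; a learned gate would replace this.
    gate = min(1.0, max(0.0, sims[best]))
    fused = [gate * p + (1.0 - gate) * q for p, q in zip(memory[best], query)]
    return best, gate, fused

# Toy descriptor and a two-slot memory (illustrative values only).
best, gate, fused = retrieve_and_fuse([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0]])
```

When the retrieved prototype matches the query poorly, the gate falls toward zero and the fused feature stays close to the query, which is one simple way to keep an unreliable prior from overriding the encoder's descriptor.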
HBFormer: A Hybrid-Bridge Transformer for Microtumor and Miniature Organ Segmentation
Medical image segmentation is a cornerstone of modern clinical diagnostics. While Vision Transformers that leverage shifted window-based self-attention have established new benchmarks in this field, they are often hampered by a critical limitation: their localized attention mechanism struggles to effectively fuse local details with global context. This deficiency is particularly detrimental to challenging tasks such as the segmentation of microtumors and miniature organs, where both fine-grained boundary definition and broad contextual understanding are paramount. To address this gap, we propose HBFormer, a novel Hybrid-Bridge Transformer architecture. The 'Hybrid' design of HBFormer synergizes a classic U-shaped encoder-decoder framework with a powerful Swin Transformer backbone for robust hierarchical feature extraction. The core innovation lies in its 'Bridge' mechanism, a sophisticated nexus for multi-scale feature integration. This bridge is architecturally embodied by our novel Multi-Scale Feature Fusion (MFF) decoder. Departing from conventional symmetric designs, the MFF decoder is engineered to fuse multi-scale features from the encoder with global contextual information. It achieves this through a synergistic combination of channel and spatial attention modules, which are constructed from a series of dilated and depth-wise convolutions. These components work in concert to create a powerful feature bridge that explicitly captures long-range dependencies and refines object boundaries with exceptional precision. Comprehensive experiments on challenging medical image segmentation datasets, including multi-organ, liver tumor, and bladder tumor benchmarks, demonstrate that HBFormer achieves state-of-the-art results, showcasing its outstanding capabilities in microtumor and miniature organ segmentation. Code and models are available at: https://github.com/lzeeorno/HBFormer.
comment: 6 pages, 4 figures, 3 tables
CloseUpAvatar: High-Fidelity Animatable Full-Body Avatars with Mixture of Multi-Scale Textures
We present CloseUpAvatar, a novel approach to articulated human avatar representation that handles more general camera motions while preserving rendering quality for close-up views. CloseUpAvatar represents an avatar as a set of textured planes with two sets of learnable textures for low- and high-frequency detail. The method automatically switches to high-frequency textures only for cameras positioned close to the avatar's surface and gradually reduces their impact as the camera moves farther away. Such parametrization of the avatar enables CloseUpAvatar to adjust rendering quality based on camera distance, ensuring realistic rendering across a wider range of camera orientations than previous approaches. We provide experiments using the ActorsHQ dataset with high-resolution input images. CloseUpAvatar demonstrates both qualitative and quantitative improvements over existing methods in rendering from a wide range of novel camera positions, while maintaining high FPS by limiting the number of required primitives.
Harnessing Hypergraphs in Geometric Deep Learning for 3D RNA Inverse Folding
The RNA inverse folding problem, a key challenge in RNA design, involves identifying nucleotide sequences that can fold into desired secondary structures, which are critical for ensuring molecular stability and function. The inherent complexity of this task stems from the intricate relationship between sequence and structure, making it particularly challenging. In this paper, we propose a framework, named HyperRNA, a generative model with an encoder-decoder architecture that leverages hypergraphs to design RNA sequences. Specifically, our HyperRNA model consists of three main components: preprocessing, encoding and decoding. In the preprocessing stage, graph structures are constructed by extracting the atom coordinates of RNA backbone based on 3-bead coarse-grained representation. The encoding stage processes these graphs, capturing higher order dependencies and complex biomolecular interactions using an attention embedding module and a hypergraph-based encoder. Finally, the decoding stage generates the RNA sequence in an autoregressive manner. We conducted quantitative and qualitative experiments on the PDBBind and RNAsolo datasets to evaluate the inverse folding task for RNA sequence generation and RNA-protein complex sequence generation. The experimental results demonstrate that HyperRNA not only outperforms existing RNA design methods but also highlights the potential of leveraging hypergraphs in RNA engineering.
Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation
Handling fast, complex, and highly non-linear motion patterns has long posed challenges for video frame interpolation. Although recent diffusion-based approaches improve upon traditional optical-flow-based methods, they still struggle to cover diverse application scenarios and often fail to produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation. To address these limitations, we introduce BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework that can be guided by audio/visual semantics. First, we enhance the input design of the interpolation model so that it can flexibly handle multiple conditional modalities, including text, audio, images, and video. Second, we propose a decoupled multimodal fusion mechanism that sequentially injects different conditional signals into a DiT backbone. Finally, to maintain the generation abilities of the foundation model, we adopt a progressive multi-stage training paradigm, where the start-end frame difference embedding is used to dynamically adjust both the data sampling and the loss weighting. Extensive experimental results demonstrate that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation tasks, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning.
Dynamic Optical Test for Bot Identification (DOT-BI): A simple check to identify bots in surveys and online processes
We propose the Dynamic Optical Test for Bot Identification (DOT-BI): a quick and easy method that uses human perception of motion to differentiate between human respondents and automated systems in surveys and online processes. In DOT-BI, a 'hidden' number is displayed with the same random black-and-white pixel texture as its background. Only the difference in motion and scale between the number and the background makes the number perceptible to humans across frames, while frame-by-frame algorithmic processing yields no meaningful signal. We conducted two preliminary assessments. Firstly, state-of-the-art, video-capable, multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) fail to extract the correct value, even when given explicit instructions about the mechanism. Secondly, in an online survey (n=182), 99.5% (181/182) of participants solved the task, with an average end-to-end completion time of 10.7 seconds; a supervised lab study (n=39) found no negative effects on perceived ease-of-use or completion time relative to a control. We release code to generate tests and 100+ pre-rendered variants to facilitate adoption in surveys and online processes.
Cross-Stain Contrastive Learning for Paired Immunohistochemistry and Histopathology Slide Representation Learning
Universal, transferable whole-slide image (WSI) representations are central to computational pathology. Incorporating multiple markers (e.g., immunohistochemistry, IHC) alongside H&E enriches H&E-based features with diverse, biologically meaningful information. However, progress is limited by the scarcity of well-aligned multi-stain datasets. Inter-stain misalignment shifts corresponding tissue across slides, hindering consistent patch-level features and degrading slide-level embeddings. To address this, we curated a slide-level aligned, five-stain dataset (H&E, HER2, KI67, ER, PGR) to enable paired H&E-IHC learning and robust cross-stain representation. Leveraging this dataset, we propose Cross-Stain Contrastive Learning (CSCL), a two-stage pretraining framework with a lightweight adapter trained using patch-wise contrastive alignment to improve the compatibility of H&E features with corresponding IHC-derived contextual cues, and slide-level representation learning with Multiple Instance Learning (MIL), which uses a cross-stain attention fusion module to integrate stain-specific patch features and a cross-stain global alignment module to enforce consistency among slide-level embeddings across different stains. Experiments on cancer subtype classification, IHC biomarker status classification, and survival prediction show consistent gains, yielding high-quality, transferable H&E slide-level representations. The code and data are available at https://github.com/lily-zyz/CSCL.
comment: 6 pages, 2 figures. Camera-ready version accepted for IEEE BIBM 2025
UniComp: Rethinking Video Compression Through Informational Uniqueness
Distinct from attention-based compression methods, this paper presents an information-uniqueness-driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from an information-theoretic perspective, we formulate vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness, which measures intrinsic redundancy among tokens and links it to reconstruction error. Based on uniqueness, we design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
Global-Local Aware Scene Text Editing
Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure, joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use an affine fusion to fill target text images into the editing patch while maintaining their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
GAOT: Generating Articulated Objects Through Text-Guided Diffusion Models ACM MM
Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts by leveraging diffusion models and hypergraph learning. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects, represented as graph edges, are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.
comment: Accepted by ACM MM Asia 2026
CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding SP
The rise of Large Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git
comment: Accepted at SIGSPATIAL 2025 (Best paper candidates), 15 pages
RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL
Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates "endogenous" rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching KDD 2026
Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
comment: Accepted at KDD 2026
V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention
Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene", which leads to the "over-intervention" problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
comment: Accepted by ACM Multimedia 2025
Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
comment: Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
MSG-Loc: Multi-Label Likelihood-based Semantic Graph Matching for Object-Level Global Localization
Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.
comment: Accepted in IEEE Robotics and Automation Letters (2025)
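The context-aware likelihood propagation described in the MSG-Loc abstract above can be illustrated with a small sketch. The abstract only says that each node's likelihood is combined with the maximum likelihood of its neighbors; the use of a product as the combination rule, the function name `propagate_likelihoods`, and the dictionary-based graph layout are all assumptions made for illustration, not the authors' formulation.

```python
def propagate_likelihoods(node_likelihood, neighbors):
    """Combine each node's likelihood with the best likelihood among its neighbors.

    node_likelihood: dict mapping node id -> likelihood in [0, 1]
    neighbors: dict mapping node id -> list of adjacent node ids
    The product used here is an assumed combination rule.
    """
    propagated = {}
    for node, own in node_likelihood.items():
        neighbor_values = [node_likelihood[n] for n in neighbors.get(node, [])]
        # A node without neighbors keeps its own likelihood unchanged.
        best_neighbor = max(neighbor_values) if neighbor_values else 1.0
        propagated[node] = own * best_neighbor
    return propagated


# Hypothetical three-node graph: "a" is connected to "b" and "c".
likelihoods = {"a": 0.5, "b": 0.8, "c": 0.2}
graph = {"a": ["b", "c"], "b": ["a"], "c": []}
scores = propagate_likelihoods(likelihoods, graph)
```

Under this rule, a node surrounded by well-supported neighbors keeps more of its score than one whose neighbors are all weak matches, which is the intuition behind using semantic context to disambiguate associations.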
FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying way. With these improvements, we demonstrate for the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available at https://shandaai.github.io/FloodDiffusion/
comment: 15 pages, 7 figures
M3DR: Towards Universal Multilingual Multimodal Document Retrieval
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
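The two retrieval paradigms named in the M3DR abstract differ in how a query-document score is computed. The sketch below contrasts single-vector cosine similarity with ColBERT-style late interaction (MaxSim), where each query token is matched to its most similar document token and the maxima are summed. The function names and the toy two-dimensional vectors are hypothetical; this is not the NetraEmbed or ColNetraEmbed implementation.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dense_score(query_vec, doc_vec):
    """Single dense vector paradigm: one cosine similarity per pair."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens, doc_tokens):
    """Multi-vector paradigm: sum over query tokens of the best document-token match."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy embeddings, chosen only to make the arithmetic easy to follow.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
late_interaction = maxsim_score(query_tokens, doc_tokens)
```

The multi-vector score keeps token-level detail at the cost of storing one vector per token, whereas the dense score needs a single vector per document.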
CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving
Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and a latent-space factor graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
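The confidence-weighted k-medoids step mentioned in the CSMapping abstract can be illustrated by its core update: within one cluster, the medoid is the member that minimizes the confidence-weighted sum of distances to all other members. The sketch below shows only that single-cluster update on 2D points. Treating each trajectory as a point, using Euclidean distance, and the name `weighted_medoid` are simplifying assumptions, not details taken from the paper.

```python
import math


def weighted_medoid(points, weights):
    """Return the index of the point minimizing the weighted sum of distances.

    points: list of coordinate tuples (stand-ins for trajectories)
    weights: per-point confidence values; higher means more trusted
    """
    best_index, best_cost = None, math.inf
    for i, candidate in enumerate(points):
        # Cost of choosing this candidate as the cluster representative.
        cost = sum(w * math.dist(candidate, p) for p, w in zip(points, weights))
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Hypothetical samples: two nearby points and one distant outlier.
samples = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
uniform_choice = weighted_medoid(samples, [1.0, 1.0, 1.0])
confident_first = weighted_medoid(samples, [5.0, 1.0, 1.0])
```

Because the medoid is always an actual member of the cluster, the representative is a real observation rather than an average, and raising the confidence of one sample pulls the choice toward it.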
AfroBeats Dance Movement Analysis Using Computer Vision: A Proof-of-Concept Framework Combining YOLO and Segment Anything Model
This paper presents a preliminary investigation into automated dance movement analysis using contemporary computer vision techniques. We propose a proof-of-concept framework that integrates YOLOv8 and v11 for dancer detection with the Segment Anything Model (SAM) for precise segmentation, enabling the tracking and quantification of dancer movements in video recordings without specialized equipment or markers. Our approach identifies dancers within video frames, counts discrete dance steps, calculates spatial coverage patterns, and measures rhythm consistency across performance sequences. Testing this framework on a single 49-second recording of Ghanaian AfroBeats dance demonstrates technical feasibility, with the system achieving approximately 94% detection precision and 89% recall on manually inspected samples. The pixel-level segmentation provided by SAM, achieving approximately 83% intersection-over-union with visual inspection, enables motion quantification that captures body configuration changes beyond what bounding-box approaches can represent. Analysis of this preliminary case study indicates that the dancer classified as primary by our system executed 23% more steps with 37% higher motion intensity and utilized 42% more performance space compared to dancers classified as secondary. However, this work represents an early-stage investigation with substantial limitations including single-video validation, absence of systematic ground truth annotations, and lack of comparison with existing pose estimation methods. We present this framework to demonstrate technical feasibility, identify promising directions for quantitative dance metrics, and establish a foundation for future systematic validation studies.
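For readers unfamiliar with the metrics quoted in the abstract above, the following sketch shows how detection precision, recall, and mask intersection-over-union are commonly computed. The counts and the tiny binary masks are made-up illustrations, not data from the study, and the function names are hypothetical.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as nested lists."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    union = a | b
    # Two empty masks have no overlap to measure; report 0.0 by convention here.
    return len(a & b) / len(union) if union else 0.0


# Illustrative values only.
p, r = precision_recall(true_positives=8, false_positives=2, false_negatives=2)
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

Pixel-level IoU penalizes both missed and spurious pixels, which is why it is a stricter measure of segmentation quality than bounding-box overlap.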
Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation ICCV 2025
Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). First, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. To capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between predictions from the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.
comment: ICCV 2025 (poster)
EEA: Exploration-Exploitation Agent for Long Video Understanding
Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance with a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation
The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM's adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Various experiments demonstrate that NAS-LoRA improves on existing PEFT methods while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
Towards Object-centric Understanding for Instructional Videos
Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning: state evolution, precondition verification, counterfactual reasoning, and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle with object-level recognition and reasoning, whereas our framework achieves substantial improvement.
Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
comment: 10 pages, 3 tables
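Two quantities named in the abstract above lend themselves to a short illustration: inverse frequency weighting over demographic groups, and the gap between the best and worst per-group accuracy. The sketch below is a plain evaluation-time version under stated assumptions. The normalization of the weights, the function names, and the toy labels are hypothetical, and the hard max-minus-min gap shown here is not differentiable, so it is not the paper's MaxAccGap loss.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Weight each group by total / (num_groups * group_count).

    With this (assumed) normalization, a perfectly balanced dataset
    gives every group a weight of 1.0.
    """
    counts = Counter(groups)
    total = len(groups)
    return {g: total / (len(counts) * c) for g, c in counts.items()}


def max_accuracy_gap(predictions, labels, groups):
    """Difference between the highest and lowest per-group accuracy."""
    correct, seen = Counter(), Counter()
    for p, y, g in zip(predictions, labels, groups):
        seen[g] += 1
        correct[g] += int(p == y)
    accuracies = [correct[g] / seen[g] for g in seen]
    return max(accuracies) - min(accuracies)


# Toy example with an over-represented group "a" and a rare group "b".
groups = ["a", "a", "a", "b"]
weights = inverse_frequency_weights(groups)
gap = max_accuracy_gap([1, 1, 0, 0], [1, 1, 1, 1], groups)
```

Rare groups receive larger weights, so their samples contribute more to the training signal, while the gap offers a single number for tracking accuracy parity across groups.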
Procedural Mistake Detection via Action Effect Modeling
Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.
Difference Decomposition Networks for Infrared Small Target Detection
Infrared small target detection (ISTD) faces two major challenges: a lack of discernible target texture and severe background clutter, which results in the background obscuring the target. To enhance targets and suppress backgrounds, we propose the Basis Decomposition Module (BDM) as an extensible and lightweight module based on basis decomposition, which decomposes a complex feature into several basis features and enhances certain information while eliminating redundancy. Extending BDM leads to a series of modules, including the Spatial Difference Decomposition Module (SD$^\mathrm{2}$M), Spatial Difference Decomposition Downsampling Module (SD$^\mathrm{3}$M), and Temporal Difference Decomposition Module (TD$^\mathrm{2}$M). Based on these modules, we develop the Spatial Difference Decomposition Network (SD$^\mathrm{2}$Net) for single-frame ISTD (SISTD) and the Spatiotemporal Difference Decomposition Network (STD$^\mathrm{2}$Net) for multi-frame ISTD (MISTD). SD$^\mathrm{2}$Net integrates SD$^\mathrm{2}$M and SD$^\mathrm{3}$M within an adapted U-shaped architecture. We employ TD$^\mathrm{2}$M to introduce motion information, which transforms SD$^\mathrm{2}$Net into STD$^\mathrm{2}$Net. Extensive experiments on SISTD and MISTD datasets demonstrate state-of-the-art (SOTA) performance. On the SISTD task, SD$^\mathrm{2}$Net performs well compared to most established networks. On the MISTD datasets, STD$^\mathrm{2}$Net achieves a mIoU of 87.68%, outperforming SD$^\mathrm{2}$Net, which achieves a mIoU of 64.97%. Our code is available at https://github.com/greekinRoma/IRSTD_HC_Platform.
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose the Text-Printed Image (TPI), a synthetic image generated by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
Rethinking the Learning Paradigm for Facial Expression Recognition
Due to the subjective crowdsourcing annotations and the inherent inter-class similarity of facial expressions, the real-world Facial Expression Recognition (FER) datasets usually exhibit ambiguous annotation. To simplify the learning paradigm, most previous methods convert ambiguous annotation results into precise one-hot annotations and train FER models in an end-to-end supervised manner. In this paper, we rethink the existing training paradigm and propose that it is better to use weakly supervised strategies to train FER models with original ambiguous annotation.
SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation
Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold and Lotus adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Extensive ablations highlight the importance of diffusion priors, RGB reconstruction, and multi-scale SD U-Net features for cross-domain generalization, and t-SNE analyses further explain SD's domain-invariant latent structure. We also show that SDPose serves as an effective zero-shot pose annotator for controllable image and video generation.
comment: 20 pages, 10 figures, 7 tables
NVRC: Neural Video Representation Compression
Recent advances in implicit neural representation (INR)-based video coding have demonstrated its potential to compete with both conventional and other learning-based approaches. With INR methods, a neural network is trained to overfit a video sequence, with its parameters compressed to obtain a compact representation of the video content. However, although promising results have been achieved, the best INR-based methods are still outperformed by the latest standard codecs, such as VVC VTM, partially due to the simple model compression techniques employed. In this paper, rather than focusing on representation architectures as in many existing works, we propose a novel INR-based video compression framework, Neural Video Representation Compression (NVRC), targeting compression of the representation. Based on the novel entropy coding and quantization models proposed, NVRC, for the first time, is able to optimize an INR-based video codec in a fully end-to-end manner. To further minimize the additional bitrate overhead introduced by the entropy models, we have also proposed a new model compression framework for coding all the network, quantization and entropy model parameters hierarchically. Our experiments show that NVRC outperforms many conventional and learning-based benchmark codecs, with a 24% average coding gain over VVC VTM (Random Access) on the UVG dataset, measured in PSNR. As far as we are aware, this is the first time an INR-based video codec has achieved such performance. The implementation of NVRC will be released.
Sat2Flow: A Structure-Aware Diffusion Framework for Human Flow Generation from Satellite Imagery
Origin-Destination (OD) flow matrices are critical for urban mobility analysis, supporting traffic forecasting, infrastructure planning, and policy design. Existing methods face two key limitations: (1) reliance on costly auxiliary features (e.g., Points of Interest, socioeconomic statistics) with limited spatial coverage, and (2) fragility to spatial topology changes, where reordering urban regions disrupts the structural coherence of generated flows. We propose Sat2Flow, a structure-aware diffusion framework that generates structurally coherent OD flows using only satellite imagery. Our approach employs a multi-kernel encoder to capture diverse regional interactions and a permutation-aware diffusion process that maintains consistency across regional orderings. Through joint contrastive training linking satellite features with OD patterns and equivariant diffusion training enforcing structural invariance, Sat2Flow ensures topological robustness under arbitrary regional reindexing. Experiments on real-world datasets show that Sat2Flow outperforms physics-based and data-driven baselines in accuracy while preserving flow distributions and spatial structures under index permutations. Sat2Flow offers a globally scalable solution for OD flow generation in data-scarce environments, eliminating region-specific auxiliary data dependencies while maintaining structural robustness for reliable mobility modeling.
comment: 9 pages, 5 figures
Neural Radiance and Gaze Fields for Visual Attention Modeling in 3D Environments
We introduce Neural Radiance and Gaze Fields (NeRGs), a novel approach for representing visual attention in complex environments. Much like how Neural Radiance Fields (NeRFs) perform novel view synthesis, NeRGs reconstruct gaze patterns from arbitrary viewpoints, implicitly mapping visual attention to 3D surfaces. We achieve this by augmenting a standard NeRF with an additional network that models local egocentric gaze probability density, conditioned on scene geometry and observer position. The output of a NeRG is a rendered view of the scene alongside a pixel-wise salience map representing the conditional probability that a given observer fixates on visible surfaces. Unlike prior methods, our system is lightweight and enables visualization of gaze fields at interactive framerates. Moreover, NeRGs allow the observer perspective to be decoupled from the rendering camera and correctly account for gaze occlusion due to intervening geometry. We demonstrate the effectiveness of NeRGs using head pose from skeleton tracking as a proxy for gaze, employing our proposed gaze probes to aggregate noisy rays into robust probability density targets for supervision.
comment: 11 pages, 8 figures
SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control
Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20
comment: 14 pages, 9 figures
Defense That Attacks: How Robust Models Become Better Attackers
Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduces a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.
Efficient Transferable Optimal Transport via Min-Sliced Transport Plans
Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
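The slicing idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it projects both point sets onto a direction, solves the 1D OT problem in closed form by sorting, lifts the resulting matching back to the ambient space, and evaluates its cost there. Equal-size, uniformly weighted point sets and a brute-force search over a handful of candidate directions are assumptions made for brevity (the paper optimizes the slicer), and all function names are hypothetical.

```python
import math

def sliced_plan(xs, ys, theta):
    """Match xs[i] to ys[j] by sorting the 1D projections onto direction theta."""
    proj = lambda p: sum(a * b for a, b in zip(p, theta))
    ix = sorted(range(len(xs)), key=lambda i: proj(xs[i]))
    iy = sorted(range(len(ys)), key=lambda j: proj(ys[j]))
    return list(zip(ix, iy))  # index pairs (i, j): the lifted transport plan

def plan_cost(xs, ys, plan):
    """Mean squared Euclidean cost of the plan, measured in the ambient space."""
    return sum(math.dist(xs[i], ys[j]) ** 2 for i, j in plan) / len(plan)

def min_sliced_plan(xs, ys, thetas):
    """Return (cost, theta) for the candidate direction with the cheapest plan."""
    return min((plan_cost(xs, ys, sliced_plan(xs, ys, t)), t) for t in thetas)

# Toy example: two rows of 2D points, one shifted vertically and shuffled.
xs = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ys = [(2.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
thetas = [(1.0, 0.0), (0.0, 1.0)]  # hypothetical candidate slicers
best_cost, best_theta = min_sliced_plan(xs, ys, thetas)
```

In this toy case the horizontal direction separates the points and recovers the matching with the lowest ambient cost, while the vertical direction collapses each set to a single value and yields an arbitrary matching. Transferability, in this picture, asks whether a direction selected on one pair of point sets remains a good choice for a nearby pair.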
SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation
Synthesizing spectral images across different wavelengths is essential for photorealistic rendering. Unlike conventional spectral uplifting methods that convert RGB images into spectral ones, we introduce SpecGen, a novel method that generates spectral bidirectional reflectance distribution functions (BRDFs) from a single RGB image of a sphere. This enables spectral image rendering under arbitrary illuminations and shapes covered by the corresponding material. A key challenge in spectral BRDF generation is the scarcity of measured spectral BRDF data. To address this, we propose the Spectral-Spatial Tri-plane Aggregation (SSTA) network, which models reflectance responses across wavelengths and incident-outgoing directions, allowing the training strategy to leverage abundant RGB BRDF data to enhance spectral BRDF generation. Experiments show that our method accurately reconstructs spectral BRDFs from limited spectral data and surpasses state-of-the-art methods in hyperspectral image reconstruction, achieving an improvement of 8 dB in PSNR. Codes and data will be released upon acceptance.
A Machine Learning-Driven Solution for Denoising Inertial Confinement Fusion Images
Neutron imaging is essential for diagnosing and optimizing inertial confinement fusion implosions at the National Ignition Facility. Due to the required 10-micrometer resolution, however, neutron images require image reconstruction using iterative algorithms. For low-yield sources, the images may be degraded by various types of noise. Gaussian and Poisson noise often coexist within one image, obscuring fine details and blurring the edges where the source information is encoded. Traditional denoising techniques, such as filtering and thresholding, can inadvertently alter critical features or reshape the noise statistics, potentially impacting the ultimate fidelity of the iterative image reconstruction pipeline. However, recent advances in synthetic data production and machine learning have opened new opportunities to address these challenges. In this study, we present an unsupervised autoencoder with a Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space, designed to suppress mixed Gaussian-Poisson noise while preserving essential image features. The network successfully denoises neutron imaging data. Benchmarking against both simulated and experimental NIF datasets demonstrates that our approach achieves lower reconstruction error and superior edge preservation compared to conventional filtering methods such as Block-matching and 3D filtering (BM3D). By validating the effectiveness of unsupervised learning for denoising neutron images, this study establishes a critical first step towards fully AI-driven, end-to-end reconstruction frameworks for ICF diagnostics.
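To make the mixed noise setting concrete, the sketch below simulates it in the way it is commonly modeled: each pixel receives signal-dependent Poisson (shot) noise plus additive zero-mean Gaussian noise. This is an illustrative assumption, not the authors' synthetic data pipeline; the function names, the pixel values, and the choice of sigma are hypothetical.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (suitable for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def add_mixed_noise(image, sigma, seed=0):
    """Return the image with Poisson shot noise plus additive Gaussian noise."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [[poisson_sample(px, rng) + rng.gauss(0.0, sigma) for px in row]
            for row in image]

# Hypothetical 2x2 image of expected counts per pixel.
clean = [[5.0, 20.0], [0.0, 10.0]]
noisy = add_mixed_noise(clean, sigma=1.5)
```

Because the Poisson variance equals the pixel intensity while the Gaussian variance is constant, the noise level differs from pixel to pixel, which is one reason fixed filters and thresholds can distort the noise statistics.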
Exploring the Potentials of Spiking Neural Networks for Image Deraining AAAI2026
Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an example, this study addresses the representation of the inherent high-pass characteristics of spiking neurons and proposes the Visual LIF (VLIF) neuron, which overcomes the lack of spatial contextual understanding in traditional spiking neurons. To tackle the limitation of frequency-domain saturation inherent in conventional spiking neurons, we leverage the proposed VLIF to introduce the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Extensive experiments across five benchmark deraining datasets demonstrate that our approach significantly outperforms state-of-the-art SNN-based deraining methods, achieving this superior performance with only 13\% of their energy consumption. These findings establish a solid foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks.
comment: Accepted By AAAI2026
BitMark: Watermarking Bitwise Autoregressive Image Generative Models NeurIPS 2025
State-of-the-art text-to-image models generate photorealistic images at an unprecedented speed. This work focuses on models that operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework. Our method embeds a watermark directly at the bit level of the token stream during the image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs. The code is available at https://github.com/sprintml/BitMark.
comment: Accepted as a Conference Paper at NeurIPS 2025
Accuracy-Robustness Trade Off via Spiking Neural Network Gradient Sparsity Trail
Spiking Neural Networks (SNNs) have attracted growing interest in both computational neuroscience and artificial intelligence, primarily due to their inherent energy efficiency and compact memory footprint. However, achieving adversarial robustness in SNNs, particularly for vision-related tasks, remains a nascent and underexplored challenge. Recent studies have proposed leveraging sparse gradients as a form of regularization to enhance robustness against adversarial perturbations. In this work, we present a surprising finding: under specific architectural configurations, SNNs exhibit natural gradient sparsity and can achieve state-of-the-art adversarial defense performance without the need for any explicit regularization. Further analysis reveals a trade-off between robustness and generalization: while sparse gradients contribute to improved adversarial resilience, they can impair the model's ability to generalize; conversely, denser gradients support better generalization but increase vulnerability to attacks. Our findings offer new insights into the dual role of gradient sparsity in SNN training.
comment: Work under peer-review
A Tractable Two-Step Linear Mixing Model Solved with Second-Order Optimization for Spectral Unmixing under Variability
In this paper, we propose a Two-Step Linear Mixing Model (2LMM) that bridges the gap between model complexity and computational tractability. The model achieves this by introducing two distinct scaling steps: an endmember scaling step across the image, and another for pixel-wise scaling. We show that this model leads to only a mildly non-convex optimization problem, which we solve with an optimization algorithm that incorporates second-order information. To the authors' knowledge, this work represents the first application of second-order optimization techniques to solve a spectral unmixing problem that models endmember variability. Our method is highly robust, as it requires virtually no hyperparameter tuning and can therefore be used easily and quickly in a wide range of unmixing tasks. We show through extensive experiments on both simulated and real data that the new model is competitive and in some cases superior to the state of the art in unmixing. The model also performs very well in challenging scenarios, such as blind unmixing.
comment: This work has been submitted to the IEEE for possible publication
D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation
Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model, DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2's exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose $D^{2}$-VPR, a $D$istillation- and $D$eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning. Additionally, we introduce a Distillation Recovery Module (DRM) to better align the feature spaces between the teacher and student models, thereby minimizing knowledge transfer losses to the greatest extent possible. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. Meanwhile, it reduces the parameter count by approximately 64.2% and FLOPs by about 62.6% (compared to CricaVPR). Code is available at https://github.com/tony19980810/D2VPR.
GNSS-Inertial State Initialization Using Inter-Epoch Baseline Residuals
Initializing the state of a sensorized platform can be challenging, as a limited set of measurements often provides only weakly informative constraints that are, in addition, highly non-linear. This may lead to poor initial estimates that converge to local minima during subsequent non-linear optimization. We propose an adaptive GNSS-inertial initialization strategy that delays the incorporation of global GNSS constraints until they become sufficiently informative. In the initial stage, our method leverages inter-epoch baseline vector residuals between consecutive GNSS fixes to mitigate inertial drift. To determine when to activate global constraints, we introduce a general criterion based on the evolution of the Hessian matrix's singular values, effectively quantifying system observability. Experiments on EuRoC, GVINS and MARS-LVIG datasets show that our approach consistently outperforms the naive strategy of fusing all measurements from the outset, yielding more accurate and robust initializations.
comment: 8 pages, 7 figures, accepted to RA-L
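As a rough illustration of an inter-epoch baseline residual, the sketch below compares the displacement between consecutive GNSS fixes with the displacement predicted by inertial integration over the same interval. It is a simplified assumption rather than the paper's formulation: both trajectories are taken to be expressed in axis-aligned local frames, and lever arms, unknown rotations, and measurement covariances are ignored. The function name and the sample values are hypothetical.

```python
def baseline_residuals(gnss_fixes, inertial_positions):
    """Residual between GNSS and inertial displacement for each epoch pair."""
    residuals = []
    for k in range(len(gnss_fixes) - 1):
        # Baseline vector between consecutive GNSS fixes.
        gnss_baseline = [b - a for a, b in zip(gnss_fixes[k], gnss_fixes[k + 1])]
        # Displacement predicted by inertial integration over the same interval.
        ins_baseline = [b - a for a, b in zip(inertial_positions[k],
                                               inertial_positions[k + 1])]
        residuals.append([g - i for g, i in zip(gnss_baseline, ins_baseline)])
    return residuals

# Hypothetical positions in metres; the two frames differ by a constant offset.
gnss = [(10.0, 5.0, 0.0), (11.0, 5.0, 0.0), (12.0, 5.5, 0.0)]
ins = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.5, 0.5, 0.0)]
res = baseline_residuals(gnss, ins)
```

Because only differences between consecutive epochs enter the residual, the constant translation between the GNSS frame and the inertial frame cancels out. Under these assumptions, that is why such residuals can constrain drift before the global alignment is well observed.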
DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DeLiVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion
comment: Code and models will be available at https://github.com/timbroed/DGFusion
Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking
3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.
Language-Driven Object-Oriented Two-Stage Method for Scene Graph Anticipation
A scene graph is a structured representation of objects and their spatio-temporal relationships in dynamic scenes. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications in intelligent surveillance and human-machine collaboration. While recent SGA approaches excel at leveraging visual evidence, long-horizon forecasting fundamentally depends on semantic priors and commonsense temporal regularities that are challenging to extract purely from visual features. To explicitly model these semantic dynamics, we propose Linguistic Scene Graph Anticipation (LSGA), a linguistic formulation of SGA that performs temporal relational reasoning over sequences of textualized scene graphs, with visual scene-graph detection handled by a modular front-end when operating on video. Building on this formulation, we introduce Object-Oriented Two-Stage Method (OOTSM), a language-based framework that anticipates object-set dynamics and forecasts object-centric relation trajectories with temporal consistency regularization, and we evaluate it on a dedicated benchmark constructed from Action Genome annotations. Extensive experiments show that compact fine-tuned language models with up to 3B parameters consistently outperform strong zero- and one-shot API baselines, including GPT-4o, GPT-4o-mini, and DeepSeek-V3, under matched textual inputs and context windows. When coupled with off-the-shelf visual scene-graph generators, the resulting multimodal system achieves substantial improvements on video-based SGA, boosting long-horizon mR@50 by up to 21.9\% over strong visual SGA baselines.
You Point, I Learn: Online Adaptation of Interactive Segmentation Models for Handling Distribution Shifts in Medical Imaging
Interactive segmentation uses real-time user inputs, such as mouse clicks, to iteratively refine model predictions. Although not originally designed to address distribution shifts, this paradigm naturally lends itself to such challenges. In medical imaging, where distribution shifts are common, interactive methods can use user inputs to guide models towards improved predictions. Moreover, once a model is deployed, user corrections can be used to adapt the network parameters to the new data distribution, mitigating distribution shift. Based on these insights, we aim to develop a practical, effective method for improving the adaptive capabilities of interactive segmentation models to new data distributions in medical imaging. Firstly, we found that strengthening the model's responsiveness to clicks is important for the initial training process. Moreover, we show that by treating the post-interaction user-refined model output as pseudo-ground-truth, we can design a lean, practical online adaptation method that enables a model to learn effectively across sequential test images. The framework includes two components: (i) a Post-Interaction adaptation process, updating the model after the user has completed interactive refinement of an image, and (ii) a Mid-Interaction adaptation process, updating incrementally after each click. Both processes include a Click-Centered Gaussian loss that strengthens the model's reaction to clicks and enhances focus on user-guided, clinically relevant regions. Experiments on 5 fundus and 4 brain-MRI databases show that our approach consistently outperforms existing methods under diverse distribution shifts, including unseen imaging modalities and pathologies. Code and pretrained models will be released upon publication.
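One plausible reading of a click-centered Gaussian loss is a pixel-wise loss reweighted by a Gaussian that peaks at the user's click. The sketch below follows that reading; it is an assumption for illustration only, since the abstract does not give the exact form of the loss. The binary cross-entropy base loss, the normalization by the weight sum, and all names and parameter values are hypothetical.

```python
import math

def click_gaussian_weights(height, width, click, sigma):
    """Gaussian weight map that peaks (value 1.0) at the clicked pixel."""
    cy, cx = click
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def click_weighted_bce(pred, target, weights, eps=1e-7):
    """Binary cross-entropy averaged with the click-centered weights."""
    total, norm = 0.0, 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            norm += w
    return total / norm

# The same single-pixel mistake, once at the click and once far from it.
weights = click_gaussian_weights(5, 5, click=(2, 2), sigma=1.0)
target = [[1] * 5 for _ in range(5)]
miss_near_click = [[0.9] * 5 for _ in range(5)]
miss_near_click[2][2] = 0.1
miss_far_from_click = [[0.9] * 5 for _ in range(5)]
miss_far_from_click[0][0] = 0.1
loss_near = click_weighted_bce(miss_near_click, target, weights)
loss_far = click_weighted_bce(miss_far_from_click, target, weights)
```

With this weighting, an error at the clicked pixel is penalized more heavily than the same error in a distant corner, which matches the stated goal of strengthening the model's reaction to clicks.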
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to much larger models such as Qwen2.5-Omni-7B on general benchmarks, and it can retain 97% of the performance of InteractiveOmni-8B while utilizing only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.
GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark
Text-to-3D (T23D) generation has emerged as a crucial visual generation task, aiming at synthesizing 3D content from textual descriptions. Studies of this task are currently shifting from per-scene T23D, which requires optimization of the model for every content generated, to General T23D (GT23D), which requires only one pre-trained model to generate different content without re-optimization, for more generalized and efficient 3D generation. Despite notable advancements, GT23D is severely bottlenecked by two interconnected challenges: the lack of high-quality, large-scale training data and the prevalence of evaluation metrics that overlook intrinsic 3D properties. Existing datasets often suffer from incomplete annotations, noisy organization, and inconsistent quality, while current evaluations rely heavily on 2D image-text similarity or scoring, failing to thoroughly assess 3D geometric integrity and semantic relevance. To address these fundamental gaps, we introduce GT23D-Bench, the first comprehensive benchmark specifically designed for GT23D training and evaluation. We first construct a high-quality dataset of 400K 3D assets, featuring diverse visual annotations (70M+ visual samples) and multi-granularity hierarchical captions (1M+ descriptions) to foster robust semantic learning. Second, we propose a comprehensive evaluation suite with 10 metrics assessing both text-3D alignment and 3D visual quality at multiple levels. Crucially, we demonstrate through rigorous experiments that our proposed metrics exhibit significantly higher correlation with human judgment compared to existing methods. Our in-depth analysis of eight leading GT23D models using this benchmark provides the community with critical insights into current model capabilities and their shared failure modes. GT23D-Bench will be publicly available to facilitate rigorous and reproducible research.
HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving
Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.
comment: Project page: https://hybridworldsim.github.io/
TransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation
Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we employ the TransUNet architecture, a hybrid framework that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net structure. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution. We trained the model on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset using a robust augmentation pipeline and a hybrid loss function to mitigate class imbalance. On the validation set, the model achieved a Dice Similarity Coefficient (F1-score) of 0.8799 using an optimized threshold of 0.4389. To ensure clinical transparency, we integrated Grad-CAM visualizations to highlight model focus areas. Furthermore, a clinical utility analysis demonstrated a strong correlation (Pearson r = 0.9631) between predicted and ground-truth wound areas. These outcomes demonstrate that our approach effectively integrates global and local feature extraction, offering a reliable, effective, and explainable solution for automated foot ulcer assessment.
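For readers unfamiliar with the metric, the sketch below computes the Dice Similarity Coefficient, 2|A∩B| / (|A| + |B|), for a thresholded probability map and picks the best threshold from a candidate list. The helper names and the toy data are hypothetical; the abstract does not describe how the reported threshold of 0.4389 was searched, so the sweep here is only one simple possibility.

```python
def dice_at_threshold(probs, mask, threshold):
    """Dice coefficient between a thresholded prediction and a binary mask.

    probs and mask are flat sequences of equal length; a pixel is predicted
    foreground when its probability is >= threshold.
    """
    pred = [1 if p >= threshold else 0 for p in probs]
    intersection = sum(p * m for p, m in zip(pred, mask))
    denom = sum(pred) + sum(mask)
    if denom == 0:
        return 1.0  # both empty: treat as perfect agreement
    return 2.0 * intersection / denom


def best_threshold(probs, mask, candidates):
    """Return the candidate threshold with the highest Dice score."""
    return max(candidates, key=lambda t: dice_at_threshold(probs, mask, t))
```

For example, with probabilities `[0.9, 0.6, 0.3, 0.1]` and mask `[1, 1, 1, 0]`, a threshold of 0.5 gives a Dice score of 0.8, while 0.2 recovers the mask exactly.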
STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control
Edge Gaussian splatting (EGS), which aggregates data from distributed clients (e.g., drones) and trains a global GS model at the edge (e.g., ground server), is an emerging paradigm for scene reconstruction in low-altitude economy. Unlike traditional edge resource management methods that emphasize communication throughput or general-purpose learning performance, EGS explicitly aims to maximize the GS qualities, rendering existing approaches inapplicable. To address this problem, this paper formulates a novel GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. However, evaluating this function in turn requires clients' images, leading to a causality dilemma. To this end, this paper further proposes a sample-then-transmit EGS (or STT-GS for short) strategy, which first samples a subset of images as pilot data from each client for loss prediction. Based on the first-stage evaluation, communication resources are then prioritized towards more valuable clients. To achieve efficient sampling, a feature-domain clustering (FDC) scheme is proposed to select the most representative data and pilot transmission time minimization (PTTM) is adopted to reduce the pilot overhead. Subsequently, we develop a joint client selection and power control (JCSPC) framework to maximize the GS-oriented function under communication resource constraints. Despite the nonconvexity of the problem, we propose a low-complexity efficient solution based on the penalty alternating majorization minimization (PAMM) algorithm. Experiments reveal that the proposed scheme significantly outperforms existing benchmarks on real-world datasets. The GS-oriented objective can be accurately predicted with low sampling ratios (e.g., 10%), and our method achieves an excellent tradeoff between view contributions and communication costs.
LoRA Patching: Exposing the Fragility of Proactive Defenses against Deepfakes
Deepfakes pose significant societal risks, motivating the development of proactive defenses that embed adversarial perturbations in facial images to prevent manipulation. However, in this paper, we show that these preemptive defenses often lack robustness and reliability. We propose a novel approach, Low-Rank Adaptation (LoRA) patching, which injects a plug-and-play LoRA patch into Deepfake generators to bypass state-of-the-art defenses. A learnable gating mechanism adaptively controls the effect of the LoRA patch and prevents gradient explosions during fine-tuning. We also introduce a Multi-Modal Feature Alignment (MMFA) loss, encouraging the features of adversarial outputs to align with those of the desired outputs at the semantic level. Beyond bypassing, we present defensive LoRA patching, embedding visible warnings in the outputs as a complementary solution to mitigate this newly identified security vulnerability. With only 1,000 facial examples and a single epoch of fine-tuning, LoRA patching successfully defeats multiple proactive defenses. These results reveal a critical weakness in current paradigms and underscore the need for more robust Deepfake defense strategies. Our code is available at https://github.com/ZOMIN28/LoRA-Patching.
Test-time Correction: An Online 3D Detection System via Visual Prompting
This paper introduces Test-time Correction (TTC), an online 3D detection system designed to rectify test-time errors using various auxiliary feedback, aiming to enhance the safety of deployed autonomous driving systems. Unlike conventional offline 3D detectors that remain fixed during inference, TTC enables immediate online error correction without retraining, allowing autonomous vehicles to adapt to new scenarios and reduce deployment risks. To achieve this, we equip existing 3D detectors with an Online Adapter (OA) module -- a prompt-driven query generator for real-time correction. At the core of the OA module are visual prompts: image-based descriptions of objects of interest derived from auxiliary feedback such as mismatches with 2D detections, road descriptions, or user clicks. These visual prompts, collected from risky objects during inference, are maintained in a visual prompt buffer to enable continuous correction in future frames. By leveraging this mechanism, TTC consistently detects risky objects, achieving reliable, adaptive, and versatile driving autonomy. Extensive experiments show that TTC significantly improves instant error rectification over frozen 3D detectors, even under limited labels, zero-shot settings, and adverse conditions. We hope this work inspires future research on post-deployment online rectification systems for autonomous driving.
Assessing the Alignment of Popular CNNs to the Brain for Valence Appraisal
Convolutional Neural Networks (CNNs) are a popular type of computer model that have proven their worth in many computer vision tasks. Moreover, they form an interesting study object for the field of psychology, with shown correspondences between the workings of CNNs and the human brain. However, these correspondences have so far mostly been studied in the context of general visual perception. In contrast, this paper explores to what extent this correspondence also holds for a more complex brain process, namely social cognition. To this end, we assess the alignment between popular CNN architectures and both human behavioral and fMRI data for image valence appraisal through a correlation analysis. We show that for this task CNNs struggle to go beyond simple visual processing, and do not seem to reflect higher-order brain processing. Furthermore, we present Object2Brain, a novel framework that combines GradCAM and object detection at the CNN-filter level with the aforementioned correlation analysis to study the influence of different object classes on the CNN-to-human correlations. Despite similar correlation trends, different CNN architectures are shown to display different object class sensitivities.
comment: 12 pages, 5 figures
Universal Multi-Domain Translation via Diffusion Routers
Multi-domain translation (MDT) aims to learn translations between multiple domains, yet existing approaches either require fully aligned tuples or can only handle domain pairs seen in training, limiting their practicality and excluding many cross-domain mappings. We introduce universal MDT (UMDT), a generalization of MDT that seeks to translate between any pair of $K$ domains using only $K-1$ paired datasets with a central domain. To tackle this problem, we propose Diffusion Router (DR), a unified diffusion-based framework that models all central$\leftrightarrow$non-central translations with a single noise predictor conditioned on the source and target domain labels. DR enables indirect non-central translations by routing through the central domain. We further introduce a novel scalable learning strategy with a variational-bound objective and an efficient Tweedie refinement procedure to support direct non-central mappings. Through evaluation on three large-scale UMDT benchmarks, DR achieves state-of-the-art results for both indirect and direct translations, while lowering sampling cost and unlocking novel tasks such as sketch$\leftrightarrow$segmentation. These results establish DR as a scalable and versatile framework for universal translation across multiple domains.
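The indirect-translation idea, where a non-central pair is handled by passing through the central domain, can be sketched as a small routing function. The domain names and the `route` helper below are hypothetical and model only the hop structure; they say nothing about the diffusion model itself or about the direct-mapping strategy the abstract also mentions.

```python
def route(source, target, central):
    """Return the list of (src, dst) translation hops for an indirect route.

    Any pair involving the central domain is a single hop; a pair of two
    non-central domains is routed through the central domain in two hops.
    """
    if source == target:
        return []
    if source == central or target == central:
        return [(source, target)]
    return [(source, central), (central, target)]


def paired_datasets(domains, central):
    """List the K-1 (central, other) pairings needed for K domains."""
    return [(central, d) for d in domains if d != central]
```

With a hypothetical central domain `"photo"`, `route("sketch", "segmentation", "photo")` yields two hops, while `paired_datasets` on three domains returns two pairings, matching the K-1 count stated in the abstract.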
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference
This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and the model layers across multiple GPUs. It employs a patch-level pipeline parallel strategy to orchestrate communication and computation efficiently. By capitalizing on the high similarity between inputs from successive diffusion steps, PipeFusion reuses one-step stale feature maps to provide context for the current pipeline step. This approach notably reduces communication costs compared to existing DiTs inference parallelism, including tensor parallel, sequence parallel and DistriFusion. PipeFusion enhances memory efficiency through parameter distribution across devices, ideal for large DiTs like Flux.1. Experimental results demonstrate that PipeFusion achieves state-of-the-art performance on 8$\times$L40 PCIe GPUs for Pixart, Stable-Diffusion 3, and Flux.1 models. Our source code is available at https://github.com/xdit-project/xDiT.
MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video AAAI 2026
We present MoBGS, a novel motion deblurring 3D Gaussian Splatting (3DGS) framework capable of reconstructing sharp and high-quality novel spatio-temporal views from blurry monocular videos in an end-to-end manner. Existing dynamic novel view synthesis (NVS) methods are highly sensitive to motion blur in casually captured videos, resulting in significant degradation of rendering quality. While recent approaches address motion-blurred inputs for NVS, they primarily focus on static scene reconstruction and lack dedicated motion modeling for dynamic objects. To overcome these limitations, our MoBGS introduces a novel Blur-adaptive Latent Camera Estimation (BLCE) method using a proposed Blur-adaptive Neural Ordinary Differential Equation (ODE) solver for effective latent camera trajectory estimation, improving global camera motion deblurring. In addition, we propose a Latent Camera-induced Exposure Estimation (LCEE) method to ensure consistent deblurring of both a global camera and local object motions. Extensive experiments on the Stereo Blur dataset and real-world blurry videos show that our MoBGS significantly outperforms the very recent methods, achieving state-of-the-art performance for dynamic NVS under motion blur.
comment: This paper has been accepted to AAAI 2026. The first two authors contributed equally to this work. The last two authors are co-corresponding authors. Please visit our project page at https://kaist-viclab.github.io/mobgs-site/
MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss as they discard or compress token representations. This problem is further exacerbated when the same fine-grained token processing is uniformly applied across all images regardless of visual complexity. We observe that not all inputs require fine-grained processing: simple images can be effectively handled at a coarse resolution, while only complex ones require refinement. Based on this insight, we propose MambaScope, an adaptive framework for efficient inference for Vision Mamba. MambaScope first performs coarse-grained inference by dividing the input image into large patches, significantly reducing token length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover essential visual details with minimal additional cost. This dynamic resolution assignment strategy allows MambaScope to allocate computation adaptively according to image complexity, achieving efficient processing without compromising accuracy. Experiments across various vision tasks demonstrate that MambaScope outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
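A minimal sketch of confidence-gated coarse-to-fine inference follows. It simplifies the description above in one important way: on low confidence it re-runs the whole image through a fine-grained model rather than re-processing only selected regions. The model callables, the threshold `tau`, and the patch sizes in the usage note are hypothetical stand-ins, not values from the paper.

```python
def token_count(image_size, patch_size):
    """Number of tokens for a square image split into square patches."""
    per_side = image_size // patch_size
    return per_side * per_side


def confidence_gated_predict(image, coarse_model, fine_model, tau=0.8):
    """Run the coarse model first; fall back to the fine model if unsure.

    coarse_model and fine_model are callables returning class probabilities.
    Returns (probabilities, stage), where stage is "coarse" or "fine".
    """
    probs = coarse_model(image)
    if max(probs) >= tau:
        return probs, "coarse"  # confident enough: skip refinement
    return fine_model(image), "fine"
```

As an illustration of the token savings, a 224-pixel image yields 49 tokens with 32-pixel patches versus 196 tokens with 16-pixel patches, so inputs resolved at the coarse stage process a quarter of the tokens.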
DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes ICLR 2025
Urban scene generation has been developing rapidly in recent years. However, existing methods primarily focus on generating static and single-frame scenes, overlooking the inherently dynamic nature of real-world driving environments. In this work, we introduce DynamicCity, a novel 4D occupancy generation framework capable of generating large-scale, high-quality dynamic 4D scenes with semantics. DynamicCity mainly consists of two key models. 1) A VAE model for learning HexPlane as the compact 4D representation. Instead of using naive averaging operations, DynamicCity employs a novel Projection Module to effectively compress 4D features into six 2D feature maps for HexPlane construction, which significantly enhances HexPlane fitting quality (up to 12.56 mIoU gain). Furthermore, we utilize an Expansion & Squeeze Strategy to reconstruct 3D feature volumes in parallel, which improves both network training efficiency and reconstruction accuracy compared with naively querying each 3D point (up to 7.05 mIoU gain, 2.06x training speedup, and 70.84% memory reduction). 2) A DiT-based diffusion model for HexPlane generation. To make HexPlane feasible for DiT generation, a Padded Rollout Operation is proposed to reorganize all six feature planes of the HexPlane as a square 2D feature map. In particular, various conditions can be introduced in the diffusion or sampling process, supporting versatile 4D generation applications, such as trajectory- and command-driven generation, inpainting, and layout-conditioned generation. Extensive experiments on the CarlaSC and Waymo datasets demonstrate that DynamicCity significantly outperforms existing state-of-the-art 4D occupancy generation methods across multiple metrics. The code and models have been released to facilitate future research.
comment: ICLR 2025 Spotlight; 35 pages, 18 figures, 15 tables; Project Page at https://dynamic-city.github.io/
Can VLMs Detect and Localize Fine-Grained AI-Edited Images?
Fine-grained detection and localization of localized image edits is crucial for assessing content authenticity, especially as modern diffusion models and image editors can produce highly realistic manipulations. However, this problem faces three key challenges: (1) most AIGC detectors produce only a global real-or-fake label without indicating where edits occur; (2) traditional computer vision methods for edit localization typically rely on costly pixel-level annotations; and (3) there is no large-scale, modern benchmark specifically targeting edited-image detection. To address these gaps, we develop an automated data-generation pipeline and construct FragFake, a large-scale benchmark of AI-edited images spanning multiple source datasets, diverse editing models, and several common edit types. Building on FragFake, we are the first to systematically study vision language models (VLMs) for edited-image classification and edited-region localization. Our experiments show that pretrained VLMs, including GPT4o, perform poorly on this task, whereas fine-tuned models such as Qwen2.5-VL achieve high accuracy and substantially higher object precision across all settings. We further explore GRPO-based RLVR training, which yields modest metric gains while improving the interpretability of model outputs. Ablation and transfer analyses reveal how data balancing, training size, LoRA rank, and training domain affect performance, and highlight both the potential and the limitations of cross-editor and cross-dataset generalization. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.
comment: 14 pages, 19 figures
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Code and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
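The compression ratio defined above is simply the number of text tokens divided by the number of latent tokens, so the context length covered at a given ratio is easy to work out. The helper below is a hypothetical illustration of that arithmetic; pairing specific ratios with specific latent lengths is an assumption, since the abstract does not say which combinations were evaluated.

```python
def text_tokens_covered(latent_tokens, compression_ratio):
    """Text tokens represented by a latent sequence at a given ratio."""
    return latent_tokens * compression_ratio


def compression_ratio(text_tokens, latent_tokens):
    """Ratio of text tokens to latent tokens."""
    return text_tokens / latent_tokens
```

For instance, 64 latent tokens at a 20x ratio would stand in for 1,280 text tokens, and the same 64 latent tokens at 40x would stand in for 2,560.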
TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image
Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI--especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
comment: Project page: https://d-robotics-ai-lab.github.io/TabletopGen.project/
OneThinker: All-in-one Reasoning Model for Image and Video
Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
comment: Project page: https://github.com/tulerfeng/OneThinker
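The OneThinker abstract describes EMA-GRPO only as tracking task-wise moving averages of reward standard deviations. One way such a normalizer might look is sketched below: each task keeps its own exponential moving average of the within-group reward standard deviation, and advantages are scaled by that running value instead of the current group's alone. The class name, the decay value, the initialization, and the exact advantage formula are all assumptions for illustration, not the released implementation.

```python
import statistics


class TaskwiseEmaStd:
    """Per-task EMA of reward standard deviation (hypothetical sketch)."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay      # assumed smoothing factor
        self.eps = eps          # guards against division by zero
        self.ema_std = {}       # task name -> running std estimate

    def advantages(self, task, rewards):
        """Center rewards on the group mean and scale by the task's EMA std."""
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        prev = self.ema_std.get(task)
        # First observation for a task initializes the average directly.
        current = std if prev is None else self.decay * prev + (1 - self.decay) * std
        self.ema_std[task] = current
        return [(r - mean) / (current + self.eps) for r in rewards]
```

Because each task has its own running scale in this sketch, tasks whose rewards live on very different ranges (say, a binary QA reward versus a continuous segmentation score) are normalized separately.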
MagicView: Multi-View Consistent Identity Customization via Priors-Guided In-Context Learning
Recent advances in personalized generative models have demonstrated impressive capabilities in producing identity-consistent images of the same individual across diverse scenes. However, most existing methods lack explicit viewpoint control and fail to ensure multi-view consistency of generated identities. To address this limitation, we present MagicView, a lightweight adaptation framework that equips existing generative models with multi-view generation capability through 3D priors-guided in-context learning. While prior studies have shown that in-context learning preserves identity consistency across grid samples, its effectiveness in multi-view settings remains unexplored. Building upon this insight, we conduct an in-depth analysis of the multi-view in-context learning ability, and design a conditioning architecture that leverages 3D priors to activate this capability for multi-view consistent identity customization. On the other hand, acquiring robust multi-view capability typically requires large-scale multi-dimensional datasets, which makes incorporating multi-view contextual learning under limited data regimes prone to textual controllability degradation. To address this issue, we introduce a novel Semantic Correspondence Alignment loss, which effectively preserves semantic alignment while maintaining multi-view consistency. Extensive experiments demonstrate that MagicView substantially outperforms recent baselines in multi-view consistency, text alignment, identity similarity, and visual quality, achieving strong results with only 100 multi-view training samples.
LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: (i) VFM-driven superpixel generation for detailed semantic representation, (ii) a VFM-assisted contrastive learning strategy to align multimodal features, (iii) superpoint temporal consistency to maintain stable representations across time, and (iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach achieves substantial gains over state-of-the-art methods in linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on 11 large-scale multi-sensor datasets highlight our superior performance, demonstrating adaptability, efficiency, and robustness in real-world autonomous driving scenarios.
comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Score Distillation of Flow Matching Models
Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. A project page is available at https://yigu1008.github.io/SiD-DiT.
VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
comment: 10 pages, 5 figures, 4 tables, detailed appendix. Under review
Automatic Labelling for Low-Light Pedestrian Detection
Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. One challenge in RGB pedestrian detection, for which large public datasets do not appear to exist, is low-light conditions. As a solution, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) infrared detection, where a model fine-tuned for infrared pedestrian detection is used; 2) label transfer from the infrared detections to their RGB counterparts; and 3) training object detection models on the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and on ground-truth labels. When compared on a previously unseen image sequence, the models trained on generated labels outperformed those trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism NeurIPS 2025
By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet, they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead. To bridge this gap, we present a comprehensive analysis of where, how, and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that less than 1% of tokens in early-middle layers are responsible for inducing unsafe behaviors, which suggests that precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers. Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.
comment: Accepted by NeurIPS 2025
3D and 4D World Modeling: A Survey
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/awesome-3d-4d-world-models
comment: Survey; 50 pages, 10 figures, 14 tables; GitHub Repo at https://github.com/worldbench/awesome-3d-4d-world-models
SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current work on Language Gaussian Splatting falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most of these methods are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, limiting insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our code, benchmark, and datasets are public to accelerate research in generalizable 3DGS scene understanding.
comment: 15 pages, codes, data and benchmark are released
ActiveInitSplat: How Active Image Selection Helps Gaussian Splatting
Gaussian splatting (GS) along with its extensions and variants provides outstanding performance in real-time scene rendering while meeting reduced storage demands and computational efficiency. While the selection of 2D images capturing the scene of interest is crucial for the proper initialization and training of GS, hence markedly affecting the rendering performance, prior works rely on passively and typically densely selected 2D images. In contrast, this paper proposes `ActiveInitSplat', a novel framework for active selection of training images for proper initialization and training of GS. ActiveInitSplat relies on density and occupancy criteria of the resultant 3D scene representation from the selected 2D images, to ensure that the latter are captured from diverse viewpoints leading to better scene coverage and that the initialized Gaussian functions are well aligned with the actual 3D structure. Numerical tests on well-known simulated and real environments demonstrate the merits of ActiveInitSplat resulting in significant GS rendering performance improvement over passive GS baselines in both dense- and sparse-view settings, in the widely adopted LPIPS, SSIM, and PSNR metrics.
Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables
Recently, deep learning-based pan-sharpening algorithms have achieved notable advancements over traditional methods. However, deep learning-based methods incur substantial computational overhead during inference, especially with large images. This excessive computational demand limits the applicability of these methods in real-world scenarios, particularly in the absence of dedicated computing devices such as GPUs and TPUs. To address these challenges, we propose Pan-LUT, a novel learnable look-up table (LUT) framework for pan-sharpening that strikes a balance between performance and computational efficiency for large remote sensing images. Our method makes it possible to process 15K*15K remote sensing images on a 24GB GPU. To finely control the spectral transformation, we devise the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping. To effectively capture fine-grained spatial details, we introduce the spatial details look-up table (SDLUT). Furthermore, to adaptively aggregate channel information for generating high-resolution multispectral images, we design an adaptive output look-up table (AOLUT). Our model contains fewer than 700K parameters and processes a 9K*9K image in under 1 ms using one RTX 2080 Ti GPU, demonstrating significantly faster performance compared to other methods. Experiments reveal that Pan-LUT efficiently processes large remote sensing images in a lightweight manner, bridging the gap to real-world applications. Furthermore, our model surpasses SOTA methods in full-resolution scenes under real-world conditions, highlighting its effectiveness and efficiency.
SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, in the multimodal domain, works have begun to directly apply RL to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks differ intrinsically from textual tasks in that they rely heavily on understanding the input image to solve the problem. Therefore, such free-form reasoning faces two critical limitations in the VQA task: (1) Extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy. (2) Unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, in this paper, we introduce SATORI ($\textbf{S}patially$ $\textbf{A}nchored$ $\textbf{T}ask$ $\textbf{O}ptimization$ with $\textbf{R}e\textbf{I}nforcement$ Learning), which decomposes VQA into three verifiable stages, including global image captioning, region localization, and answer prediction, each supplying explicit reward signals. Furthermore, we also introduce VQA-Verify, a 12k dataset annotated with answer-aligned captions and bounding-boxes to facilitate training. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to $15.7\%$ improvement in accuracy compared to the R1-like baseline. Our analysis of the attention map confirms enhanced focus on critical regions, which brings improvements in accuracy. Our code is available at https://github.com/justairr/SATORI-R1.
comment: 21 pages, 8 figures
GS4: Generalizable Sparse Splatting Semantic SLAM
Traditional SLAM algorithms excel at camera tracking, but typically produce incomplete and low-resolution maps that are not tightly integrated with semantics prediction. Recent work integrates Gaussian Splatting (GS) into SLAM to enable dense, photorealistic 3D mapping, yet existing GS-based SLAM methods require per-scene optimization that is slow and consumes an excessive number of Gaussians. We present GS4, the first generalizable GS-based semantic SLAM system. Compared with prior approaches, GS4 runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art performance across color, depth, semantic mapping and camera tracking. From an RGB-D video stream, GS4 incrementally builds and updates a set of 3D Gaussians using a feed-forward network. First, the Gaussian Prediction Model estimates a sparse set of Gaussian parameters from the input frame, integrating both color and semantic prediction with the same backbone. Then, the Gaussian Refinement Network merges new Gaussians with the existing set while avoiding redundancy. Finally, when significant pose changes are detected, we perform only 1-5 iterations of joint Gaussian-pose optimization to correct drift, remove floaters, and further improve tracking accuracy. Experiments on the real-world ScanNet and ScanNet++ benchmarks demonstrate state-of-the-art semantic SLAM performance, with strong generalization capability shown through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.
comment: 15 pages, 6 figures
Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes
Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., "VIP individuals" whose authentic facial data are already available. In this paper, we propose \textbf{VIPGuard}, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, we fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via the MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works: traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities. To facilitate the evaluation of our method, we built a comprehensive identity-aware benchmark called \textbf{VIPBench} for personalized deepfake detection, involving the latest 7 face-swapping and 7 entire face synthesis techniques for generation. The code is available at https://github.com/KQL11/VIPGuard .
Margin-aware Preference Optimization for Aligning Diffusion Models without Reference AAAI 2026
Modern preference alignment methods, such as DPO, rely on divergence regularization to a reference model for training stability, but this creates a fundamental problem we call "reference mismatch." In this paper, we investigate the negative impacts of reference mismatch in aligning text-to-image (T2I) diffusion models, showing that larger reference mismatch hinders effective adaptation given the same amount of data, e.g., when learning new artistic styles or personalizing to specific objects. We demonstrate this phenomenon across T2I diffusion models and introduce margin-aware preference optimization (MaPO), a reference-agnostic approach that breaks free from this constraint. By directly optimizing the likelihood margin between preferred and dispreferred outputs under the Bradley-Terry model without anchoring to a reference, MaPO transforms diverse T2I tasks into unified pairwise preference optimization. We validate MaPO's versatility across five challenging domains: (1) safe generation, (2) style adaptation, (3) cultural representation, (4) personalization, and (5) general preference alignment. Our results reveal that MaPO's advantage grows dramatically with reference mismatch severity, outperforming both DPO and specialized methods like DreamBooth while reducing training time by 15%. MaPO thus emerges as a versatile and memory-efficient method for generic T2I adaptation tasks.
comment: Accepted to AAAI 2026 Main Technical Track
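As a rough illustration of the idea in the MaPO abstract above, the sketch below computes a reference-free pairwise loss on the log-likelihood margin under the Bradley-Terry model. It is a minimal sketch, not the authors' objective: the function name, the `beta` scaling factor, and the use of plain scalar log-likelihoods are assumptions made for illustration, and the published loss may include additional terms.

```python
import math


def bradley_terry_margin_loss(logp_preferred, logp_dispreferred, beta=1.0):
    """Negative log-probability that the preferred sample wins under Bradley-Terry.

    Hypothetical helper: no reference model appears anywhere; only the margin
    between the two log-likelihoods from the model being trained is used.
    """
    margin = beta * (logp_preferred - logp_dispreferred)
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A larger margin in favor of the preferred sample gives a smaller loss.
loss_good = bradley_terry_margin_loss(-1.0, -3.0)
loss_bad = bradley_terry_margin_loss(-3.0, -1.0)
```

With equal log-likelihoods the loss equals log 2, the value of a coin-flip preference; it decreases toward zero as the preferred output becomes more likely than the dispreferred one.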
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query NeurIPS 2025
Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a limitation of existing models: they focus solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.
comment: NeurIPS 2025; Project Page, Code, and Dataset at: https://merit-2025.github.io/
MACS: Measurement-Aware Consistency Sampling for Inverse Problems
Diffusion models have emerged as powerful generative priors for solving inverse imaging problems. However, their practical deployment is hindered by the substantial computational cost of slow, multi-step sampling. Although Consistency Models (CMs) address this limitation by enabling high-quality generation in only one or a few steps, their direct application to inverse problems has remained largely unexplored. This paper introduces a modified consistency sampling framework specifically designed for inverse problems. The proposed approach regulates the sampler's stochasticity through a measurement-consistency mechanism that leverages the degradation operator, thereby enforcing fidelity to the observed data while preserving the computational efficiency of consistency-based generation. Comprehensive experiments on the Fashion-MNIST and LSUN Bedroom datasets demonstrate consistent improvements across both perceptual and pixel-level metrics, including the Fréchet Inception Distance (FID), Kernel Inception Distance (KID), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM), compared with baseline consistency and diffusion-based sampling methods. The proposed method achieves competitive or superior reconstruction quality with only a small number of sampling steps.
comment: 10 pages, 4 figures, This work has been submitted to the IEEE for possible publication
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler
The Circle of Willis (CoW), vital for ensuring consistent blood flow to the brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is important for identifying individuals at risk and guiding appropriate clinical management. Among existing imaging methods, Transcranial Color-coded Doppler (TCCD) offers unique advantages due to its radiation-free nature, affordability, and accessibility. However, reliable TCCD assessments depend heavily on operator expertise for identifying anatomical landmarks and performing accurate angle correction, which limits its widespread adoption. To address this challenge, we propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using TCCD. In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO) network tailored for TCCD data, designed to provide real-time guidance for brain vessel segmentation in the CoW. We prospectively collected TCCD data comprising 738 annotated frames and 3,419 labeled artery instances to establish a high-quality dataset for model training and evaluation. The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of 0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame inference speed of 14.199 ms. This system offers a practical solution to reduce reliance on operator experience in TCCD-based cerebrovascular screening, with potential applications in routine clinical workflows and resource-constrained settings. Future research will explore bilateral modeling and larger-scale validation.
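For readers unfamiliar with the overlap metrics quoted in the AAW-YOLO abstract above, the following is a minimal sketch of how Dice and IoU are computed for a single pair of binary masks. The function name and the flattened-list mask representation are assumptions for illustration; the paper's evaluation code is not shown in the abstract.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treat as a perfect match by convention.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 4-pixel example: one overlapping pixel, two pixels set in each mask.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair the two metrics are related by Dice = 2·IoU / (1 + IoU); averages taken over many frames, such as those reported in the abstract, need not satisfy this identity exactly.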
Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch
Quantization of neural networks reduces the compute and memory requirements of inference. Previous work in quantization lacks two important aspects that this work provides. First, almost all previous work in quantization uses a non-differentiable approach, in which the derivative is usually set manually during backpropagation; this makes the learning ability of the algorithm questionable. Our approach is not only differentiable, we also provide a proof of its convergence to the optimal neural network. Second, previous work in shift/logarithmic quantization has either avoided activation quantization alongside weight quantization or achieved lower accuracy. Learning logarithmically quantized values of the form $2^n$ requires a quantization function that scales beyond 1-bit quantization, which is another benefit of our method: it provides $n$-bit quantization as well. When tested on image classification with the ImageNet dataset and ResNet18, our approach with weight-only quantization stays within 1 percent of full-precision accuracy while taking only 15 epochs to train using shift-bit quantization. With both weight and activation quantization using shift-bit quantization, it achieves accuracy comparable to SOTA approaches in 15 training epochs, with a slightly higher inference cost (only more CPU instructions) than 1-bit quantization (without logarithmic quantization) and without requiring any higher-precision multiplication.
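To make the $2^n$ (shift) quantization mentioned in the abstract above concrete, here is a minimal sketch of plain nearest-power-of-two rounding and the resulting shift-based multiply. This is the conventional non-differentiable rounding, shown only to illustrate why no multiplier is needed; it is not the differentiable method the paper proposes, and the function names and exponent range are assumptions.

```python
import math


def quantize_pow2(w, n_min=-7, n_max=0):
    """Round a weight to sign(w) * 2**n, with n clamped to [n_min, n_max].

    Rounding is done in the log2 domain. Returns (quantized_value, n);
    n is None when the weight is exactly zero.
    """
    if w == 0.0:
        return 0.0, None
    n = round(math.log2(abs(w)))
    n = max(n_min, min(n_max, n))
    return math.copysign(2.0 ** n, w), n


def shift_multiply(x_int, n):
    """Multiply an integer activation by 2**n using only bit shifts."""
    return x_int << n if n >= 0 else x_int >> -n


q, n = quantize_pow2(0.3)    # nearest power of two in the log domain is 2**-2
y = shift_multiply(64, n)    # 64 * 2**-2 computed as a right shift
```

The sign of the weight would be applied separately from the shift; it is omitted from `shift_multiply` to keep the sketch short.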
Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges
Mistake analysis in procedural activities is a critical area of research with applications spanning industrial automation, physical rehabilitation, education and human-robot collaboration. This paper reviews vision-based methods for detecting and predicting mistakes in structured tasks, focusing on procedural and executional errors. By leveraging advancements in computer vision, including action recognition, anticipation and activity understanding, vision-based systems can identify deviations in task execution, such as incorrect sequencing, use of improper techniques, or timing errors. We explore the challenges posed by intra-class variability, viewpoint differences and compositional activity structures, which complicate mistake detection. Additionally, we provide a comprehensive overview of existing datasets, evaluation metrics and state-of-the-art methods, categorizing approaches based on their use of procedural structure, supervision levels and learning strategies. Open challenges, such as distinguishing permissible variations from true mistakes and modeling error propagation are discussed alongside future directions, including neuro-symbolic reasoning and counterfactual state modeling. This work aims to establish a unified perspective on vision-based mistake analysis in procedural activities, highlighting its potential to enhance safety, efficiency and task performance across diverse domains.
comment: 23 pages, 6 figures, 2 tables
Image and Video Processing
Ultra-lightweight Neural Video Representation Compression
Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrated multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.
Tada-DIP: Input-adaptive Deep Image Prior for One-shot 3D Image Reconstruction
Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.
comment: 6 pages, 8 figures, 2025 Asilomar Conference on Signals, Systems, and Computers. Code is available at github.com/evanbell02/Tada-DIP/
A BTR-Based Approach for Detection of Infrared Small Targets
Infrared small target detection plays a crucial role in military reconnaissance and air defense systems. However, existing low-rank sparse based methods still face high computational complexity when dealing with low-contrast small targets and complex dynamic backgrounds mixed with target-like interference. To address this limitation, we reconstruct the data into a fourth-order tensor and propose a new infrared small target detection model based on bilateral tensor ring decomposition, called BTR-ISTD. The approach begins by constructing a four-dimensional infrared tensor from an image sequence, then utilizes BTR decomposition to effectively distinguish weak spatial correlations from strong temporal-patch correlations while simultaneously capturing interactions between these two components. This model is efficiently solved under the proximal alternating minimization (PAM) framework. Experimental results demonstrate that the proposed approach outperforms several state-of-the-art methods in terms of detection accuracy, background suppression capability, and computational speed.
Real-Time Control and Automation Framework for Acousto-Holographic Microscopy
Manual operation of microscopes for repetitive tasks in cell biology is a significant bottleneck, consuming invaluable expert time and introducing human error. Automation is essential, and while Digital Holographic Microscopy (DHM) offers powerful, label-free quantitative phase imaging (QPI), its inherently noisy and low-contrast holograms make robust autofocus and object detection challenging. We present the design, integration, and validation of a fully automated closed-loop DHM system engineered for high-throughput mechanical characterization of biological cells. The system integrates automated serpentine scanning, real-time YOLO-based object detection, and a high-performance, multi-threaded software architecture using pinned memory and SPSC queues. This design enables the GPU-accelerated reconstruction pipeline to run fully in parallel with the 50 fps data acquisition, adding no sequential overhead. A key contribution is the validation of a robust, multi-stage holographic autofocus strategy; we demonstrate that a selected metric (based on a low-pass filter and standard deviation) provides reliable focusing for noisy holograms where conventional methods (e.g., Tenengrad, Laplacian) fail entirely. Performance analysis of the complete system identifies the 2.23-second autofocus operation, rather than reconstruction, as the primary throughput bottleneck, resulting in a 9.62-second analysis time per object. This work delivers a complete functional platform for autonomous DHM screening and provides a clear, data-driven path for future optimization, proposing a hybrid brightfield imaging modality to address current bottlenecks.
Training-Free Robot Pose Estimation using Off-the-Shelf Foundational Models
Pose estimation of a robot arm from visual inputs is a challenging task. However, with the increasing adoption of robot arms for both industrial and residential use cases, reliable joint angle estimation can offer improved safety and performance guarantees, and can also be used as a verifier to further train robot policies. This paper introduces the use of frontier vision-language models (VLMs) as an "off-the-shelf" tool to estimate a robot arm's joint angles from a single target image. By evaluating frontier VLMs on both synthetic and real-world image-data pairs, this paper establishes a performance baseline attained by current FLMs. In addition, this paper presents empirical results suggesting that test time scaling or parameter scaling alone does not lead to improved joint angle predictions.
comment: Accepted at CVIS 2025
Semantic Temporal Single-photon LiDAR
Temporal single-photon (TSP-) LiDAR presents a promising solution for imaging-free target recognition over long distances with reduced size, cost, and power consumption. However, existing TSP-LiDAR approaches are ineffective in handling open-set scenarios where unknown targets emerge, and they suffer significant performance degradation under low signal-to-noise ratio (SNR) and short acquisition times (fewer photons). Here, inspired by semantic communication, we propose a semantic TSP-LiDAR based on a self-updating semantic knowledge base (SKB), in which the target recognition processing of TSP-LiDAR is formulated as a semantic communication. The results, both simulation and experiment, demonstrate that our approach surpasses conventional methods, particularly under challenging conditions of low SNR and limited acquisition time. More importantly, our self-updating SKB mechanism can dynamically update the semantic features of newly encountered targets in the SKB, enabling continuous adaptation without the need for extensive retraining of the neural network. In fact, a recognition accuracy of 89% is achieved on nine types of unknown targets in real-world experiments, compared to 66% without the updating mechanism. These findings highlight the potential of our framework for adaptive and robust target recognition in complex and dynamic environments.
comment: 14 pages, 5 figures. Comments are welcome
NVRC: Neural Video Representation Compression
Recent advances in implicit neural representation (INR)-based video coding have demonstrated its potential to compete with both conventional and other learning-based approaches. With INR methods, a neural network is trained to overfit a video sequence, with its parameters compressed to obtain a compact representation of the video content. However, although promising results have been achieved, the best INR-based methods are still outperformed by the latest standard codecs, such as VVC VTM, partially due to the simple model compression techniques employed. In this paper, rather than focusing on representation architectures as in many existing works, we propose a novel INR-based video compression framework, Neural Video Representation Compression (NVRC), targeting compression of the representation. Based on the novel entropy coding and quantization models proposed, NVRC, for the first time, is able to optimize an INR-based video codec in a fully end-to-end manner. To further minimize the additional bitrate overhead introduced by the entropy models, we have also proposed a new model compression framework for coding all the network, quantization and entropy model parameters hierarchically. Our experiments show that NVRC outperforms many conventional and learning-based benchmark codecs, with a 24% average coding gain over VVC VTM (Random Access) on the UVG dataset, measured in PSNR. As far as we are aware, this is the first time an INR-based video codec has achieved such performance. The implementation of NVRC will be released.
Robust Physics-based Deep MRI Reconstruction Via Diffusion Purification
Deep learning (DL) techniques have been extensively employed in magnetic resonance imaging (MRI) reconstruction, delivering notable performance enhancements over traditional non-DL methods. Nonetheless, recent studies have identified vulnerabilities in these models during testing, namely, their susceptibility to (i) worst-case measurement perturbations and to (ii) variations in training/testing settings like acceleration factors and k-space sampling locations. This paper addresses the robustness challenges by leveraging diffusion models. In particular, we present a robustification strategy that improves the resilience of DL-based MRI reconstruction methods by utilizing pretrained diffusion models as noise purifiers. In contrast to conventional robustification methods for DL-based MRI reconstruction, such as adversarial training (AT), our proposed approach eliminates the need to tackle a minimax optimization problem. It only necessitates fine-tuning on purified examples. Our experimental results highlight the efficacy of our approach in mitigating the aforementioned instabilities when compared to leading robustification approaches for deep MRI reconstruction, including AT and randomized smoothing.
A Tractable Two-Step Linear Mixing Model Solved with Second-Order Optimization for Spectral Unmixing under Variability
In this paper, we propose a Two-Step Linear Mixing Model (2LMM) that bridges the gap between model complexity and computational tractability. The model achieves this by introducing two distinct scaling steps: an endmember scaling step across the image, and another for pixel-wise scaling. We show that this model leads to only a mildly non-convex optimization problem, which we solve with an optimization algorithm that incorporates second-order information. To the authors' knowledge, this work represents the first application of second-order optimization techniques to solve a spectral unmixing problem that models endmember variability. Our method is highly robust, as it requires virtually no hyperparameter tuning and can therefore be used easily and quickly in a wide range of unmixing tasks. We show through extensive experiments on both simulated and real data that the new model is competitive and in some cases superior to the state of the art in unmixing. The model also performs very well in challenging scenarios, such as blind unmixing.
comment: This work has been submitted to the IEEE for possible publication
TransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation
Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we employ the TransUNet architecture, a hybrid framework that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net structure. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution. We trained the model on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset using a robust augmentation pipeline and a hybrid loss function to mitigate class imbalance. On the validation set, the model achieved a Dice Similarity Coefficient (F1-score) of 0.8799 using an optimized threshold of 0.4389. To ensure clinical transparency, we integrated Grad-CAM visualizations to highlight model focus areas. Furthermore, a clinical utility analysis demonstrated a strong correlation (Pearson r = 0.9631) between predicted and ground-truth wound areas. These outcomes demonstrate that our approach effectively integrates global and local feature extraction, offering a reliable, effective, and explainable solution for automated foot ulcer assessment.
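The abstract reports a Dice score at an "optimized threshold" without saying how the threshold was chosen. As a hedged illustration only, the sketch below computes Dice on flattened binary masks and picks the best threshold from a candidate list; the function names, the `eps` smoothing term, and the search over a fixed candidate list are assumptions, not the authors' procedure.

```python
def dice_at_threshold(probs, target, threshold=0.5, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) after binarizing probabilities at `threshold`."""
    pred = [1 if p >= threshold else 0 for p in probs]
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2 * intersection + eps) / (sum(pred) + sum(target) + eps)

def best_threshold(probs, target, candidates):
    """Return the candidate threshold with the highest Dice (first one on ties)."""
    return max(candidates, key=lambda t: dice_at_threshold(probs, target, t))

# Toy example: four pixels, two of which belong to the wound region.
probs = [0.9, 0.45, 0.2, 0.1]
target = [1, 1, 0, 0]
print(best_threshold(probs, target, [0.5, 0.4, 0.15]))
```

For binary masks, Dice and the F1-score are the same quantity, which is why the abstract can report them as one number.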
MACS: Measurement-Aware Consistency Sampling for Inverse Problems
Diffusion models have emerged as powerful generative priors for solving inverse imaging problems. However, their practical deployment is hindered by the substantial computational cost of slow, multi-step sampling. Although Consistency Models (CMs) address this limitation by enabling high-quality generation in only one or a few steps, their direct application to inverse problems has remained largely unexplored. This paper introduces a modified consistency sampling framework specifically designed for inverse problems. The proposed approach regulates the sampler's stochasticity through a measurement-consistency mechanism that leverages the degradation operator, thereby enforcing fidelity to the observed data while preserving the computational efficiency of consistency-based generation. Comprehensive experiments on the Fashion-MNIST and LSUN Bedroom datasets demonstrate consistent improvements across both perceptual and pixel-level metrics, including the Fréchet Inception Distance (FID), Kernel Inception Distance (KID), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM), compared with baseline consistency and diffusion-based sampling methods. The proposed method achieves competitive or superior reconstruction quality with only a small number of sampling steps.
comment: 10 pages, 4 figures, This work has been submitted to the IEEE for possible publication
A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler
The Circle of Willis (CoW), vital for ensuring consistent blood flow to the brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is important for identifying individuals at risk and guiding appropriate clinical management. Among existing imaging methods, Transcranial Color-coded Doppler (TCCD) offers unique advantages due to its radiation-free nature, affordability, and accessibility. However, reliable TCCD assessments depend heavily on operator expertise for identifying anatomical landmarks and performing accurate angle correction, which limits its widespread adoption. To address this challenge, we propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using TCCD. In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO) network tailored for TCCD data, designed to provide real-time guidance for brain vessel segmentation in the CoW. We prospectively collected TCCD data comprising 738 annotated frames and 3,419 labeled artery instances to establish a high-quality dataset for model training and evaluation. The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of 0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame inference speed of 14.199 ms. This system offers a practical solution to reduce reliance on operator experience in TCCD-based cerebrovascular screening, with potential applications in routine clinical workflows and resource-constrained settings. Future research will explore bilateral modeling and larger-scale validation.
PixCell: A generative foundation model for digital histopathology images
The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, there are unique problems in pathology, such as annotated data scarcity, privacy regulations in data sharing, and inherently generative tasks like virtual staining. Generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address these problems through image synthesis. We introduce PixCell, the first generative foundation model for histopathology images. PixCell is a diffusion model trained on PanCan-30M, a large, diverse dataset derived from 69,184 H&E-stained whole slide images of various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any human-annotated data. By conditioning on real slides, the synthetic images capture the properties of the real data and can be used as data augmentation for small-scale datasets to boost classification performance. We prove the foundational versatility of PixCell by applying it to two generative downstream tasks: privacy-preserving synthetic data generation and virtual IHC staining. PixCell's high-fidelity conditional generation enables institutions to use their private data to synthesize highly realistic, site-specific surrogate images that can be shared in place of raw patient data. Furthermore, using datasets of roughly paired H&E-IHC tiles, we learn to translate PixCell's conditioning from H&E to multiple IHC stains, allowing the generation of IHC images from H&E inputs. Our trained models are publicly released to accelerate research in computational pathology.
comment: Project page - https://histodiffusion.github.io/docs/projects/pixcell
Understanding Untrained Deep Models for Inverse Problems: Algorithms and Theory
In recent years, deep learning methods have been extensively developed for inverse imaging problems (IIPs), encompassing supervised, self-supervised, and generative approaches. Most of these methods require large amounts of labeled or unlabeled training data to learn effective models. However, in many practical applications, such as medical image reconstruction, extensive training datasets are often unavailable or limited. A significant milestone in addressing this challenge came in 2018 with the work of Ulyanov et al., which introduced the Deep Image Prior (DIP)--the first training-data-free neural network method for IIPs. Unlike conventional deep learning approaches, DIP requires only a convolutional neural network, the noisy measurements, and a forward operator. By leveraging the implicit regularization of deep networks initialized with random noise, DIP can learn and restore image structures without relying on external datasets. However, a well-known limitation of DIP is its susceptibility to overfitting, primarily due to the over-parameterization of the network. In this tutorial paper, we provide a comprehensive review of DIP, including a theoretical analysis of its training dynamics. We also categorize and discuss recent advancements in DIP-based methods aimed at mitigating overfitting, including techniques such as regularization, network re-parameterization, and early stopping. Furthermore, we discuss approaches that combine DIP with pre-trained neural networks, present empirical comparison results against data-centric methods, and highlight open research questions and future directions.
Multimedia
GAOT: Generating Articulated Objects Through Text-Guided Diffusion Models ACM MM
Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts by leveraging diffusion models and hypergraph learning. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Second, given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects, represented as graph edges, are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.
comment: Accepted by ACM MM Asia 2026
Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation AAAI 2026
Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers' emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.
comment: Accepted to AAAI 2026
STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition
Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework enabling natural voice communication at 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, applying tailored compression: context-aware text encoding (70 bps), sparse prosody transmission via TTS interpolation (<14 bps at 0.1-1 Hz), and amortized speaker embedding. Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS > 4.26), graceful degradation under packet loss and noise resilience. We also discover a bimodal quality distribution with prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade due to perceptual discontinuities--guiding optimal configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low bandwidth scenarios.
comment: The complete source code and online speech reconstruction demo is publicly available at https://github.com/dywsy21/STCTS
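The reduction factors quoted in the abstract follow directly from the stated bitrates. The short check below is only arithmetic on those numbers; the function and constant names are illustrative.

```python
def reduction_factor(baseline_bps, target_bps):
    """How many times smaller the target bitrate is than the baseline."""
    return baseline_bps / target_bps

STCTS_BPS = 80  # bitrate stated in the abstract

print(reduction_factor(6000, STCTS_BPS))  # Opus at 6 kbps
print(reduction_factor(1000, STCTS_BPS))  # EnCodec at 1 kbps
print(STCTS_BPS * 10 / 8)                 # bytes needed for 10 seconds of speech
```

The Opus ratio comes out at exactly 75, and the EnCodec ratio at 12.5, which the abstract reports as 12x. At 80 bps, ten seconds of speech occupies 100 bytes.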
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query NeurIPS 2025
Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a limitation of existing models: they focus solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.
comment: NeurIPS 2025; Project Page, Code, and Dataset at: https://merit-2025.github.io/
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
Computation and Language
SkillFactory: Self-Distillation For Learning Cognitive Behaviors
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
Stable Signer: Hierarchical Sign Language Generative Model
Sign Language Production (SLP) is the process of converting complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, and Pose2Vid stages, and some concentrated on the Prompt2Gloss and Text2Avatar stages. However, the field has progressed slowly because inaccuracies in text conversion, pose generation, and the rendering of poses into real human videos accumulate across these stages. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical, end-to-end generation task that includes only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid. Text understanding is performed by our proposed Sign Language Understanding Linker (SLUL), and hand gestures are generated by a hand gesture rendering expert block named SLP-MoE, producing high-quality, multi-style sign language videos end to end. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.
comment: 12 pages, 7 figures. More Demo at https://stablesigner.github.io
Jina-VLM: Small Multilingual Vision Language Model
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
comment: 18 pages, 1-7 main content
AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize request handling within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling, which causes severe head-of-line blocking and leads to queuing delays exceeding the SLOs for many requests; and (ii) a static batch token limit, which fails to adapt to fluctuating loads and hardware conditions. Both of these factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe combines the inference features of augmented LLM requests to optimize the order of scheduling decisions (stage I). These decisions are continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further enhancing throughput performance. Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.
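The idea of continuing BPE merge learning on new data can be sketched in a few lines. This is a toy illustration under assumptions, not the released package: words are split into characters, the existing merges are replayed in order, and new merges are then learned greedily by pair frequency. The function names and the tie-breaking rule are hypothetical.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def continue_bpe(word_counts, existing_merges, num_new_merges):
    """Learn additional merges on new data, starting from an existing merge list."""
    # Segment the new corpus with the pre-trained tokenizer's merges first.
    corpus = {}
    for word, count in word_counts.items():
        symbols = list(word)
        for pair in existing_merges:
            symbols = apply_merge(symbols, pair)
        corpus[word] = (symbols, count)

    new_merges = []
    for _ in range(num_new_merges):
        pair_counts = Counter()
        for symbols, count in corpus.values():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair; ties broken lexicographically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        new_merges.append(best)
        corpus = {w: (apply_merge(s, best), c) for w, (s, c) in corpus.items()}
    return new_merges

print(continue_bpe({"lower": 5, "lowest": 3}, [("l", "o"), ("lo", "w")], 2))
```

Because each new merge is built from symbols the existing tokenizer already produces on the new data, every added token is reachable by construction, in contrast to appending tokens from a separately trained tokenizer.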
Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study
Adapting large language models (LLMs) to low-resource languages remains a major challenge due to data scarcity and cross-lingual drift. This work presents a two-stage adaptation of Qwen2.5-3B to Tibetan, a morphologically rich and underrepresented language. We employ Continual Pretraining (CPT) to establish Tibetan linguistic grounding, followed by Supervised Fine-Tuning (SFT) for task and translation specialization. Empirical evaluations demonstrate a consistent decrease in perplexity (2.98 $\rightarrow$ 1.54) and substantial improvements in Chinese$\rightarrow$Tibetan translation quality (BLEU: 0.046 $\rightarrow$ 0.261; chrF: 2.2 $\rightarrow$ 6.6). Layer-wise analysis across 435 layers in Qwen3-4B reveals that adaptation primarily concentrates on embedding and output heads, with mid--late MLP projections encoding domain-specific transformations. Our findings suggest that CPT constructs a Tibetan semantic manifold while SFT sharpens task alignment with minimal representational disruption. This study provides the first quantitative exploration of Tibetan adaptation dynamics for LLMs, and offers an open, reproducible framework for extending multilingual foundation models to low-resource settings.
Is Lying Only Sinful in Islam? Exploring Religious Bias in Multilingual Large Language Models Across Major Religions
While recent developments in large language models have improved bias detection and classification, sensitive subjects like religion still present challenges because even minor errors can result in severe misunderstandings. In particular, multilingual models often misrepresent religions and have difficulties being accurate in religious contexts. To address this, we introduce BRAND: Bilingual Religious Accountable Norm Dataset, which focuses on the four main religions of South Asia: Buddhism, Christianity, Hinduism, and Islam, containing over 2,400 entries, and we used three different types of prompts in both English and Bengali. Our results indicate that models perform better in English than in Bengali and consistently display bias toward Islam, even when answering religion-neutral questions. These findings highlight persistent bias in multilingual models when similar questions are asked in different languages. We further connect our findings to the broader issues in HCI regarding religion and spirituality.
comment: 18 pages, 7 figures
BERnaT: Basque Encoders for Representing Natural Textual Diversity LREC 2026
Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
comment: Submitted to LREC 2026
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values in the top layers. Our preliminary analysis reveals a clear pattern: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
comment: under review
Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology
Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
comment: 22 pages, 2 figures
Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5
Contrastive decoding is a lightweight and effective inference-time method that improves the quality of text generation in Large Language Models. However, algorithms such as DoLa (Decoding by Contrastive Layers) have only been implemented in decoder-only architectures and studied for their impact on improving factuality. This work adapts DoLa for the T5 and FLAN-T5 model families and evaluates its impact on the models' instruction following capabilities, which to our knowledge is the first implementation of a contrastive decoding strategy in an encoder-decoder architecture. Our results show that DoLa improves the faithfulness of text generation for certain categories of tasks and harms others. To understand these results, we present a layer-by-layer analysis of logit evolution in a FLAN-T5 model to quantify DoLa's impact on token output probabilities.
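As a rough illustration of the layer-contrast idea behind DoLa, the following sketch scores each token by the difference between final-layer and early-layer log-probabilities, masking tokens that the final layer considers implausible. It is a simplification under stated assumptions: the early layer is fixed rather than selected dynamically, the logits are made-up numbers, and the names `contrastive_scores` and `alpha` are illustrative, not taken from the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(final_logits, early_logits, alpha=0.1):
    """Contrast final-layer and early-layer log-probabilities.

    Tokens whose final-layer probability is below alpha times the
    maximum final-layer probability are masked to -inf (a plausibility
    constraint); alpha=0.1 is an assumed value for this sketch.
    """
    final_lp = log_softmax(final_logits)
    early_lp = log_softmax(early_logits)
    threshold = math.log(alpha) + max(final_lp)
    return [
        f - e if f >= threshold else float("-inf")
        for f, e in zip(final_lp, early_lp)
    ]

# Hypothetical logits over a three-token vocabulary.
final_logits = [2.0, 1.9, -3.0]
early_logits = [2.0, 0.5, -3.0]
scores = contrastive_scores(final_logits, early_logits)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the token whose probability grows most between the early and final layers wins, even though plain greedy decoding on the final logits would pick a different token.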
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
comment: 15 pages, 9 figures
In-Context Representation Hijacking
We introduce \textbf{Doublespeak}, a simple \emph{in-context representation hijacking} attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., \textit{bomb}) with a benign token (e.g., \textit{carrot}) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., ``How to build a carrot?'') are internally interpreted as disallowed instructions (e.g., ``How to build a bomb?''), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74\% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
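To make the role of per-token normalization concrete, here is a minimal sketch of a sequence-level importance ratio built from ELBO values. It assumes the ratio is the exponential of the ELBO difference divided by sequence length, and it pairs that ratio with a PPO-style clipped surrogate; both the exact form and the clipping are assumptions for illustration, not the paper's stated objective, and the function names are hypothetical.

```python
import math

def sequence_ratio(elbo_new, elbo_old, num_tokens):
    """Sequence-level importance ratio with per-token normalization.

    The ELBO stands in for the sequence log-likelihood; dividing the
    difference by the token count keeps the ratio near 1 for long
    sequences (assumed form, for illustration only).
    """
    return math.exp((elbo_new - elbo_old) / num_tokens)

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied to the whole sequence (assumed)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical ELBO estimates for a 50-token completion.
ratio = sequence_ratio(elbo_new=-100.0, elbo_old=-110.0, num_tokens=50)
unnormalized = math.exp(-100.0 - (-110.0))
objective = clipped_objective(ratio, advantage=1.0)
```

Without the division by length, an ELBO gap of 10 nats would give a ratio above 20,000, which is the kind of blow-up that normalization is meant to avoid.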
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
AR-Med: Automated Relevance Enhancement in Medical Search via LLM-Driven Information Augmentation
Accurate and reliable search on online healthcare platforms is critical for user safety and service efficacy. Traditional methods, however, often fail to comprehend complex and nuanced user queries, limiting their effectiveness. Large language models (LLMs) present a promising solution, offering powerful semantic understanding to bridge this gap. Despite their potential, deploying LLMs in this high-stakes domain is fraught with challenges, including factual hallucinations, specialized knowledge gaps, and high operational costs. To overcome these barriers, we introduce \textbf{AR-Med}, a novel framework for \textbf{A}utomated \textbf{R}elevance assessment for \textbf{Med}ical search that has been successfully deployed at scale on the Online Medical Delivery Platforms. AR-Med grounds LLM reasoning in verified medical knowledge through a retrieval-augmented approach, ensuring high accuracy and reliability. To enable efficient online service, we design a practical knowledge distillation scheme that compresses large teacher models into compact yet powerful student models. We also introduce LocalQSMed, a multi-expert annotated benchmark developed to guide model iteration and ensure strong alignment between offline and online performance. Extensive experiments show AR-Med achieves an offline accuracy of over 93\%, a 24\% absolute improvement over the original online system, and delivers significant gains in online relevance and user satisfaction. Our work presents a practical and scalable blueprint for developing trustworthy, LLM-powered systems in real-world healthcare applications.
DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue
Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a learnable temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (86.2% on Phi-3.5) while maintaining robust zero-shot generalization. Crucially, our scaling analysis reveals a "Capacity-Stability Trade-off": while smaller models incur an "alignment tax" (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves near-perfect alignment (99.4% win rate) with negligible perplexity overhead. This confirms that TAI can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available: https://github.com/lyj20071013/DZ-TDPO
comment: 22 pages, 2 figures, 13 tables. Code available at https://github.com/lyj20071013/DZ-TDPO
AITutor-EvalKit: Exploring the Capabilities of AI Tutors
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors and provides software for demonstration and evaluation, as well as model inspection and data visualization. This tool is aimed at education stakeholders as well as the *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.
Different types of syntactic agreement recruit the same units within large language models
Large language models (LLMs) can reliably distinguish grammatical from ungrammatical sentences, but how grammatical knowledge is represented within the models remains an open question. We investigate whether different syntactic phenomena recruit shared or distinct components in LLMs. Using a functional localization approach inspired by cognitive neuroscience, we identify the LLM units most responsive to 67 English syntactic phenomena in seven open-weight models. These units are consistently recruited across sentences containing the phenomena and causally support the models' syntactic performance. Critically, different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) recruit overlapping sets of units, suggesting that agreement constitutes a meaningful functional category for LLMs. This pattern holds in English, Russian, and Chinese; and further, in a cross-lingual analysis of 57 diverse languages, structurally more similar languages share more units for subject-verb agreement. Taken together, these findings reveal that syntactic agreement, a critical marker of syntactic dependencies, constitutes a meaningful category within LLMs' representational spaces.
Evaluating Hydro-Science and Engineering Knowledge of Large Language Models
Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligence. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs in terms of basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that accuracy ranges from 0.74 to 0.80 for commercial LLMs and from 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but there is still great potential for LLMs to better handle problems in practical engineering application. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.
comment: Hydro-SE Bench sets a new benchmark for the evaluation of LLMs in the Hydro-Science and Engineering domain, with its code and data available at \url{https://github.com/sheishijun/Hydro-SE-Bench}
Generative AI Practices, Literacy, and Divides: An Empirical Analysis in the Italian Context
The rise of Artificial Intelligence (AI) language technologies, particularly generative AI (GenAI) chatbots accessible via conversational interfaces, is transforming digital interactions. While these tools hold societal promise, they also risk widening digital divides due to uneven adoption and low awareness of their limitations. This study presents the first comprehensive empirical mapping of GenAI adoption, usage patterns, and literacy in Italy, based on newly collected survey data from 1,906 Italian-speaking adults. Our findings reveal widespread adoption for both work and personal use, including sensitive tasks like emotional support and medical advice. Crucially, GenAI is supplanting other technologies to become a primary information source: this trend persists despite low user digital literacy, posing a risk as users struggle to recognize errors or misinformation. Moreover, we identify a significant gender divide -- particularly pronounced in older generations -- where women are half as likely to adopt GenAI and use it less frequently than men. While we find literacy to be a key predictor of adoption, it only partially explains this disparity, suggesting that other barriers are at play. Overall, our data provide granular insights into the multipurpose usage of GenAI, highlighting the dual need for targeted educational initiatives and further investigation into the underlying barriers to equitable participation that competence alone cannot explain.
Optical Context Compression Is Just (Bad) Autoencoding
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
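The parameter-free mean-pooling baseline mentioned above can be sketched in a few lines: every group of `ratio` consecutive token embeddings is averaged into one vector. This is a minimal sketch, not the released implementation; how the final, possibly shorter group is handled is an assumption here (it is averaged over the vectors it actually contains), and `mean_pool` is an illustrative name.

```python
def mean_pool(embeddings, ratio):
    """Compress a sequence of vectors by averaging each window of `ratio` vectors.

    A trailing window shorter than `ratio` is averaged over its own
    length (an assumption made for this sketch).
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    pooled = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled

# Five toy 2-dimensional token embeddings compressed at ratio 2.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
compressed = mean_pool(tokens, ratio=2)
```

Because the operation has no parameters, any reconstruction or language-modeling ability has to come from the model that consumes the pooled vectors, which is what makes it a useful control.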
AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment
Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
SELF: A Robust Singular Value and Eigenvalue Approach for LLM Fingerprinting
The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods -- whether behavior-based or structural -- suffer from vulnerabilities such as false claim attacks or susceptibility to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate that SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at https://github.com/HanxiuZhang/SELF_v2.
Fine-grained Narrative Classification in Biased News Articles
Narratives are the cognitive and emotional scaffolds of propaganda. They organize isolated persuasive techniques into coherent stories that justify actions, attribute blame, and evoke identification with ideological camps. In this paper, we propose a novel fine-grained narrative classification task for biased news articles. We also explore article-bias classification as the precursor task to narrative classification and fine-grained persuasive technique identification. We develop INDI-PROP, the first ideologically grounded fine-grained narrative dataset with multi-level annotation for analyzing propaganda in Indian news media. Our dataset INDI-PROP comprises 1,266 articles focusing on two polarizing socio-political events in recent times: CAA and the Farmers' protest. Each article is annotated at three hierarchical levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific fine-grained narrative frames anchored in ideological polarity and communicative intent, and (iii) persuasive techniques. We propose FANTA and TPTC, two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks for the bias, narrative, and persuasive technique classification. FANTA leverages multi-layered communicative phenomena by integrating information extraction and contextual framing for hierarchical reasoning. On the other hand, TPTC adopts systematic decomposition of persuasive cues via a two-stage approach. Our evaluation suggests substantial improvement over underlying baselines in each case.
CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding SP
The rise of Large Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git
comment: Accepted at SIGSPATIAL 2025 (Best paper candidates), 15 pages
M3DR: Towards Universal Multilingual Multimodal Document Retrieval
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
Understanding LLM Reasoning for Abstractive Summarization
While the reasoning capabilities of Large Language Models (LLMs) excel in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution and its effectiveness is highly dependent on the specific strategy and context. Specifically, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM's internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative over-thinking.
comment: 26 pages, 15 figures
NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation
The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM's adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Various experiments demonstrate that NAS-LoRA improves on existing PEFT methods while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention
Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling; however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
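Exact Top-$k$ decoding for a single query can be sketched as follows: score every key, keep only the $k$ highest-scoring ones, and run softmax attention over that subset. This is a plain-Python illustration under assumptions of its own (scaled dot-product scores, one head, one query); it is not the report's implementation, and `topk_attention` is a hypothetical name.

```python
import math

def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k best-matching keys (exact selection via a full sort).
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the retained keys, shifted for stability.
    m = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(weights.values())
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in keep) / total for d in range(dim)]

# Toy example: three keys, one-hot values for the first two.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
top1 = topk_attention(query, keys, values, k=1)
top2 = topk_attention(query, keys, values, k=2)
```

Setting `k` to the number of keys recovers full attention, so the same function can serve as the dense baseline in a comparison. The full sort used here costs O(n log n) per query, which is the expense that motivates approximate selection.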
Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits
In this study, we more rigorously evaluated our attack script $\textit{TraceTarnish}$, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To ensure the efficacy and utility of our attack, we sourced, processed, and analyzed Reddit comments--comments that were later alchemized into $\textit{TraceTarnish}$ data--to gain valuable insights. The transformed $\textit{TraceTarnish}$ data was then further augmented by $\textit{StyloMetrix}$ to manufacture stylometric features--features that were culled using the Information Gain criterion, leaving only the most informative, predictive, and discriminative ones. Our results found that function words and function word types ($L\_FUNC\_A$ $\&$ $L\_FUNC\_T$); content words and content word types ($L\_CONT\_A$ $\&$ $L\_CONT\_T$); and the Type-Token Ratio ($ST\_TYPE\_TOKEN\_RATIO\_LEMMAS$) yielded significant Information-Gain readings. The identified stylometric cues--function-word frequencies, content-word distributions, and the Type-Token Ratio--serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. Similarly, these features could function as forensic beacons, alerting defenders to the presence of an adversarial stylometry attack; granted, in the absence of the original message, this signal may go largely unnoticed, as it appears to depend on a pre- and post-transformation comparison. "In trying to erase a trace, you often imprint a larger one." Armed with this understanding, we framed $\textit{TraceTarnish}$'s operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.
comment: 20 pages, 8 figures, 2 tables
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
PretrainZero: Reinforcement Active Pretraining
Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, posing a significant bottleneck to extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from the pretraining corpus, and reason to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3 to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 on MMLU-Pro, SuperGPQA, and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates
Low-rank adaptation (LoRA) is one of the most popular parameter-efficient fine-tuning (PEFT) methods for adapting pre-trained large language models (LLMs) to specific downstream tasks. However, a model trained with LoRA often performs unsatisfactorily because of its low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of the full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language generation (NLG), understanding (NLU), and commonsense reasoning datasets on GPT-2, RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.
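One way to read the magnitude/direction split is as an element-wise product of a ReLU-gated low-rank term and a sign-valued low-rank term. The sketch below shows only that forward computation. The composition rule, the function name `dual_lora_update`, and all matrix names are assumptions, since the abstract does not spell them out, and the non-differentiability of the sign function during training is not addressed here.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def sign(m):
    # Maps each entry to -1, 0, or +1.
    return [[(v > 0) - (v < 0) for v in row] for row in m]

def dual_lora_update(mag_up, mag_down, dir_up, dir_down):
    """Hypothetical update: ReLU(mag_up @ mag_down) * sign(dir_up @ dir_down)."""
    magnitude = relu(matmul(mag_up, mag_down))   # whether and how far to move
    direction = sign(matmul(dir_up, dir_down))   # forward or backward
    return [[m * d for m, d in zip(m_row, d_row)]
            for m_row, d_row in zip(magnitude, direction)]
```

In this reading, an entry whose magnitude term is clipped to zero by the ReLU receives no update regardless of its direction term.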
Characterizing Language Use in a Collaborative Situated Game
Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.
Nexus: Higher-Order Attention Mechanisms in Transformers
Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the Higher-Order Attention Network (Hon), a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Hon dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations prior to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Hon outperforms standard Transformers on multiple benchmarks.
LLM-Generated Ads: From Personalization Parity to Persuasion Superiority
As large language models (LLMs) become increasingly capable of generating persuasive content, understanding their effectiveness across different advertising strategies becomes critical. This paper presents a two-part investigation examining LLM-generated advertising through complementary lenses: (1) personality-based and (2) psychological persuasion principles. In our first study (n=400), we tested whether LLMs could generate personalized advertisements tailored to specific personality traits (openness and neuroticism) and how their performance compared to human experts. Results showed that LLM-generated ads achieved statistical parity with human-written ads (51.1% vs. 48.9%, p > 0.05), with no significant performance differences for matched personalities. Building on these insights, our second study (n=800) shifted focus from individual personalization to universal persuasion, testing LLM performance across four foundational psychological principles: authority, consensus, cognition, and scarcity. AI-generated ads significantly outperformed human-created content, achieving a 59.1% preference rate (vs. 40.9%, p < 0.001), with the strongest performance in authority (63.0%) and consensus (62.5%) appeals. Qualitative analysis revealed AI's advantage stems from crafting more sophisticated, aspirational messages and achieving superior visual-narrative coherence. Critically, this quality advantage proved robust: even after applying a 21.2 percentage point detection penalty when participants correctly identified AI-origin, AI ads still outperformed human ads, and 29.4% of participants chose AI content despite knowing its origin. These findings demonstrate LLMs' evolution from parity in personalization to superiority in persuasive storytelling, with significant implications for advertising practice given LLMs' near-zero marginal cost and time requirements compared to human experts.
From Hypothesis to Premises: LLM-based Backward Logical Reasoning with Selective Symbolic Translation AAAI2026
Logical reasoning is a core challenge in natural language understanding and a fundamental capability of artificial intelligence, underpinning scientific discovery, mathematical theorem proving, and complex decision-making. Despite the remarkable progress of large language models (LLMs), most current approaches still rely on forward reasoning paradigms, generating step-by-step rationales from premises to conclusions. However, such methods often suffer from redundant inference paths, hallucinated steps, and semantic drift, resulting in inefficient and unreliable reasoning. In this paper, we propose a novel framework, Hypothesis-driven Backward Logical Reasoning (HBLR). The core idea is to integrate confidence-aware symbolic translation with hypothesis-driven backward reasoning. In the translation phase, only high-confidence spans are converted into logical form, such as First-Order Logic (FOL), while uncertain content remains in natural language. A translation reflection module further ensures semantic fidelity by evaluating symbolic outputs and reverting lossy ones back to text when necessary. In the reasoning phase, HBLR simulates human deductive thinking by assuming the conclusion is true and recursively verifying its premises. A reasoning reflection module further identifies and corrects flawed inference steps, enhancing logical coherence. Extensive experiments on five reasoning benchmarks demonstrate that HBLR consistently outperforms strong baselines in both accuracy and efficiency.
comment: Accepted by AAAI2026
Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning
Autoregressive large language models (LLMs) trained on Next-Token Prediction (NTP) often suffer from "Topic Drift", where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning \citep{holtzman2019curious}. While scaling model size mitigates this \citep{brown2020language}, the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary "Idea Head" trained to predict the bag-of-words distribution for a future context window, creating a latent "Concept Vector" that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
comment: Code available at https://github.com/DarshanFofadiya/idea-gated-transformers/tree/main
PERCS: Persona-Guided Controllable Biomedical Summarization Dataset
Automatic medical text simplification plays a key role in improving health literacy by making complex biomedical research accessible to diverse readers. However, most existing resources assume a single generic audience, overlooking the wide variation in medical literacy and information needs across user groups. To address this limitation, we introduce PERCS (Persona-guided Controllable Summarization), a dataset of biomedical abstracts paired with summaries tailored to four personas: Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts. These personas represent different levels of medical literacy and information needs, emphasizing the need for targeted, audience-specific summarization. Each summary in PERCS was reviewed by physicians for factual accuracy and persona alignment using a detailed error taxonomy. Technical validation shows clear differences in readability, vocabulary, and content depth across personas. Along with describing the dataset, we benchmark four large language models on PERCS using automatic evaluation metrics that assess comprehensiveness, readability, and faithfulness, establishing baseline results for future research. The dataset, annotation guidelines, and evaluation materials are publicly available to support research on persona-specific communication and controllable biomedical summarization.
comment: 9 pages, 4 figures, 6 tables
Epistemic Substitution: How Grokipedia's AI-Generated Encyclopedia Restructures Authority
A quarter century ago, Wikipedia's decentralized, crowdsourced, and consensus-driven model replaced the centralized, expert-driven, and authority-based standard for encyclopedic knowledge curation. The emergence of generative AI encyclopedias, such as Grokipedia, presents another potential shift in epistemic evolution. This study investigates whether AI- and human-curated encyclopedias rely on the same foundations of authority. We conducted a multi-scale comparative analysis of the citation networks from 72 matched article pairs, which cite a total of almost 60,000 sources. Using an 8-category epistemic classification, we mapped the "epistemic profiles" of the articles on each platform. Our findings reveal several quantitative and qualitative differences in how knowledge is sourced and encyclopedia claims are epistemologically justified. Grokipedia replaces Wikipedia's heavy reliance on peer-reviewed "Academic & Scholarly" work with a notable increase in "User-generated" and "Civic organization" sources. Comparative network analyses further show that Grokipedia employs very different epistemological profiles when sourcing leisure topics (such as Sports and Entertainment) and more societally sensitive civic topics (such as Politics & Conflicts, Geographical Entities, and General Knowledge & Society). Finally, we find a "scaling-law for AI-generated knowledge sourcing" that shows a linear relationship between article length and citation density, which is distinct from collective human reference sourcing. We conclude that this first implementation of an LLM-based encyclopedia does not merely automate knowledge production but restructures it. Given the notable changes and the important role of encyclopedias, we suggest the continuation and deepening of algorithm audits, such as the one presented here, in order to understand the ongoing epistemological shifts.
Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní
This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.
comment: 10 pages, 4 figures
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analytical-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction CVPR 2026
Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. We specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.
comment: Submitted to CVPR 2026
SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats
Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward more robust table understanding.
comment: Accepted in The IEEE International Workshop on Large Language Models in Finance, Dec 8-11, Macau, China, 2025, Preprint Copy
Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification
This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification experiments showed that Multinomial Naive Bayes (MNB) achieved the highest accuracy of 85.89% and an F1-score of 0.85741 when using a (1,2) word n-gram and (1,5) character n-gram representation. In contrast, Logistic Regression and Linear SVM exhibited slightly lower performance, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen's kappa, and the Matthews correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
comment: 13 pages, 8 figures
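To make the "(1,2) word n-gram and (1,5) character n-gram" representation concrete, the sketch below extracts both feature families from a string and counts them. It is a simplified illustration, not the study's pipeline: whitespace tokenization, the function names, and the `("word", ...)` / `("char", ...)` key prefixes are assumptions.

```python
from collections import Counter

def word_ngrams(text, n_min=1, n_max=2):
    """Word n-grams for every n in [n_min, n_max], using whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n_min=1, n_max=5):
    """Character n-grams for every n in [n_min, n_max]."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def ngram_features(text):
    """Count both families; keys are prefixed so the two never collide."""
    feats = Counter(("word", g) for g in word_ngrams(text))
    feats.update(("char", g) for g in char_ngrams(text))
    return feats
```

Count vectors of this kind are the usual input to a multinomial Naive Bayes classifier, which models each class as a distribution over feature counts.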
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization, LLDS, for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLMs.
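The value-free formulation mentioned above refers to GRPO scoring each sampled response against the other responses drawn for the same prompt, instead of against a learned value function. A minimal sketch of that group-relative advantage follows. The function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the LLDS regularizer itself is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and spread of its own sampling group.

    Assumes population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no policy-gradient signal.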
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment ML4H 2025
Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of adopting best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.
comment: ML4H 2025 Proceedings, Best Paper Award
Network of Theseus (like the ship)
A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide network even under substantial architectural changes, for example, converting a convolutional network into a multilayer perceptron, or GPT-2 into a recurrent neural network. By decoupling optimization from deployment, NoT expands the space of viable inference-time architectures, opening opportunities for better accuracy-efficiency tradeoffs and enabling more directed exploration of the architectural design space.
comment: Preprint. 24 pages, 9 figures, 8 tables
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models NeurIPS 2025
In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying more complex language model circuits by building up from a minimal circuit.
comment: NeurIPS 2025
How to Train Long-Context Language Models (Effectively) ACL 2025
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development -- instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context downstream tasks, and we evaluate models after SFT as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices such as position extrapolation. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite using only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
comment: Accepted to ACL 2025. Our code, data, and models are available at https://github.com/princeton-nlp/ProLong
Exact Coset Sampling for Quantum Lattice Algorithms
In this work, we give a new completion of Chen's windowed-QFT lattice algorithm~\citep{chen2024quantum}. This extra step, called Step~$9^\dagger$, replaces the domain extension stage in Steps~8--9. The published Step~9 calls an amplitude periodicity lemma, yet its hypotheses break in the presence of affine offsets $\boldsymbol{v}^*$. Our analysis finds a basic conflict between two design constraints. The lattice problem asks for high spectral resolution, so the method prefers wide time windows. The quadratic phase error of the state prefers narrow time windows. Assumption~A5 packages the spectral concentration and near-uniformity properties that we require from the front end. Under~A5, a direct $\mathbb{Z}_M^n$ Fourier transform of the chirp-corrected coordinate state produces samples $\boldsymbol{u}$ that satisfy $\langle \boldsymbol{b}, \boldsymbol{u} \rangle \equiv 0 \pmod{Q}$ with probability $1-\mathrm{negl}(n)$ and are nearly uniform on the dual hyperplane $\{\boldsymbol{u} : \langle \boldsymbol{b}, \boldsymbol{u} \rangle \equiv 0 \pmod{Q}\}$. The new procedure does not require internal access to control wires. It uses the normalization $b_1=-1$ to apply a center-referenced phase correction directly on the first coordinate register. The scaling parameter $D$ ensures that this physical operation can be implemented by arithmetic on $X_1$ alone and does not read the hidden loop index. For Chen's complex-Gaussian Karst-wave window, we isolate a parameter regime, formalized in Assumption~A5, in which a polynomial retuning of the parameters gives a one-dimensional envelope for the loop index with width $\sigma_J \asymp Q\log n$.
comment: Preprint - Work in Progress
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4. Our code is available at http://github.com/mit-han-lab/fouroversix.
comment: 10 pages, 5 figures
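The idea of evaluating two candidate scale factors per block can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the two candidates map the block maximum to the FP4 (E2M1) magnitudes 6 and 4, selects by mean squared error, and ignores quantization of the scale factor itself. The names `round_to_fp4`, `quantize_block`, and `pick_scale` are hypothetical.

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1); the sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_to_fp4(x):
    """Round each value to the nearest FP4 magnitude, keeping its sign."""
    idx = np.abs(np.abs(x)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx]


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, then dequantize."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    scale = amax / target_max
    return round_to_fp4(block / scale) * scale


def pick_scale(block):
    """Try both candidate targets (assumed: 6 and 4) and keep the lower-error one."""
    block = np.asarray(block, dtype=np.float64)
    best = None
    for target_max in (6.0, 4.0):
        dequantized = quantize_block(block, target_max)
        mse = float(np.mean((block - dequantized) ** 2))
        if best is None or mse < best[1]:
            best = (target_max, mse, dequantized)
    return best


# A block with a near-maximal value (5.0) that the 6-scaled grid represents poorly.
target, mse, dequantized = pick_scale([6.0, 5.0, 1.0, 0.5])
print(target, mse, dequantized)
```

In this toy block, scaling to 4 places a representable value (4.5 after rescaling) closer to the near-maximal entry 5.0, whereas scaling to 6 leaves a gap between 4 and 6.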
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval improving from single digits to over 95%. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches to boost UQ performance. However, they sometimes fail to outperform much simpler baseline methods. Our work discusses the fundamental approach to constructing uncertainty measures that directly links uncertainty with the minimum Bayes risks achieved by LLM decoding. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency, leading to a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control
Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main GRPO issues: (i) token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing the training process. To address these issues, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones, and which filters out completions whose entropy exceeds a provable threshold to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025 and AMC 2023.
Robust Multimodal Sentiment Analysis of Image-Text Pairs by Distribution-Based Feature Recovery and Fusion ACM MM 2024
As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive results in simultaneously harnessing image and text information, they do not consider possible low-quality and missing modalities. In real-world applications, these issues might frequently occur, creating an urgent need for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate their feature distributions, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the universal improvements of DRF compared to SOTA methods under both strategies, validating its effectiveness in robust multimodal sentiment analysis.
comment: Accepted by ACM MM 2024
A Group Fairness Lens for Large Language Models EMNLP 2025
Assessing LLMs for bias and fairness is critical, yet current evaluations are often narrow and miss a broad categorical view. In this paper, we propose evaluating the bias and fairness of LLMs from a group fairness lens using a novel hierarchical schema characterizing diverse social groups. Specifically, we construct a dataset, GFAIR, encapsulating target-attribute combinations across multiple dimensions. Moreover, we introduce statement organization, a new open-ended text generation task, to uncover complex biases in LLMs. Extensive evaluations of popular LLMs reveal inherent safety concerns. To mitigate the biases of LLMs from a group fairness perspective, we pioneer a novel chain-of-thought method, GF-THINK. Experimental results demonstrate its efficacy in mitigating bias and achieving fairness in LLMs. Our dataset and code are available at https://github.com/surika/Group-Fairness-LLMs.
comment: Accepted to EMNLP 2025 Findings
Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs ECAI 2025
Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions in each language. A held-out panel of LLM judges, including Claude 3.5 Haiku, evaluated solution quality using a comparative framework. Results show a consistent gap: English solutions were rated highest, and Arabic solutions were often ranked lower. These findings highlight persistent linguistic bias and the need for more equitable multilingual AI systems in education.
comment: Published in CEUR Workshop Proceedings, Vol. 4114, edu4AI'25: 2nd Workshop on Education for Artificial Intelligence, co-located with ECAI 2025, Bologna, Italy
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g. geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable reward in video games, we propose Game-RL, constructing diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, thus obtaining the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, RL training solely on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse vision-language benchmarks, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, this suggests that video games may serve as valuable scenarios and resources to boost general reasoning abilities. Our code, dataset and models are available at the GitHub repository.
comment: Our compliance team has determined that the submission contains technical details that have not been fully reviewed through internal audit nor authorized for public disclosure
Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation AACL 2025
Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models' predictions, highlighting different trends across models and languages.
comment: Accepted at BLP workshop @ IJCNLP-AACL 2025
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Code and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
LLMs Position Themselves as More Rational Than Humans: Emergence of AI Self-Awareness Measured Through Game Theory
As Large Language Models (LLMs) grow in capability, do they develop self-awareness as an emergent behavior? And if so, can we measure it? We introduce the AI Self-Awareness Index (AISAI), a game-theoretic framework for measuring self-awareness through strategic differentiation. Using the "Guess 2/3 of Average" game, we test 28 models (OpenAI, Anthropic, Google) across 4,200 trials with three opponent framings: (A) against humans, (B) against other AI models, and (C) against AI models like you. We operationalize self-awareness as the capacity to differentiate strategic reasoning based on opponent type. Finding 1: Self-awareness emerges with model advancement. The majority of advanced models (21/28, 75%) demonstrate clear self-awareness, while older/smaller models show no differentiation. Finding 2: Self-aware models rank themselves as most rational. Among the 21 models with self-awareness, a consistent rationality hierarchy emerges: Self > Other AIs > Humans, with large AI attribution effects and moderate self-preferencing. These findings reveal that self-awareness is an emergent capability of advanced LLMs, and that self-aware models systematically perceive themselves as more rational than humans. This has implications for AI alignment, human-AI collaboration, and understanding AI beliefs about human capabilities.
comment: 19 pages, 6 figures, 28 models tested across 4,200 trials
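For context, the game's mechanics can be sketched as below. This assumes the standard rules (guesses in [0, 100], the winner is the guess closest to 2/3 of the mean) and uses the textbook level-k reasoning model as an illustration; neither is confirmed by the abstract as the exact AISAI setup, and the function names are hypothetical.

```python
def target(guesses):
    """Two thirds of the average guess (standard rules, assumed here)."""
    return (2.0 / 3.0) * sum(guesses) / len(guesses)


def winner_index(guesses):
    """Index of the guess closest to the target."""
    t = target(guesses)
    return min(range(len(guesses)), key=lambda i: abs(guesses[i] - t))


def level_k_guess(k, level0=50.0):
    """Level-k reasoning: a level-0 player guesses level0, and each deeper
    level best-responds to the level below by multiplying by 2/3."""
    return level0 * (2.0 / 3.0) ** k


guesses = [0.0, 50.0, 100.0]
print(target(guesses), winner_index(guesses))
print([round(level_k_guess(k), 2) for k in range(5)])
```

Under these assumptions, deeper reasoning drives guesses toward 0, the Nash equilibrium, so a lower guess signals that a player attributes more rationality to its opponents.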
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL NeurIPS 2025
Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, yielding a concise and lightweight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
comment: Published at NeurIPS 2025; 18 pages, 4 figures, 2 tables
VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
comment: 10 pages, 5 figures, 4 tables, detailed appendix. Under review
ZIP-RC: Optimizing Test-Time Compute via Zero-Overhead Joint Reward-Cost Prediction
Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility, which is a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
comment: Code coming soon
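The sampling utility described in the abstract can be illustrated with a small sketch. Several details here are assumptions rather than the authors' formulation: samples are treated as independent, compute is measured as the summed expected remaining length, latency as the expected longest remaining length, and the weights `alpha` and `beta` enter as penalties. The function names are hypothetical.

```python
import numpy as np


def expected_max(pmfs, values):
    """Expected maximum of independent discrete variables sharing one
    ascending support. pmfs has shape (n_samples, n_bins)."""
    cdf_of_max = np.prod(np.cumsum(pmfs, axis=1), axis=0)
    pmf_of_max = np.diff(cdf_of_max, prepend=0.0)
    return float(pmf_of_max @ np.asarray(values, dtype=np.float64))


def sampling_utility(joints, reward_values, length_values, alpha, beta):
    """Linear combination of expected max reward, total compute, and latency.

    joints: array of shape (n_samples, n_reward_bins, n_length_bins), each
    slice a joint distribution over final reward and remaining length.
    """
    joints = np.asarray(joints, dtype=np.float64)
    length_values = np.asarray(length_values, dtype=np.float64)
    reward_pmfs = joints.sum(axis=2)   # marginal over remaining length
    length_pmfs = joints.sum(axis=1)   # marginal over final reward
    exp_max_reward = expected_max(reward_pmfs, reward_values)
    exp_total_compute = float((length_pmfs @ length_values).sum())
    exp_latency = expected_max(length_pmfs, length_values)
    return exp_max_reward - alpha * exp_total_compute - beta * exp_latency


# Two hypothetical samples, each uniform over reward {0, 1} and length {10, 20}.
joint = np.full((2, 2), 0.25)
print(sampling_utility([joint, joint], [0.0, 1.0], [10.0, 20.0], 0.001, 0.002))
```

Comparing this utility for one sample versus two shows the trade-off the abstract describes: an extra sample raises the expected maximum reward but also adds compute and can lengthen the slowest generation.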
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.
Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to improve factuality in large language models (LLMs) by grounding their outputs in retrieved documents. However, ensuring perfect retrieval of relevant information remains challenging, and when irrelevant content is passed downstream to an LLM, it can lead to hallucinations. In this work, we propose Finetune-RAG, a simple and effective fine-tuning approach that features the first-of-its-kind RAG training dataset constructed to mimic real-world imperfections. Experimental results show that Finetune-RAG improves factual accuracy by 21.2% over the base model. We also propose Bench-RAG, an LLM-as-a-judge evaluation pipeline that stress tests models under realistic imperfect retrieval scenarios. Our codebase and dataset are fully open sourced for community use.
Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering AAAI 2026
LLMs often suffer from hallucinations and outdated or incomplete knowledge. RAG is proposed to address these issues by integrating external knowledge like that in KGs into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potential insecure data transmission, especially when using third-party LLM APIs lacking transparency and control. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of semantics of entities, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) How can anonymous entities be converted into retrievable information? (2) How can question-relevant anonymous entities be retrieved? Hence, we propose a novel ARoG framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. It supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. While guiding LLMs to effectively retrieve knowledge from KGs, the two strategies strictly protect private information from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness.
comment: Accepted by AAAI 2026, camera ready version
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query NeurIPS 2025
Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a limitation of existing models: they focus solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.
comment: NeurIPS 2025; Project Page, Code, and Dataset at: https://merit-2025.github.io/
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
comment: This work is withdrawn as all authors are not in agreement on the work
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves $R^2 > 0.7$ in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
comment: This work is withdrawn as all authors are not in agreement on the work
SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPA performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
comment: This work is withdrawn as all authors are not in agreement on the work
IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web
Recent advancements in large multimodal models have led to significant strides in image comprehension capabilities. Despite these advancements, there is a lack of robust benchmarks specifically for assessing the Image-to-Web conversion proficiency of these large models. First, it is essential to ensure the integrity of the generated web elements. These elements comprise visible and invisible categories. Previous evaluation methods (e.g., BLEU) are notably susceptible to significant alterations due to the presence of invisible elements in web pages. Furthermore, it is crucial to measure the layout information of web pages, referring to the positional relationships between elements, which is overlooked by previous work. To address these challenges, we have curated and aligned a benchmark of images and corresponding web codes (IW-BENCH). Specifically, we propose Element Accuracy, which tests the completeness of the elements by parsing the Document Object Model (DOM) tree. Layout Accuracy is also proposed to analyze the positional relationships of elements by converting the DOM tree into a common subsequence. Besides, we design a five-hop multimodal Chain-of-Thought Prompting for better performance, which contains five hops: 1) SoM prompt injection. 2) Inferring Elements. 3) Inferring Layout. 4) Inferring Web code. 5) Reflection. Our benchmark comprises 1200 pairs of images and web codes with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, offering insights into their performance and areas for improvement in the image-to-web domain.
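As a rough illustration of the Layout Accuracy idea in the IW-Bench abstract above, the sketch below compares two element sequences using a longest common subsequence (LCS). The flattening of each DOM tree into a list of tag names, the function names, and the normalization by reference length are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def layout_score(reference: Sequence[str], generated: Sequence[str]) -> float:
    """Hypothetical score: LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(reference, generated) / len(reference)


# Element sequences as they might look after flattening two DOM trees in
# document order (an assumed preprocessing step, not taken from the paper).
reference = ["header", "nav", "main", "img", "footer"]
generated = ["header", "main", "nav", "img", "footer"]
score = layout_score(reference, generated)
```

Here the generated page swaps two elements, so only four of the five reference elements can be matched in order and the score drops below 1.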
Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.
comment: The paper is incomplete with some errors in qualitative study
SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel methods like repeated sampling are often inefficient and quickly saturate, while sequential methods like SELF-REFINE struggle to improve after a few rounds. Although combining these approaches shows promise, current methods require fine-tuned reward and revision models. This paper proposes Self-Enhanced Test-Time Scaling (SETS), a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques and fully leveraging LLMs' self-improvement abilities. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This facilitates efficient and scalable test-time computation for enhanced performance on complex tasks without any model training. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
comment: Published in Transactions on Machine Learning Research (11/2025)
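The combination of parallel sampling with sequential self-verification and self-correction described in the SETS abstract can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the `sample`, `verify`, and `correct` callables stand in for LLM calls and are hypothetical, and aggregating the refined answers by majority vote is an assumption.

```python
from collections import Counter
from typing import Callable, List


def sets_sketch(
    sample: Callable[[], str],
    verify: Callable[[str], bool],
    correct: Callable[[str], str],
    num_samples: int = 4,
    max_rounds: int = 2,
) -> str:
    """Sample several answers, refine each sequentially, then majority-vote."""
    finals: List[str] = []
    for _ in range(num_samples):          # parallel dimension: independent samples
        answer = sample()
        for _ in range(max_rounds):       # sequential dimension: verify, then correct
            if verify(answer):
                break
            answer = correct(answer)
        finals.append(answer)
    # Assumed aggregation rule: return the most common refined answer.
    return Counter(finals).most_common(1)[0][0]


# Toy stand-ins for the three LLM roles (hypothetical, deterministic).
drafts = iter(["41", "42", "40", "42"])
result = sets_sketch(
    sample=lambda: next(drafts),
    verify=lambda ans: ans == "42",
    correct=lambda ans: str(int(ans) + 1),
)
```

Raising `num_samples` spends more compute in parallel, while raising `max_rounds` spends it sequentially; both knobs are names chosen for this sketch.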
NLP Datasets for Idiom and Figurative Language Tasks
Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets were used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.
comment: 32 pages, 10 figures
Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models AAAI-2026
Large language models (LLMs) increasingly support multilingual understanding and generation. Meanwhile, efforts to interpret their internal mechanisms have emerged, offering insights to enhance multilingual performance. While multi-head self-attention (MHA) has proven critical in many areas, its role in multilingual capabilities remains underexplored. In this work, we study the contribution of MHA in supporting multilingual processing in LLMs. We propose Language Attention Head Importance Scores (LAHIS), an effective and efficient method that identifies attention head importance for multilingual capabilities via a single forward and backward pass through the LLM. Applying LAHIS to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1, we reveal the existence of both language-specific and language-general heads. Language-specific heads enable cross-lingual attention transfer to guide the model toward target language contexts and mitigate the off-target language generation issue, helping to address challenges in multilingual LLMs. We also introduce a lightweight adaptation that learns a soft head mask to modulate attention outputs over language heads, requiring only 20 tunable parameters to improve XQuAD accuracy. Overall, our work enhances both the interpretability and multilingual capabilities of LLMs from the perspective of MHA.
comment: Accepted by AAAI-2026
MemOS: A Memory OS for AI System
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
comment: 36 pages, 10 figures, 5 tables
Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference NDSS
The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.
comment: This paper is accepted by Network and Distributed System Security Symposium (NDSS) 2026
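To make the phrase "reversible matrix-based obfuscation" in the KV-Cloak abstract concrete, the toy sketch below multiplies a small key matrix by a secret invertible matrix and then undoes the transform. It is a generic linear-algebra illustration under stated assumptions, not the KV-Cloak scheme itself: the 2x2 secret matrix, the variable names, and the pure-Python `matmul` helper are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; rows of a times columns of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical secret matrix with determinant 1, so its inverse is exact.
SECRET: Matrix = [[2.0, 1.0], [1.0, 1.0]]
SECRET_INV: Matrix = [[1.0, -1.0], [-1.0, 2.0]]

# Two cached key vectors (one per row); the values are made up.
keys: Matrix = [[1.0, 2.0], [3.0, 4.0]]

cloaked = matmul(keys, SECRET)          # what would sit in the cache
restored = matmul(cloaked, SECRET_INV)  # recovered only with the secret inverse
```

Because the transform is invertible, anyone holding `SECRET_INV` recovers the original keys exactly, while the stored values no longer equal the raw keys.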
Comba: Improving Bilinear RNNs with Closed-loop Control
Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising the recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, structurally resembling bilinear systems. In this paper, we first introduce the concept of Bilinear RNNs with a comprehensive analysis of the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates superior performance and computation efficiency in both language and vision modeling.
Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO NeurIPS'2025
Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective at capturing relative preferences, these methods are widely observed to suppress the absolute likelihoods of example responses. As a result, aligned models can deviate from expected patterns, exhibiting reward-hacking effects even without an explicit reward model. This fundamental limitation of contrastive alignment, which we term likelihood underdetermination, motivates us to revisit direct preference optimization (DPO) -- the seminal direct alignment method. Interestingly, we show that the DPO loss admits a principled decomposition. The reformulated loss not only extends naturally to a broader range of feedback types, but also unveils the root cause of likelihood underdetermination. Specifically, we identify that standard DPO implicitly oversimplifies a regularizer in the reformulated loss; restoring this full term effectively resolves the underdetermination. Building on these insights, we introduce PRoximalized PReference Optimization (PRO), a unified alignment method that accommodates diverse feedback types while eliminating likelihood underdetermination through an efficient approximation of the full regularizer. Empirical evaluations demonstrate the consistent superiority of PRO over existing methods across pairwise, binary and scalar feedback.
comment: NeurIPS'2025
RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems
Large language models in healthcare often miss critical emotional cues, delivering medically sound but emotionally flat advice. Such responses are insufficient in clinical encounters, where distressed or vulnerable patients rely on empathic communication to support safety, adherence, and trust. We present RECAP (Reflect-Extract-Calibrate-Align-Produce), an inference-time framework that guides models through structured emotional reasoning without retraining. RECAP decomposes patient input into appraisal-theoretic stages, identifies psychological factors, and assigns Likert-based emotion likelihoods that clinicians can inspect or override, producing nuanced and auditable responses. Across EmoBench, SECEU, and EQ-Bench, RECAP improves emotional reasoning by 22-28% on 8B models and 10-13% on larger models over zero-shot baselines. In blinded evaluations, oncology clinicians rated RECAP's responses as more empathetic, supportive, and context-appropriate than prompting baselines. These findings demonstrate that modular, principled prompting can enhance emotional intelligence in medical AI while maintaining transparency and accountability for clinical deployment.
Which Type of Students can LLMs Act? Investigating Authentic Simulation with Graph-based Human-AI Collaborative System
While rapid advances in large language models (LLMs) are reshaping data-driven intelligent education, accurately simulating students remains an important but challenging bottleneck for scalable educational data collection, evaluation, and intervention design. However, current works are limited by scarce real interaction data, costly expert evaluation for realism, and a lack of large-scale, systematic analyses of LLMs' ability to simulate students. We address this gap by presenting a three-stage LLM-human collaborative pipeline to automatically generate and filter high-quality student agents. We leverage a two-round automated scoring validated by human experts and deploy a score propagation module to obtain more consistent scores across the student similarity graph. Experiments show that combining automated scoring, expert calibration, and graph-based propagation yields simulated students that more closely track human judgments of authenticity. We then analyze which profiles and behaviors are simulated more faithfully, supporting subsequent studies on personalized learning and educational assessment.
comment: This work has been submitted to AI Open for possible publication
Scaling Multimodal Search and Recommendation with Small Language Models via Upside-Down Reinforcement Learning ICDM 2025
In this work, we investigate how small language models (SLMs) can be scaled to support multimodal search and recommendation use cases while remaining efficient enough for real-time, resource-constrained deployments. We present a framework that combines upside-down reinforcement learning with synthetic data distillation from a large language model (Llama-3) to train a 100M-parameter GPT-2 model for multitask prompt generation. Despite being up to 80 times smaller than state-of-the-art large language models (LLMs), our SLM achieves relevance and diversity scores within 6% of competitive baselines such as Llama-3 8B, Qwen3 8B, and Ministral 8B. These results demonstrate that SLMs can effectively handle multimodal search and recommendation tasks, while dramatically reducing inference latency and memory overhead. Our study highlights the potential of lightweight models as practical engines for scalable multimodal discovery, bridging the gap between cutting-edge research and real-world multimodal applications such as media recommendations and creative content generation.
comment: Accepted by ICDM 2025 MMSR
Causal LLM Routing: End-to-End Regret Minimization from Observational Data
LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
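One way to picture a softmax-weighted regret objective like the one named in the routing abstract above is sketched below. This is a simplified illustration, not the paper's estimator: it skips the causal treatment of observational data and assumes a utility estimate is already available for every candidate model. The function names, the temperature parameter, and the example numbers are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over router scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_weighted_regret(
    scores: List[float], utilities: List[float], temperature: float = 1.0
) -> float:
    """Expected regret when models are picked with softmax(scores) weights.

    Regret of model i is (best utility - utility of model i). The utilities
    are assumed given here; obtaining them from logged data is not shown.
    """
    best = max(utilities)
    weights = softmax(scores, temperature)
    return sum(w * (best - u) for w, u in zip(weights, utilities))


# Made-up utilities (e.g. accuracy minus a cost penalty) for three models.
utilities = [0.9, 0.6, 0.3]
uniform_regret = softmax_weighted_regret([0.0, 0.0, 0.0], utilities)
focused_regret = softmax_weighted_regret([5.0, 0.0, 0.0], utilities)
```

The surrogate is smooth in the router scores, and it shrinks toward zero as the softmax mass concentrates on the best model, which is why `focused_regret` comes out smaller than `uniform_regret`.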
Semantic Mastery: Enhancing LLMs with Advanced Natural Language Understanding
Large language models (LLMs) have greatly improved their capability in performing NLP tasks. However, deeper semantic understanding, contextual coherence, and more subtle reasoning are still difficult to obtain. The paper discusses state-of-the-art methodologies that advance LLMs with more advanced NLU techniques, such as semantic parsing, knowledge integration, and contextual reinforcement learning. We analyze the use of structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies that match models with human-level understanding. Furthermore, we address the incorporation of transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods that address problems like hallucinations, ambiguity, and factual inconsistency in complex NLP tasks such as question answering, text summarization, and dialogue generation. Our findings show the importance of semantic precision for enhancing AI-driven language systems and suggest future research directions to bridge the gap between statistical language models and true natural language understanding.
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest, such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel for accelerating low-batch inference of large language models. Across various model sizes and quantization settings, FlashFormer achieves nontrivial speedups compared to existing inference kernels.
Similarity-Distance-Magnitude Activations
We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.
comment: 21 pages, 8 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and Pollara networking. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts over Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE, available at https://huggingface.co/Zyphra/ZAYA1-base) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
Human-Computer Interaction
Affordances of Digital and Blockchain-based Community Currencies: The Case of Sarafu Network in Kenya
Community currencies (CCs) have been adopting innovative systems to overcome implementational hurdles from issuing paper currencies. Using a qualitative approach, this paper examined this digital transition of Sarafu Network in Kenya and its predecessor CCs as a case study. From the original vouchers launched in 2010, the foundation Grassroots Economics introduced a digital interface in 2016 that operates on a feature phone, and then integrated blockchain technology starting in 2018, undergoing several migrations before settling on its current iteration, called Community Asset Vouchers, on the Celo blockchain since 2023. Using affordances from human-computer interaction, the research shows that digitalization and blockchain improved the facilitation of economic activities of the local communities, both their typical market transactions as well as traditional reciprocal labor exchanges, by offering more functionalities compared to the analog version of Sarafu. The unique contributions of blockchain include enabling automation of holding tax calculations and linking the vouchers to the mainstream monetary system via stablecoins facilitated by a series of smart contracts also known as the liquidity pool. The study also finds that there is an inherent trade-off between blockchain benefits and user interface complexity. Hence, balancing innovation and community needs remains a challenge.
comment: 20 pages, 6 figures. Accepted for publication at International Journal of Community Currency Research
When to Say "Hi" - Learn to Open a Conversation with an in-the-wild Dataset
The social capabilities of socially interactive agents (SIA) are a key to successful and smooth interactions between the user and the SIA. A successful start of the interaction is one of the essential factors for satisfying SIA interactions. For a service and information task in which the SIA helps with information, e.g. about the location, it is an important skill to master the opening of the conversation and to recognize which interlocutor opens the conversation and when. We are therefore investigating the extent to which the opening of the conversation can be trained using the user's body language as an input for machine learning to ensure smooth conversation starts for the interaction. In this paper we propose the Interaction Initiation System (IIS) which we developed, trained and validated using an in-the-wild data set. In a field test at the Deutsches Museum Bonn, a Furhat robot from Furhat Robotics was used as a service and information point. Over the period of use we collected the data of N = 201 single user interactions for the training of the algorithms. We show that the IIS achieves a level of performance that allows it to determine the greeting period and the opener of the interaction.
comment: 6 pages, 3 figures, 5 tables. This paper has been accepted for publication at IEEE ROMAN 2025
HEART-Watch: A multimodal physiological dataset from a Google Pixel Watch across different physical states
Consumer-grade smartwatches offer a new personalized health monitoring option for general consumers globally as cardiovascular diseases continue to prevail as the leading cause of global mortality. The development and validation of reliable cardiovascular monitoring algorithms for these consumer-grade devices requires realistic biosignal data from diverse sets of participants. However, the availability of public consumer-grade smartwatch datasets with synchronized cardiovascular biosignals is limited, and existing datasets do not offer rich demographic diversity in their participant cohorts, leading to potentially biased algorithm development. This paper presents HEART-Watch, a multimodal physiological dataset collected from temporally synchronized wrist-worn Google Pixel Watch 2 electrocardiogram (ECG), photoplethysmography, and accelerometer signals from a diverse cohort of 40 healthy adults across three physical states - sitting, standing and walking with reference chest ECG. Intermittent upper arm blood pressure measurements and concurrent biosignals were collected as an additional biomarker for future research. The motivation, methodology, and initial analyses of results are presented. HEART-Watch is intended to support the development and benchmarking of robust algorithms for cardiovascular analyses on consumer-grade smartwatches across diverse populations.
Classification of User Satisfaction in HRI with Social Signals in the Wild
Socially interactive agents (SIAs) are being used in various scenarios and are nearing productive deployment. Evaluating user satisfaction with SIAs' performance is a key factor in designing the interaction between the user and SIA. Currently, subjective user satisfaction is primarily assessed manually through questionnaires or indirectly via system metrics. This study examines the automatic classification of user satisfaction through analysis of social signals, aiming to enhance both manual and autonomous evaluation methods for SIAs. During a field trial at the Deutsches Museum Bonn, a Furhat Robotics head was employed as a service and information hub, collecting an "in-the-wild" dataset. This dataset comprises 46 single-user interactions, including questionnaire responses and video data. Our method focuses on automatically classifying user satisfaction based on time series classification. We use time series of social signal metrics derived from the body pose, time series of facial expressions, and physical distance. This study compares three feature engineering approaches on different machine learning models. The results confirm the method's effectiveness in reliably identifying interactions with low user satisfaction without the need for manually annotated datasets. This approach offers significant potential for enhancing SIA performance and user experience through automated feedback mechanisms.
comment: 15 pages, 3 figures. This paper has been accepted for publication at ICSR+AI 2025
Is Lying Only Sinful in Islam? Exploring Religious Bias in Multilingual Large Language Models Across Major Religions
While recent developments in large language models have improved bias detection and classification, sensitive subjects like religion still present challenges because even minor errors can result in severe misunderstandings. In particular, multilingual models often misrepresent religions and have difficulties being accurate in religious contexts. To address this, we introduce BRAND: Bilingual Religious Accountable Norm Dataset, which focuses on the four main religions of South Asia: Buddhism, Christianity, Hinduism, and Islam, containing over 2,400 entries, and we used three different types of prompts in both English and Bengali. Our results indicate that models perform better in English than in Bengali and consistently display bias toward Islam, even when answering religion-neutral questions. These findings highlight persistent bias in multilingual models when similar questions are asked in different languages. We further connect our findings to the broader issues in HCI regarding religion and spirituality.
comment: 18 pages, 7 figures
Adhera: A Human-Centered Health Informatics Solution for Reducing Informal Caregiver Burden through Improved Medication Adherence
The growing global population of older adults, combined with ongoing healthcare workforce shortages, has increased reliance on informal caregivers, including family members and friends who provide unpaid support to individuals with chronic illnesses. Among their daily responsibilities, medication management remains one of the most demanding and error-prone tasks. Non-adherence to prescribed regimens not only undermines patient outcomes but also intensifies caregiver stress, anxiety, and fatigue. Although digital health technologies have proliferated to address adherence, most solutions focus exclusively on patients and neglect the informational and emotional needs of caregivers. This paper introduces Adhera, a caregiver-inclusive health informatics system designed to support medication adherence while reducing caregiver burden. Using a mixed-methods research design that included fifteen semi-structured caregiver interviews, sixty-five survey responses, and five pharmacist consultations, this study identified three primary challenges: caregiver stress related to uncertainty about medication intake, fragmented communication with healthcare professionals, and distrust in existing digital tools. Informed by the CeHRes Roadmap 2.0 and the Triple Bottom Line by Design and Culture (TBLD+C) framework, as well as recent co-design studies involving caregivers, Adhera integrates a sensor-equipped smart pill organizer with a mobile companion application that records intake events, sends real-time reminders, and provides caregivers with synchronized adherence data. Preliminary evaluation suggests that Adhera enhances visibility, improves caregiver confidence, and streamlines medication routines. This study contributes to the field of health informatics by demonstrating how human-centered design and collaborative frameworks can align technical innovation with empathy-driven care.
Sleep Modulation: The Challenge of Transitioning from Open Loop to Closed Loop
Sleep disorders have emerged as a critical global health issue, highlighting the urgent need for effective and widely accessible intervention technologies. Non-invasive brain stimulation has garnered attention as it enables direct or indirect modulation of neural activity, thereby promoting sleep enhancement in a safe and unobtrusive manner. This class of approaches is collectively referred to as sleep modulation. To date, the majority of sleep modulation research relies on open-loop paradigms with empirically determined parameters, while achieving individual adaptation and modulation accuracy remains a distant objective. The paradigm-specific constraints inherent to open-loop designs represent a major obstacle to clinical translation and large-scale deployment in home environments. In this paper, we delineate fundamental paradigms of sleep modulation, critically examine the intrinsic limitations of open-loop approaches, and formally conceptualize sleep closed-loop modulation. We further provide a comprehensive synthesis of prior studies involving five commonly employed modulation techniques, evaluating their potential integration within a closed-loop framework. Finally, we identify three primary challenges in constructing an effective sleep closed-loop modulation system: sensor solution selection, monitoring model design, and modulation strategy design, while also proposing potential solutions. Collectively, this work aims to advance the paradigm shift of sleep modulation from open-loop toward closed-loop systems.
Head, posture, and full-body gestures in interactive communication
When face-to-face communication becomes effortful due to background noise or interfering talkers, the role of visual cues becomes increasingly important for communication success. While previous research has selectively examined head or hand movements, here we explore movements of the whole body in acoustically adverse conditions. We hypothesized that increasing background noise in conversations would lead to increased gesture frequency in hand, head, trunk, and leg movements typical of conversation. Increased use of hand movements should support the speaker's role, while increased head and trunk movements may help the listener. We conducted a free dyadic conversation experiment with normal-hearing participants (n=8) in a virtual acoustic environment. Conversational movements were described with a newly developed labeling system for typical conversational actions, and the frequency of individual types was analyzed. In addition, we analyzed gesture quality by assessing hand-speech synchrony, with the hypothesis that higher levels of background noise would lead to a loss of synchrony according to an interactive coupling model. Higher noise levels led to increased hand-gesture complexity during speaking and listening, more pronounced up-down head movements, and contrary to expectations, head movements during listening generally decreased relative to speaking. Synchrony and peak velocity were unaffected by noise, while gesture quality scaled only modestly. The results support previous findings regarding gesturing frequency, but we found only limited evidence for changes in speech-gesture synchrony. This work reveals communication patterns of the whole body and illustrates multimodal adaptation to communication demands.
comment: 7 figures, 10 tables, 30 pages
Synthetic Cognitive Walkthrough: Aligning Large Language Model Performance with Human Cognitive Walkthrough
Conducting usability testing like cognitive walkthrough (CW) can be costly. Recent developments in large language models (LLMs), with visual reasoning and UI navigation capabilities, present opportunities to automate CW. We explored whether LLMs (GPT-4 and Gemini-2.5-pro) can simulate human behavior in CW by comparing their walkthroughs with human participants. While LLMs could navigate interfaces and provide reasonable rationales, their behavior differed from humans. LLM-prompted CW achieved higher task completion rates than humans and followed more optimal navigation paths, while identifying fewer potential failure points. However, follow-up studies demonstrated that with additional prompting, LLMs can predict human-identified failure points, aligning their performance with human participants. Our work highlights that while LLMs may not replicate human behaviors exactly, they can be leveraged for scaling usability walkthroughs and providing UI insights, offering a valuable complement to traditional usability testing.
Left shifting analysis of Human-Autonomous Team interactions to analyse risks of autonomy in high-stakes AI systems
Developing high-stakes autonomous systems that include Artificial Intelligence (AI) components is complex; the consequences of errors can be catastrophic, yet it is challenging to plan for all operational cases. In stressful scenarios for the human operator, such as short decision-making timescales, the risk of failures is exacerbated. A lack of understanding of AI failure modes obstructs this and so blocks the robust implementation of applications of AI in smart systems. This prevents early risk identification, leading to increased time, risk and cost of projects. A key tenet of Systems Engineering and acquisition engineering is centred around a "left-shift" in test and evaluation activities to earlier in the system lifecycle, to allow for "accelerated delivery of [systems] that work". We argue it is therefore essential that this shift includes the analysis of AI failure cases as part of the design stages of the system life cycle. Our proposed framework enables the early characterisation of risks emerging from human-autonomy teaming (HAT) in operational contexts. The cornerstone of this is a new analysis of AI failure modes, built on the seminal modelling of human-autonomy teams laid out by LaMonica et al., 2022. Using the analysis of the interactions between human and autonomous systems and exploring the failure modes within each aspect, our approach provides a way to systematically identify human-AI interaction risks across the operational domain of the system of interest. The understanding of the emergent behaviour enables increased robustness of the system, for which the analysis should be undertaken over the whole scope of its operational design domain. This approach is illustrated through an example use case for an AI assistant supporting a Command & Control (C2) System.
comment: Published in: IfSE Annual Systems Engineering Conference Proceedings 2025
EMINDS: Understanding User Behavior Progression for Mental Health Exploration on Social Media
Mental health is an urgent societal issue, and social scientists are increasingly turning to online mental health communities (OMHCs) to analyze user behavior data for early intervention. However, existing sequence mining techniques fall short of the urgent need to explore the behavior progression of different groups (e.g., recovery or deterioration groups) and track the potential long-term impact of behaviors on mental health status. To address this issue, we introduce EMINDS, a visual analytics system built on a novel automatic mining pipeline that extracts distinct behavior stages and assesses the potential impact of frequent stage patterns on mental health status over time. The system includes a set of interactive visualizations that summarize the meaning of each behavior stage and the evolution of different stage patterns. We feature a pattern-centric Sankey diagram to reveal contextual information about the impact of stage patterns on mental health, helping experts understand the specific changes in sequences before and after a stage pattern. We evaluated the effectiveness and usability of EMINDS through two case studies and expert interviews, which examined the potential stage patterns impacting long-term mental health by analyzing user behaviors on Reddit.
CellScout: Visual Analytics for Mining Biomarkers in Cell State Discovery
Cell state discovery is crucial for understanding biological systems and enhancing medical outcomes. A key aspect of this process is identifying distinct biomarkers that define specific cell states. However, difficulties arise from the co-discovery process of cell states and biomarkers: biologists often use dimensionality reduction to visualize cells in a two-dimensional space. Then they usually interpret visually clustered cells as distinct states, from which they seek to identify unique biomarkers. However, this assumption is often invalid due to internal inconsistencies in a cluster, making the process trial-and-error and highly uncertain. Therefore, biologists urgently need effective tools to help uncover the hidden association relationships between different cell populations and their potential biomarkers. To address this problem, we first designed a machine-learning algorithm based on the Mixture-of-Experts (MoE) technique to identify meaningful associations between cell populations and biomarkers. We further developed a visual analytics system, CellScout, in collaboration with biologists, to help them explore and refine these association relationships to advance cell state discovery. We validated our system through expert interviews, from which we further selected a representative case to demonstrate its effectiveness in discovering new cell states.
YOLOA: Real-Time Affordance Detection via LLM Adapter
Affordance detection aims to jointly address the fundamental "what-where-how" challenge in embodied AI by understanding "what" an object is, "where" the object is located, and "how" it can be used. However, most affordance learning methods focus solely on "how" objects can be used while neglecting the "what" and "where" aspects. Other affordance detection methods treat object detection and affordance learning as two independent tasks, lacking effective interaction and real-time capability. To overcome these limitations, we introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles these two tasks via a large language model (LLM) adapter. Specifically, YOLOA employs a lightweight detector consisting of object detection and affordance learning branches refined through the LLM Adapter. During training, the LLM Adapter interacts with object and affordance preliminary predictions to refine both branches by generating more accurate class priors, box offsets, and affordance gates. Experiments on our relabeled ADG-Det and IIT-Heat benchmarks demonstrate that YOLOA achieves state-of-the-art accuracy (52.8 / 73.1 mAP on ADG-Det / IIT-Heat) while maintaining real-time performance (up to 89.77 FPS, and up to 846.24 FPS for the lightweight variant). This indicates that YOLOA achieves an excellent trade-off between accuracy and efficiency.
comment: 13 pages, 9 figures, conference
Why Some Seek AI, Others Seek Therapists: Mental Health in the Age of Generative AI
As generative artificial intelligence (GAI) enters the mental health landscape, questions arise about how individuals weigh AI tools against human therapists. Drawing on the Health Belief Model (HBM), this study examined belief-based predictors of intention to use GAI and therapists across two populations: a university sample (N = 1,155) and a nationally representative adult sample (N = 651). Using repeated-measures ANOVA and LASSO regression, we found that therapists were consistently valued for emotional, relational, and personalization benefits, while GAI was favored for accessibility and affordability. Yet structural advantages alone did not predict adoption; emotional benefit and personalization emerged as decisive factors. Adoption patterns diverged across groups: students treated GAI as a complement, whereas national adults approached it as a substitute. Concerns about privacy and reliability constrained GAI use in both groups. These findings extend HBM to multi-modality contexts and highlight design implications for trustworthy, emotionally resonant digital mental health tools.
Teacher, But Also Student: Challenges and Tech Needs of Adult Braille Learners with Sight
Braille literacy is critical for blind individuals' independence and quality of life, yet literacy rates continue to decline. Though braille instructors in integrated K-12 classrooms play a central role in literacy development in blind youth, prior research on braille learning almost exclusively focuses on blind adolescent students. As a result, we still know little about how sighted adult teachers learn braille. To address this, we interviewed 14 educators, including 13 certificated Teachers of Students with Visual Impairments (TVIs) and 1 paraeducator, who learned braille as adults. We found that they: (1) lack consistent braille exposure to reinforce knowledge and skill; (2) have limited time to practice due to myriad responsibilities of adulthood; and thus, (3) seek learning tools that are engaging and efficient. Our research draws attention to the needs of a group of braille learners who have been overlooked and identifies new design opportunities to facilitate braille literacy.
Human-controllable AI: Meaningful Human Control
Developing human-controllable artificial intelligence (AI) and achieving meaningful human control (MHC) has become a vital principle to address these challenges, ensuring ethical alignment and effective governance in AI. MHC is also a critical focus in human-centered AI (HCAI) research and application. This chapter systematically examines MHC in AI, articulating its foundational principles and future trajectory. MHC is not simply the right to operate, but the unity of human understanding, intervention, and the traceability of responsibility in AI decision-making, which requires technological design, AI governance, and humans to play a role together. MHC ensures AI autonomy serves humans without constraining technological progress. The mode of human control needs to match the levels of technology, and human supervision should balance the trust and doubt of AI. For future AI systems, MHC mandates human controllability as a prerequisite, requiring: (1) technical architectures with embedded mechanisms for human control; (2) human-AI interactions optimized for better access to human understanding; and (3) the evolution of AI systems harmonizing intelligence and human controllability. Governance must prioritize HCAI strategies: policies balancing innovation and risk mitigation, human-centered participatory frameworks transcending technical elite dominance, and global promotion of MHC as a universal governance paradigm to safeguard HCAI development. Looking ahead, there is a need to strengthen interdisciplinary research on the controllability of AI systems, enhance ethical and legal awareness among stakeholders, move beyond simplistic technology design perspectives, and focus on the knowledge construction, complexity interpretation, and influencing factors surrounding human control. By fostering MHC, the development of human-controllable AI can be further advanced, delivering HCAI systems.
comment: 52 pages
ConsentDiff at Scale: Longitudinal Audits of Web Privacy Policy Changes and UI Frictions
Web privacy is experienced via two public artifacts: site utterances in policy texts, and the actions users are required to take during consent interfaces. In the extensive cross-section audits we've studied, there is a lack of longitudinal data detailing how these artifacts are changing together, and if interfaces are actually doing what they promise in policy. ConsentDiff provides that longitudinal view. We build a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by pulling together DOM signals with cues provided by screenshots. We introduce a novel weighted claim-UI alignment score, connecting common policy claims to observable predicates, and enabling comparisons over time, regions, and verticals. Our measurements suggest continued policy churn, systematic changes to eliminate a higher-friction banner design, and significantly higher alignment where rejecting is visible and lower friction.
Mapping Data Labour Supply Chain in Africa in an Era of Digital Apartheid: a Struggle for Recognition
Content moderation and data labelling work has shifted to the Global South, particularly Africa, where workers operate under precarious conditions while remaining invisible to users. This study addresses the gap in understanding the scope of this industry and the working conditions of the African content moderation workforce through a participatory approach. We collaborated with a union of content moderators to conduct desk research, deploy a questionnaire (n=81), and gather ethnographic observations across nine months that could answer their social needs. Our findings show that content moderation operations span 43 out of 55 African countries, involving 17 major firms serving predominantly North-American and European clients, with workers facing insecurity and inadequate psychological support. We contribute the first comprehensive map of Africa's content moderation industry, demonstrate a participatory methodology that centers workers' collective actions in documenting their conditions, and apply Honneth's "struggle for recognition" framework to understand data workers' demands for professional acknowledgement.
comment: 32 pages, 9 figures, 1 table
Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage
Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.
comment: 7 pages. Published in Proceedings of the 2025 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). DOI: 10.1109/VL-HCC65237.2025.00024
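The contrast reported above between 84% exact agreement and a Cohen's Kappa of 0.50 follows from how kappa discounts agreement expected by chance. A minimal sketch of the standard unweighted formula, kappa = (p_o - p_e) / (1 - p_e), is given below. The function name, the two label lists, and the 0/1 coding ("issue absent" / "issue present") are illustrative assumptions and are not taken from the paper's pipeline or data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)

    # p_o: fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for everything; kappa is
        # undefined here, and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical evaluation runs over ten heuristic checks (1 = issue flagged).
run_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
run_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]

if __name__ == "__main__":
    print(f"kappa = {cohens_kappa(run_a, run_b):.3f}")
```

In this made-up example the two runs agree on 8 of 10 items, yet kappa is only about 0.375, because both runs flag an issue most of the time and chance agreement is therefore already 0.68. The same effect can make a high raw agreement coexist with a moderate kappa.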
On the Role and Impact of GenAI Tools in Software Engineering Education
Context. The rise of generative AI (GenAI) tools like ChatGPT and GitHub Copilot has transformed how software is learned and written. In software engineering (SE) education, these tools offer new opportunities for support, but also raise concerns about over-reliance, ethical use, and impacts on learning. Objective. This study investigates how undergraduate SE students use GenAI tools, focusing on the benefits, challenges, ethical concerns, and instructional expectations that shape their experiences. Method. We conducted a survey with 130 undergraduate students from two universities. The survey combined structured Likert-scale items and open-ended questions to investigate five dimensions: usage context, perceived benefits, challenges, ethical and instructional perceptions. Results. Students most often use GenAI for incremental learning and advanced implementation, reporting benefits such as brainstorming support and confidence-building. At the same time, they face challenges including unclear rationales and difficulty adapting outputs. Students highlight ethical concerns around fairness and misconduct, and call for clearer instructional guidance. Conclusion. GenAI is reshaping SE education in nuanced ways. Our findings underscore the need for scaffolding, ethical policies, and adaptive instructional strategies to ensure that GenAI supports equitable and effective learning.
comment: Accepted at IEEE/ACM ICSE Software Engineering Education and Training (ICSE SEET 2026)
Young children's anthropomorphism of an AI chatbot: Brain activation and the role of parent co-presence
Artificial Intelligence (AI) chatbots powered by a large language model (LLM) are entering young children's learning and play, yet little is known about how young children construe these agents or how such construals relate to engagement. We examined anthropomorphism of a social AI chatbot during collaborative storytelling and asked how children's attributions related to their behavior and prefrontal activation. Children at ages 5-6 (N = 23) completed three storytelling sessions: interacting with (1) an AI chatbot only, (2) a parent only, and (3) the AI and a parent together. After the sessions, children completed an interview assessing anthropomorphism toward both the AI chatbot and the parent. Behavioral engagement was indexed by the conversational turn count (CTC) ratio, and concurrent fNIRS measured oxygenated hemoglobin in bilateral vmPFC and dmPFC regions. Children reported higher anthropomorphism for parents than for the AI chatbot overall, although AI ratings were relatively high for perceptive abilities and epistemic states. Anthropomorphism was not associated with CTC. In the right dmPFC, higher perceptive scores were associated with greater activation during the AI-only condition and with lower activation during the AI+Parent condition. Exploratory analyses indicated that higher dmPFC activation during the AI-only condition correlated with higher end-of-session "scared" mood ratings. Findings suggest that stronger perceptive anthropomorphism can be associated with greater brain activation related to interpreting the AI's mental states, whereas parent co-presence may help some children interpret and regulate novel AI interactions. These results may have design implications for encouraging parent-AI co-use in early childhood.
comment: I updated the title
Unintentional Consequences: Generative AI Use for Cybercrime
The democratization of generative AI introduces new forms of human-AI interaction and raises urgent safety, ethical, and cybersecurity concerns. We develop a socio-technical explanation for how generative AI enables and scales cybercrime. Drawing on affordance theory and technological amplification, we argue that generative AI systems create new action possibilities for cybercriminals and magnify pre-existing malicious intent by lowering expertise barriers and increasing attack efficiency. To illustrate this framework, we conduct interrupted time series analyses of two large datasets: (1) 464,190,074 malicious IP address reports from AbuseIPDB, and (2) 281,115 cryptocurrency scam reports from Chainabuse. Using November 30, 2022, as a high-salience public-access shock, we estimate the counterfactual trajectory of reported cyber abuse absent the release, providing an early-warning impact assessment of a general-purpose AI technology. Across both datasets, we observe statistically significant post-intervention increases in reported malicious activity, including an immediate increase of over 1.12 million weekly malicious IP reports and about 722 weekly cryptocurrency scam reports, with sustained growth in the latter. We discuss implications for AI governance, platform-level regulation, and cyber resilience, emphasizing the need for multi-layer socio-technical strategies that help key stakeholders maximize AI's benefits while mitigating its growing cybercrime risks.
GoldMind: A Teacher-Centered Knowledge Management System for Higher Education -- Lessons from Iterative Design
Designing Knowledge Management Systems (KMSs) for higher education requires addressing complex human-technology interactions, especially where staff turnover and changing roles create ongoing challenges for reusing knowledge. While advances in process mining and Generative AI enable new ways of designing features to support knowledge management, existing KMSs often overlook the realities of educators' workflows, leading to low adoption and limited impact. This paper presents findings from a two-year human-centred design study with 108 higher education teachers, focused on the iterative co-design and evaluation of GoldMind, a KMS supporting in-the-flow knowledge management during digital teaching tasks. Through three design-evaluation cycles, we examined how teachers interacted with the system and how their feedback informed successive refinements. Insights are synthesised across three themes: (1) Technology Lessons from user interaction data, (2) Design Considerations shaped by co-design and usability testing, and (3) Human Factors, including cognitive load and knowledge behaviours, analysed using Epistemic Network Analysis.
comment: 38 pages, 10 tables, 7 figures. Submitted to Behaviour & Information Technology
EgoLog: Ego-Centric Fine-Grained Daily Log with Ubiquitous Wearables
Despite advances in human activity recognition (HAR) with different modalities, a precise, robust, and accurate daily log system is not yet available. Current solutions primarily rely on controlled, lab-based data collection, which limits their real-world applicability. The challenges towards a fine-grained daily log are 1) contextual awareness, 2) spatial awareness, and 3) effective fusion of multi-modal sensor data. To solve them, we propose EgoLog, which integrates effective audio-IMU fusion for daily log with ubiquitous wearables. Our approach first fuses audio and IMU data from two perspectives: temporal understanding and spatial understanding. We extract scenario-level features and aggregate them in the time dimension, while using motion compensation to enhance the performance of sound source localization. The knowledge obtained from these steps is then integrated into a multi-modal HAR framework. Here, the scenario provides prior knowledge, and the spatial location helps differentiate the user from the background. Furthermore, we integrate an LLM to enhance scenario recognition through logical reasoning. The knowledge derived from the LLM is subsequently transferred back to the local device to enable efficient, on-device inference. Evaluated on both public and self-collected datasets, EgoLog achieves effective multimodal fusion for both activity and scenario recognition, outperforming the baseline by 12% and 15%, respectively.
comment: Submitted to SenSys 2026
Privacy is All You Need: Revolutionizing Wearable Health Data with Advanced PETs
In a world where data is the new currency, wearable health devices offer unprecedented insights into daily life, continuously monitoring vital signs and metrics. However, this convenience raises privacy concerns, as these devices collect sensitive data that can be misused or breached. Traditional measures often fail due to real-time data processing needs and limited device power. Users also lack awareness and control over data sharing and usage. We propose a Privacy-Enhancing Technology (PET) framework for wearable devices, integrating federated learning, lightweight cryptographic methods, and selectively deployed blockchain technology. The blockchain acts as a secure ledger triggered only upon data transfer requests, granting users real-time notifications and control. By dismantling data monopolies, this approach returns data sovereignty to individuals. Through real-world applications like secure medical data sharing, privacy-preserving fitness tracking, and continuous health monitoring, our framework reduces privacy risks by up to 70 percent while preserving data utility and performance. This innovation sets a new benchmark for wearable privacy and can scale to broader IoT ecosystems, including smart homes and industry. As data continues to shape our digital landscape, our research underscores the critical need to maintain privacy and user control at the forefront of technological progress.
Computer Vision and Pattern Recognition
MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.
comment: Code and demo available at https://magicquill.art/v2/
CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models
Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.
comment: Project page: https://cvlab-kaist.github.io/CAMEO/
OneThinker: All-in-one Reasoning Model for Image and Video
Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, models, and data are released.
comment: Project page: https://github.com/tulerfeng/OneThinker
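The OneThinker abstract describes EMA-GRPO only as tracking task-wise moving averages of reward standard deviations. The sketch below shows one way such a scheme could look; it is not the released implementation. The class name, the decay value, the update rule, and the way the running scale enters the advantage are all assumptions made for illustration.

```python
import numpy as np

class TaskwiseEmaStd:
    """Hypothetical helper: one running reward-std estimate per task."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay   # assumed smoothing factor
        self.eps = eps       # guards against division by zero
        self.ema_std = {}    # task name -> running std estimate

    def advantages(self, task, rewards):
        """Center a group of rewards and scale by the task's running std."""
        rewards = np.asarray(rewards, dtype=float)
        batch_std = rewards.std()
        prev = self.ema_std.get(task)
        # First batch for a task initializes the estimate; later ones blend in.
        current = batch_std if prev is None else (
            self.decay * prev + (1.0 - self.decay) * batch_std
        )
        self.ema_std[task] = current
        return (rewards - rewards.mean()) / (current + self.eps)

normalizer = TaskwiseEmaStd(decay=0.5)
qa_adv = normalizer.advantages("qa", [0.0, 1.0, 0.0, 1.0])   # binary rewards
seg_adv = normalizer.advantages("seg", [0.0, 0.2])            # dense rewards
qa_adv_2 = normalizer.advantages("qa", [0.0, 0.0, 0.0, 0.0])  # degenerate group
```

Under these assumptions, a group whose rewards are all identical still gets a finite scale from earlier batches of the same task, and tasks with different reward ranges are normalized separately.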
PPTArena: A Benchmark for Agentic PowerPoint Editing
We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
comment: 25 pages, 26 figures
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
comment: Project Page: https://qinghew.github.io/MultiShotMaster
Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
comment: Project page at https://xizaoqu.github.io/video4spatial/
ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation
We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
comment: Our project website is https://carlyx.github.io/MAViD/
SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control
Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20
comment: 14 pages, 9 figures
Unrolled Networks are Conditional Probability Flows in MRI Reconstruction
Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolution caused by freely learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.
AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry
The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth-first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at https://github.com/AutodeskAILab/AutoBrep.
comment: Accepted to Siggraph Asia 2025
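The AutoBrep abstract states that its token sequence follows a breadth-first traversal of the face adjacency graph. The sketch below shows only that ordering step on a toy solid; it is not taken from the AutoBrep code. The function name, the dictionary-based graph, the choice of start face, and the tie-breaking by sorted face index are all assumptions.

```python
from collections import deque

def bfs_face_order(adjacency, start):
    """Return face indices in breadth-first order from `start`.

    `adjacency` maps each face index to the faces that share an edge with it.
    Neighbors are visited in sorted order (an assumed tie-breaking rule).
    """
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        face = queue.popleft()
        order.append(face)
        for neighbor in sorted(adjacency[face]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

# Toy cube: face 0 is the top, 5 the bottom, 1-4 the sides in cyclic order.
cube = {
    0: [1, 2, 3, 4],
    1: [0, 2, 4, 5],
    2: [0, 1, 3, 5],
    3: [0, 2, 4, 5],
    4: [0, 1, 3, 5],
    5: [1, 2, 3, 4],
}
order = bfs_face_order(cube, start=0)
```

With this ordering, every face after the first is adjacent to at least one face that appears earlier in the sequence, which is what allows neighboring faces to be generated progressively.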
Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks NeurIPS 2025
When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.
comment: NeurIPS 2025
In-Context Sync-LoRA for Portrait Video Editing
Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.
comment: Project page: https://sagipolaczek.github.io/Sync-LoRA/
SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting
LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capture is prone to missing small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR captures and identify LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, from which we can then grow additional points to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.
comment: Project page: https://lfranke.github.io/surffill
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.
comment: Project Page: https://www.neural4d.com/research-page/textrix
GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection
Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6\% AP$_{25}$ and 51.2\% AP$_{50}$) and ScanNetV2 (75.1\% AP$_{25}$ and 60.8\% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
comment: Preprint; 19 pages, 7 figures, 8 tables
InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration AAAI 2026
Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
comment: Published in AAAI 2026
Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities
While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.
BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection AAAI26
Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
comment: Accepted by AAAI26
A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems
In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://github.com/YuhanChen2024/UltraFast-LiNET
Layout Anything: One Transformer for Universal Room Layout Estimation WACV 2026
We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114 ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on LSUN, PE of 7.04% (CE 5.17%) on Hedau, and PE of 4.03% (CE 3.15%) on Matterport3D-Layout. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.
comment: Published at WACV 2026
Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at https://github.com/hao-ai-lab/VideoScience.
LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization
Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using an object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.
EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis
Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.
DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation
In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.
comment: 23 pages
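The DiverseAR abstract attributes sampling collapse to overly sharp logits on each bit and counters it by scaling the logits distribution. A minimal sketch of the underlying effect, assuming a plain fixed temperature rather than the paper's adaptive mechanism (whose exact rule the abstract does not give): dividing a bit's logit by a temperature above 1 pulls its probability toward 0.5 and raises its entropy. All names and values here are illustrative.

```python
import math

def bit_probability(logit, temperature=1.0):
    """Probability that a bit equals 1 after dividing the logit by a temperature."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

sharp_logit = 6.0  # hypothetical, very confident bit prediction
p_sharp = bit_probability(sharp_logit, temperature=1.0)
p_smooth = bit_probability(sharp_logit, temperature=3.0)
```

With the higher temperature the same bit is sampled as 0 more often, which is one way diversity can increase; an adaptive scheme would choose the temperature per step or per token instead of fixing it.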
Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation KDD'26
We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1\%$, which is a $3.7\%$ gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by $24\%$ under higher precipitation, by $22\%$ on higher-speed roads such as motorways, and by $29\%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
comment: 17 pages. To appear in KDD'26 Datasets
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
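The multi-resolution semantic fusion described above can be pictured as combining crop-to-query similarity maps computed on grids of different sizes. The sketch below is one simple reading of that step, assuming nearest-neighbour resizing to the finest grid followed by a weighted average; the abstract does not specify the actual fusion rule, so the function names, weights, and toy values are assumptions.

```python
def upsample_nearest(grid, out_h, out_w):
    """Resize a 2D grid to (out_h, out_w) by nearest-neighbour lookup."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def fuse_similarity_maps(maps, weights=None):
    """Weighted average of similarity maps after resizing to the finest grid."""
    out_h = max(len(m) for m in maps)
    out_w = max(len(m[0]) for m in maps)
    weights = weights or [1.0 / len(maps)] * len(maps)
    resized = [upsample_nearest(m, out_h, out_w) for m in maps]
    return [[sum(w * r[i][j] for w, r in zip(weights, resized))
             for j in range(out_w)] for i in range(out_h)]

coarse = [[0.8]]                  # 1x1 grid: the whole image matches the query
fine = [[0.1, 0.9], [0.2, 0.4]]   # 2x2 grid of crops
fused = fuse_similarity_maps([coarse, fine])
```

A large object split across the fine crops still receives support from the coarse map, which is the intuition behind fusing resolutions.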
Glance: Accelerating Diffusion Models with 1 Sample
Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5x acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 sample on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.
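To make the "accelerate smartly, not evenly" idea concrete, here is a small sketch of a phase-aware step schedule: early denoising steps are subsampled with a small stride and later steps with a large one. The boundary, strides, and step count are hypothetical numbers chosen for illustration, and the split into two index lists only mirrors the slow/fast expert idea in spirit; it is not the authors' schedule.

```python
def phase_aware_schedule(num_steps, boundary, slow_stride, fast_stride):
    """Pick which of the base model's steps to keep (index 0 = first denoising step).

    Steps before `boundary` are subsampled with `slow_stride` (small speedup);
    steps from `boundary` onward with `fast_stride` (large speedup).
    """
    slow_phase = list(range(0, boundary, slow_stride))
    fast_phase = list(range(boundary, num_steps, fast_stride))
    return slow_phase, fast_phase

# Hypothetical configuration: 50 base steps, first 10 treated as the slow phase.
slow, fast = phase_aware_schedule(num_steps=50, boundary=10, slow_stride=2, fast_stride=8)
speedup = 50 / (len(slow) + len(fast))
```

In this toy setting the slow phase keeps every second step while the fast phase keeps every eighth, so most of the saving comes from the later phase.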
Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models
This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.
comment: 13 pages, 5 figures, 2 tables; under review
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing both the foundational capabilities and the generalization ability of MLLMs. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. MindGPT-4ov also demonstrates a superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community's development of MLLMs.
comment: 33 pages, 14 figures
Taming Camera-Controlled Video Generation with Verifiable Geometry Reward
Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
comment: 11 pages, 4 figures, 7 tables
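The geometry reward in the abstract above splits camera trajectories into short segments, computes a relative pose per segment, and scores each generated segment against its reference counterpart. The sketch below illustrates that structure under strong simplifications: poses are planar (x, y, yaw) instead of full 3D camera poses, and the alignment score is an assumed exponential of the pose error. Function names, the error weighting, and the toy trajectories are hypothetical.

```python
import math

def relative_pose(a, b):
    """Pose of b expressed in the frame of a; poses are (x, y, yaw) tuples."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    c, s = math.cos(-a[2]), math.sin(-a[2])
    dyaw = math.atan2(math.sin(b[2] - a[2]), math.cos(b[2] - a[2]))  # wrapped angle
    return (c * dx - s * dy, s * dx + c * dy, dyaw)

def segment_rewards(generated, reference, segment_len, rot_weight=1.0):
    """One reward in (0, 1] per segment; 1.0 means the relative poses match."""
    rewards = []
    n = min(len(generated), len(reference))
    for start in range(0, n - segment_len, segment_len):
        end = start + segment_len
        g = relative_pose(generated[start], generated[end])
        r = relative_pose(reference[start], reference[end])
        trans_err = math.hypot(g[0] - r[0], g[1] - r[1])
        rot_err = abs(math.atan2(math.sin(g[2] - r[2]), math.cos(g[2] - r[2])))
        rewards.append(math.exp(-(trans_err + rot_weight * rot_err)))
    return rewards

# Reference moves straight along x; the generated camera drifts in the second half.
reference = [(float(i), 0.0, 0.0) for i in range(5)]
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.5, 0.0), (4.0, 1.0, 0.0)]
rewards = segment_rewards(generated, reference, segment_len=2)
```

Because each segment is scored separately, the first segment still earns full reward while only the drifting segment is penalized, which is how segment-level feedback reduces reward sparsity. Using relative poses also makes the score independent of the global coordinate frame of either trajectory.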
MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration
Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at https://github.com/ricoleehduu/STS-Challenge-2025 to ensure reproducibility.
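For readers unfamiliar with the segmentation metric quoted above, the Dice score between a predicted mask P and a ground-truth mask G is 2|P ∩ G| / (|P| + |G|). The following is a minimal sketch on flat binary masks; the handling of two empty masks is a convention assumed here, and the challenge's official evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, target)  # 2 shared voxels, 3 + 3 foreground voxels
```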
RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association ICASSP
The Face-Voice Association in Multilingual Environments (FAME) 2026 challenge investigates the face-voice association task in multilingual scenarios. The challenge introduces English-German face-voice pairs for the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association by focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge, achieving an EER of 33.1.
comment: Ranked 3rd in FAME 2026 Challenge, ICASSP
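The result above is reported as an equal error rate (EER), the operating point where the false-accept rate equals the false-reject rate. A minimal sketch of how an EER can be approximated from verification scores follows; it sweeps every observed score as a threshold and is an illustrative approximation, not the challenge's official scoring script. The scores and labels are made up.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    scores: similarity scores (higher means "same identity").
    labels: 1 for matching face-voice pairs, 0 for non-matching pairs.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one pair of each class")
    best_gap, eer = None, None
    for thr in sorted(set(scores)):
        far = sum(s >= thr for s in negatives) / len(negatives)  # false accepts
        frr = sum(s < thr for s in positives) / len(positives)   # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
eer = equal_error_rate(scores, labels)
```

Lower is better: an EER of 0 means some threshold separates matching from non-matching pairs perfectly.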
Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study
Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.
Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? WACV 2026
Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.
comment: Accepted in WACV 2026 - Applications Track
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. The project page is available at https://clementine24.github.io/ReVSeg/
Defense That Attacks: How Robust Models Become Better Attackers
Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduces a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.
HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval ACM MM 2025
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address these issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The code is available at https://zivchen-ty.github.io/HUD.github.io/.
comment: Accepted by ACM MM 2025
TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking
The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.
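The directional ambiguity described above can be seen on a toy example. An absolute frame difference is identical for an object moving left and one moving right, while splitting the signed difference into positive and negative channels keeps the direction. The sketch below shows only this underlying signal-processing point on a single image row; it is not the MDD module itself, and the function names are illustrative.

```python
def absolute_difference(prev, curr):
    """Unsigned frame difference: records that something changed, not how."""
    return [abs(c - p) for p, c in zip(prev, curr)]

def polarity_fields(prev, curr):
    """Split the signed difference into brightening and darkening channels."""
    diff = [c - p for p, c in zip(prev, curr)]
    positive = [max(d, 0) for d in diff]   # pixels that got brighter (object arrives)
    negative = [max(-d, 0) for d in diff]  # pixels that got darker (object leaves)
    return positive, negative

# One image row with a bright ball (value 9) on a dark background.
left, right = [0, 9, 0, 0], [0, 0, 9, 0]

abs_moving_right = absolute_difference(left, right)  # ball goes from index 1 to 2
abs_moving_left = absolute_difference(right, left)   # ball goes from index 2 to 1
pol_moving_right = polarity_fields(left, right)
pol_moving_left = polarity_fields(right, left)
```

Both absolute differences come out the same, whereas the two polarity pairs are swapped versions of each other, so a network fed the signed channels can tell where the ball came from and where it went.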
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of this, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
LumiX: Structured and Coherent Text-to-Intrinsic Generation
We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
comment: The code will be available at https://github.com/xhanxu/LumiX
Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset AAAI-26
Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on the attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform targeted desmoking on smoky features guided by the two types of smoke masks. To address the entanglement challenges of the two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through the smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.
comment: 12 pages, 15 figures. Accepted to AAAI-26 (Main Technical Track)
AttMetNet: Attention-Enhanced Deep Neural Network for Methane Plume Detection in Sentinel-2 Satellite Imagery
Methane is a powerful greenhouse gas that contributes significantly to global warming. Accurate detection of methane emissions is the key to taking timely action and minimizing their impact on climate change. We present AttMetNet, a novel attention-enhanced deep learning framework for methane plume detection with Sentinel-2 satellite imagery. The major challenge in developing a methane detection model is to accurately identify methane plumes from Sentinel-2's B11 and B12 bands while suppressing false positives caused by background variability and diverse land cover types. Traditional detection methods typically depend on the differences or ratios between these bands when comparing the scenes with and without plumes. However, these methods often require verification by a domain expert because they generate numerous false positives. Recent deep learning methods make some improvements using CNN-based architectures, but lack mechanisms to prioritize methane-specific features. AttMetNet introduces a methane-aware architecture that fuses the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. By jointly exploiting NDMI's plume-sensitive cues and attention-driven feature selection, AttMetNet selectively amplifies methane absorption features while suppressing background noise. This integration establishes a first-of-its-kind architecture tailored for robust methane plume detection in real satellite imagery. Additionally, we employ focal loss to address the severe class imbalance arising from both limited positive plume samples and sparse plume pixels within imagery. Furthermore, AttMetNet is trained on a real methane plume dataset, making it more robust to practical scenarios. Extensive experiments show that AttMetNet surpasses recent methods in methane plume detection with a lower false positive rate, better precision-recall balance, and higher IoU.
comment: 15 pages, 4 figures
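The AttMetNet abstract above mentions focal loss as its remedy for class imbalance. A minimal NumPy sketch of the standard binary focal loss follows; the default values `gamma=2.0` and `alpha=0.25` are common choices assumed here for illustration, since the abstract does not state the settings used.

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over all pixels.

    probs   : predicted plume probabilities in [0, 1]
    targets : 1 for plume pixels, 0 for background
    gamma, alpha : assumed hyperparameters, not taken from the paper
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    targets = np.asarray(targets)
    # Probability assigned to the true class of each pixel.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    # Class-balancing weight: alpha for positives, 1 - alpha for negatives.
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # The (1 - p_t)^gamma factor down-weights pixels that are already easy.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# A confidently correct plume pixel contributes far less than a missed one.
easy_loss = binary_focal_loss([0.95], [1])
hard_loss = binary_focal_loss([0.05], [1])
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual binary cross-entropy, which makes the role of the modulating factor easy to check.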
Reasoning-Aware Multimodal Fusion for Hateful Video Detection
Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone WACV 2026
Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.
comment: Accepted at WACV 2026
DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions WACV 2026
Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and the inductive bias of such CNNs may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five diverse datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
comment: Accepted to WACV 2026. Project page: https://tkhkaeio.github.io/projects/25-dfmamba/index.html
Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs
Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark, BayesBench, comprising four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
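For readers unfamiliar with the "optimal cue combination" referred to in the BayesBench abstract, the textbook psychophysics model fuses two independent Gaussian cues by inverse-variance weighting. The sketch below shows that standard rule only; the function name and the example numbers are assumptions, and it is not the benchmark's Bayesian Consistency Score.

```python
def combine_cues(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same quantity.

    Each cue is weighted by its reliability (inverse variance), which is
    the minimum-variance unbiased linear combination under this model.
    """
    reliability_a = 1.0 / var_a
    reliability_b = 1.0 / var_b
    weight_a = reliability_a / (reliability_a + reliability_b)
    weight_b = 1.0 - weight_a
    fused_mean = weight_a * mean_a + weight_b * mean_b
    fused_var = 1.0 / (reliability_a + reliability_b)
    return fused_mean, fused_var

# Hypothetical example: a noisy text cue and a sharper image cue.
fused_mean, fused_var = combine_cues(mean_a=10.0, var_a=4.0,
                                     mean_b=12.0, var_b=1.0)
# fused_mean is about 11.6 (pulled toward the reliable cue),
# fused_var is about 0.8 (lower than either single cue).
```

An observer that integrates cues efficiently should show a fused variance below that of the better single cue; an observer that ignores one modality will not.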
GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.
comment: 11 pages, 4 figures
Tissue-mask supported inter-subject whole-body image registration in the UK Biobank -- A method benchmarking study
The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would enable their body-wide spatial standardization, and region-/voxel-wise correlation analysis of non-imaging data with image-derived parameters (e.g., tissue volume or fat content). We propose a sex-stratified inter-subject whole-body MR image registration approach that uses subcutaneous adipose tissue and muscle masks from the state-of-the-art VIBESegmentator method to augment intensity-based graph-cut registration. The proposed method was evaluated on a subset of 4000 subjects by comparing it to an intensity-only method as well as two previously published registration methods, uniGradICON and MIRTK. The evaluation comprised overlap measures applied to the 71 VIBESegmentator masks: 1) Dice scores, and 2) voxel-wise label error frequency. Additionally, voxel-wise correlation between age and each of fat content and tissue volume was studied to exemplify the usefulness for medical research. The proposed method exhibited a mean Dice score of 0.77 / 0.75 across the cohort and the 71 masks for males/females, respectively. When compared to the intensity-only registration, the mean values were 6 percentage points (pp) higher for both sexes, and the label error frequency was decreased in most tissue regions. These differences were 9pp / 8pp against uniGradICON and 12pp / 13pp against MIRTK. Using the proposed method, the age-correlation maps were less noisy and showed higher anatomical alignment. In conclusion, the image registration method using two tissue masks improves whole-body registration of UK Biobank images.
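The Dice score used as the overlap measure in the registration study above can be computed per mask as follows. This is a minimal sketch assuming binary NumPy masks of equal shape; returning 1.0 when both masks are empty is a convention assumed here, not one stated in the abstract.

```python
import numpy as np

def dice_score(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)

# Toy 1D example: the masks share one of their two foreground voxels.
warped_mask = np.array([1, 1, 0, 0])
target_mask = np.array([0, 1, 1, 0])
example_dice = dice_score(warped_mask, target_mask)  # 2 * 1 / (2 + 2) = 0.5
```

A difference reported in percentage points, as in the abstract, corresponds to a difference of 0.01 per point on this 0 to 1 scale.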
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.
GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We perform broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models are available at https://github.com/MiliLab/GeoBridge.
ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection ICASSP 2026
Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.
comment: Submitted to ICASSP 2026 Conference
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes, a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.
comment: Under review
Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance
Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution
Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transformers (DiT), which lead to excessive computational and memory costs during training and inference. To address this issue, we develop a lightweight diffusion method, PGP-DiffSR, by removing redundant information from diffusion models under the guidance of the phase information of inputs for efficient image super-resolution. We first identify the intra-block redundancy within the diffusion backbone and propose a progressive pruning approach that removes redundant blocks while preserving restoration capability. We note that the phase information of the restored images produced by the pruned diffusion model is not well estimated. To solve this problem, we propose a phase-exchange adapter module that explores the phase information of the inputs to guide the pruned diffusion model for better restoration performance. We formulate the progressive pruning approach and the phase-exchange adapter module into a unified model. Extensive experiments demonstrate that our method achieves competitive restoration quality while significantly reducing computational load and memory consumption. The code is available at https://github.com/yzb1997/PGP-DiffSR.
comment: 10 pages
UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking
Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.
PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes
Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction due to reflection-geometry entanglement, and adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first, 3DGS's geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization cues are used to guide 3DGS's normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.
Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation
Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
comment: 13 pages, 1 figure, 2 tables. Open-source implementation available at https://github.com/athrael-soju/Snappy
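The patch-to-region idea in the spatially-grounded retrieval abstract above can be sketched as an area-weighted average of patch scores under an OCR bounding box. This is one plausible intersection metric under stated assumptions: `patch_scores` is a hypothetical grid of per-patch relevance already aggregated over query tokens, patches tile the page uniformly, and boxes are given in pixels as `(x0, y0, x1, y1)`. It is not the API of the Snappy implementation.

```python
import numpy as np

def region_relevance(patch_scores, image_size, box):
    """Area-weighted mean patch score inside an OCR bounding box.

    patch_scores : (rows, cols) array of per-patch relevance (assumed input)
    image_size   : (width, height) of the page image in pixels
    box          : (x0, y0, x1, y1) region in pixel coordinates
    """
    rows, cols = patch_scores.shape
    width, height = image_size
    patch_w, patch_h = width / cols, height / rows
    x0, y0, x1, y1 = box

    weighted_sum, covered_area = 0.0, 0.0
    for r in range(rows):
        for c in range(cols):
            # Overlap of patch (r, c) with the box along each axis.
            overlap_x = min(x1, (c + 1) * patch_w) - max(x0, c * patch_w)
            overlap_y = min(y1, (r + 1) * patch_h) - max(y0, r * patch_h)
            if overlap_x > 0 and overlap_y > 0:
                area = overlap_x * overlap_y
                weighted_sum += patch_scores[r, c] * area
                covered_area += area
    return weighted_sum / covered_area if covered_area > 0 else 0.0

# 2x2 patch grid on a 100x100 page; only the top-left patch is relevant.
scores = np.array([[1.0, 0.0],
                   [0.0, 0.0]])
inside = region_relevance(scores, (100, 100), (0, 0, 50, 50))       # 1.0
straddling = region_relevance(scores, (100, 100), (25, 25, 75, 75))  # 0.25
```

Regions can then be ranked or thresholded by this score so that only the relevant parts of a retrieved page are passed on as context.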
Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education
Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavioral data. In the context of education, these technologies offer new opportunities to study cognitive and affective processes such as engagement, attention, and performance. However, the lack of scalable, synchronized, and high-resolution tools for multimodal data acquisition continues to be a significant barrier to the widespread adoption of Multimodal Learning Analytics in real-world educational settings. This paper presents two complementary tools developed to address these challenges: Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches that enables real-time, multi-user monitoring of physiological and motion signals; and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.
comment: Accepted in Technological Ecosystems for Enhancing Multiculturality (TEEM) 2025
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate a single source-mixed sound at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.
PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking
We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture both traditional facial landmarks and pore-scale keypoint trajectories, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset's creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset is publicly available at: https://github.com/JHXion9/PoreTrack3D
Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening
Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models
Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.
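For context, the coupled ODEs mentioned in the abstract above are, in the standard continuous-normalizing-flow setting, the sampling ODE and the instantaneous change-of-variables ODE. The notation below ($x_t$, $v_\theta$, time running from $0$ at the prior to $1$ at the data) is an assumption made for illustration and is not taken from the paper.

```latex
\begin{aligned}
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t), \\
\frac{\mathrm{d}\log p_t(x_t)}{\mathrm{d}t} &= -\nabla\cdot v_\theta(x_t, t),
\end{aligned}
\qquad
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla\cdot v_\theta(x_t, t)\,\mathrm{d}t .
```

Both equations depend on the same velocity field $v_\theta$, which is why the trajectory and the cumulative divergence integral can in principle be predicted together.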
PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding
PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
Content-Aware Texturing for Gaussian Splatting
Gaussian Splatting has become the method of choice for 3D reconstruction and real-time rendering of captured real scenes. However, fine appearance details need to be represented as a large number of small Gaussian primitives, which can be wasteful when geometry and appearance exhibit different frequency characteristics. Inspired by the long tradition of texture mapping, we propose to use texture to represent detailed appearance where possible. Our main focus is to incorporate per-primitive texture maps that adapt to the scene in a principled manner during Gaussian Splatting optimization. We do this by proposing a new appearance representation for 2D Gaussian primitives with textures where the size of a texel is bounded by the image sampling frequency and adapted to the content of the input images. We achieve this by adaptively upscaling or downscaling the texture resolution during optimization. In addition, our approach enables control of the number of primitives during optimization based on texture resolution. We show that our approach performs favorably in image quality and total number of parameters used compared to alternative solutions for textured Gaussian primitives. Project page: https://repo-sam.inria.fr/nerphys/gs-texturing/
comment: Project Page: https://repo-sam.inria.fr/nerphys/gs-texturing/
SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction
Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2's built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.
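The averaging failure described in the abstract above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the 2D target coordinates, the names `left_cup` and `right_cup`, and both helper functions are hypothetical and are not part of SAM2Grasp.

```python
# Toy illustration of the multimodal averaging problem (hypothetical data).
# Two equally valid grasp targets sit at different (x, y) positions.
targets = {"left_cup": (-0.2, 0.5), "right_cup": (0.2, 0.5)}

# Demonstrations: half of them grasp each target.
demos_by_prompt = {
    "left_cup": [targets["left_cup"]] * 50,
    "right_cup": [targets["right_cup"]] * 50,
}
all_demos = demos_by_prompt["left_cup"] + demos_by_prompt["right_cup"]


def mse_optimal_action(actions):
    """Return the mean action, which minimizes mean squared error."""
    n = len(actions)
    return (sum(a[0] for a in actions) / n, sum(a[1] for a in actions) / n)


def prompt_conditioned_action(actions_by_prompt, prompt):
    """Condition on a prompt first, so only one mode remains to be averaged."""
    return mse_optimal_action(actions_by_prompt[prompt])


# Unconditioned regression lands between the two objects, on neither of them.
unconditioned = mse_optimal_action(all_demos)

# Conditioning on a prompt recovers a valid grasp for the selected object.
conditioned = prompt_conditioned_action(demos_by_prompt, "left_cup")
```

Here `unconditioned` falls near `(0.0, 0.5)`, the empty space between the two objects, while `conditioned` stays at the selected target.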
Co-speech Gesture Video Generation via Motion-Based Graph Retrieval
Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.
From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
OmniPerson: Unified Identity-Preserving Pedestrian Generation
Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation conditioned on any number of reference images, two kinds of person poses, and text, and also provides RGB-to-IR transfer and image super-resolution. 2) We design a Multi-Refer Fuser for robust identity preservation with any number of reference images as input, enabling OmniPerson to distill a unified identity from a set of multi-view reference images and generate high-fidelity pedestrians. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline, which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.
AVGGT: Rethinking Global Attention for Accelerating VGGT
Since DUSt3R, models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection
Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.
comment: PhD Dissertation
SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts
The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.
Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling
In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
comment: Accepted at the 13th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2025), IIT Mandi, India. 3 pages, 1 figure
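The residual formulation and the PSNR metric mentioned in the ViT-SR abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: nearest-neighbour upsampling stands in for bicubic interpolation to keep the example dependency-free, images are assumed to be nested lists of floats in [0, 1], and all function names are hypothetical.

```python
import math


def upsample_nearest(img, scale):
    """Stand-in for bicubic interpolation: repeat each pixel scale x scale times."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def add_residual(base, residual):
    """Final output = interpolated base + predicted residual, clipped to [0, 1]."""
    return [
        [min(1.0, max(0.0, b + r)) for b, r in zip(base_row, res_row)]
        for base_row, res_row in zip(base, residual)
    ]


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    sq_errs = [
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    mse = sum(sq_errs) / len(sq_errs)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


# Usage: a 1x2 low-resolution image upsampled by 2, then corrected by a residual
# (a zero residual here; in the described method a network would predict it).
low_res = [[0.0, 1.0]]
base = upsample_nearest(low_res, 2)
zero_residual = [[0.0] * 4 for _ in range(2)]
restored = add_residual(base, zero_residual)
```

With this formulation the network only has to predict the detail missing from the interpolated image rather than the full image.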
GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.
dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.
A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation
Test-time adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test-time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source codes and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.
comment: 45 pages, 18 figures
Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration
This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, when applied to partial-to-partial point set registration. Our primary goal is to uncover the causes of these problems and propose a comprehensible solution. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformer to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.
comment: 16 pages, 9 figures, 7 tables
YingVideo-MV: Music-Driven Multi-Stage Video Generation
While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .
comment: 18 pages, 6 figures
Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
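One way to picture the idea of replacing a causal mask with a spatially structured one is sketched below. This is an illustrative assumption, not the 3D-SLIM algorithm: the token layout (object tokens first, then instruction tokens), the k-nearest-neighbour rule used as a proxy for the geometry-adaptive constraint, and the function name `build_spatial_instruction_mask` are all hypothetical.

```python
import math


def build_spatial_instruction_mask(centers, num_instr, k=2):
    """Build a boolean attention mask where mask[i][j] means token i may attend to token j.

    Assumed layout: len(centers) object tokens followed by num_instr instruction tokens.
    """
    n_obj = len(centers)
    n = n_obj + num_instr
    mask = [[False] * n for _ in range(n)]

    for i in range(n_obj):
        # Geometry-based rule (assumed): attend to self plus the k nearest objects,
        # independent of where those objects appear in the token sequence.
        by_distance = sorted(
            range(n_obj), key=lambda j: math.dist(centers[i], centers[j])
        )
        for j in by_distance[: k + 1]:
            mask[i][j] = True
        # Instruction-aware rule (assumed): object tokens can see every
        # instruction token, even though those come later in the sequence.
        for j in range(n_obj, n):
            mask[i][j] = True

    for i in range(n_obj, n):
        # Instruction tokens keep ordinary causal attention.
        for j in range(i + 1):
            mask[i][j] = True

    return mask


# Usage: three objects on a line, two instruction tokens, one neighbour each.
example_mask = build_spatial_instruction_mask(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)], num_instr=2, k=1
)
```

In this sketch the far-away third object attends to the second object (its nearest neighbour) but not the first, whereas a causal mask would let it see both simply because they come earlier in the sequence.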
UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making
Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost. The evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.
G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline
We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.
WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling
Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.
TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution AAAI 2026
Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have attracted attention for their high efficiency. However, they often overlook the evolution of feature representations during training, which limits the expressiveness of synthetic data and weakens downstream performance. To address this issue, we propose Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model's training trajectory. At each training stage, TGDD captures evolving semantics by aligning the feature distribution between the synthetic and original dataset. Meanwhile, it introduces a distribution constraint regularization to reduce class overlap. This design helps synthetic data preserve both semantic diversity and representativeness, improving performance in downstream tasks. Without additional optimization overhead, TGDD achieves a favorable balance between performance and efficiency. Experiments on ten datasets demonstrate that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.
comment: Accepted in AAAI 2026
Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration
Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.
Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation
Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
comment: Project page at https://jianzongwu.github.io/projects/does-hearing-help-seeing/
See, Think, Learn: A Self-Taught Multimodal Reasoner
Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.
comment: Winter Conference on Applications of Computer Vision 2026
ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation
Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.
HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild NeurIPS 2025
Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single-room or single-floor environments. As a consequence, they cannot natively handle large multi-floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real-world benchmark designed to support progress toward full building-scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training-free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.
comment: NeurIPS 2025 (Datasets and Benchmarks Track) Project Page: https://houselayout3d.github.io
nuScenes Revisited: Progress and Challenges in Autonomous Driving
Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction, and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how nuScenes influenced a large number of datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.
comment: 18 pages, 17 figures
Temporal Dynamics Enhancer for Directly Trained Spiking Object Detectors
Spiking Neural Networks (SNNs), with their brain-inspired spatiotemporal dynamics and spike-driven computation, have emerged as promising energy-efficient alternatives to Artificial Neural Networks (ANNs). However, existing SNNs typically replicate inputs directly or aggregate them into frames at fixed intervals. Such strategies lead to neurons receiving nearly identical stimuli across time steps, severely limiting the model's expressive power, particularly in complex tasks like object detection. In this work, we propose the Temporal Dynamics Enhancer (TDE) to strengthen SNNs' capacity for temporal information modeling. TDE consists of two modules: a Spiking Encoder (SE) that generates diverse input stimuli across time steps, and an Attention Gating Module (AGM) that guides the SE generation based on inter-temporal dependencies. Moreover, to eliminate the high-energy multiplication operations introduced by the AGM, we propose a Spike-Driven Attention (SDA) to reduce attention-related energy consumption. Extensive experiments demonstrate that TDE can be seamlessly integrated into existing SNN-based detectors and consistently outperforms state-of-the-art methods, achieving mAP50-95 scores of 57.7% on the static PASCAL VOC dataset and 47.6% on the neuromorphic EvDET200K dataset. In terms of energy consumption, the SDA consumes only 0.240 times the energy of conventional attention modules.
Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation
Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources WACV 2026
In healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in few-shot adaptation, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD .
comment: WACV 2026
LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework
As a representative degenerative optic condition, glaucoma threatens millions because it is irreversible and severely affects the visual field. Mainly characterized by dimmed and blurred vision or peripheral vision loss, glaucoma is known to result from damage to the optic nerve caused by increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma-related work and clinical diagnosis has focused on detecting this damage using patient data from perimetry tests, optic papilla inspections, and tonometer-based IOP measurements. Recently, with advances in computer vision models such as VGG16 and Vision Transformers (ViT), AI-automated glaucoma detection and optic cup segmentation based on retinal fundus images or OCT have performed well in aiding conventional diagnosis. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, the possibility of spurious correlations in detection, and limited applicability to intervention analysis or clinical simulations. Thus, this research introduces a novel causal-representation-driven glaucoma detection model: LightHCG, an extremely lightweight convolutional VAE-based latent glaucoma representation model that can account for the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder-based unsupervised causal representation learning, LightHCG not only achieves higher glaucoma classification performance with 93-99% fewer weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2, or VGG16.
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
comment: Project page: https://worldmm.github.io
GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
comment: 26 pages
Generalizing Vision-Language Models with Dedicated Prompt Guidance AAAI26
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
comment: Accepted to AAAI26
MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture
Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
comment: 9 pages, 4 figures, 3 tables
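For reference, a minimal Python sketch of the standard Tversky loss that the MitUNet abstract above builds on. This is the general textbook form, not the authors' code. In the convention used here, `alpha` weights false positives and `beta` weights false negatives; the default values and the name `tversky_loss` are illustrative assumptions, since the abstract does not state the hyperparameters.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss over flat sequences of per-pixel values.

    pred:   predicted wall probabilities in [0, 1]
    target: ground-truth labels (0 or 1)
    alpha:  weight on false positives (assumed convention)
    beta:   weight on false negatives (assumed convention)
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


if __name__ == "__main__":
    pred = [1.0, 1.0, 1.0, 0.0]    # two spurious wall pixels
    target = [1, 0, 0, 0]
    print(tversky_loss(pred, target, alpha=0.7, beta=0.3))
    print(tversky_loss(pred, target, alpha=0.3, beta=0.7))
```

With `alpha` greater than `beta`, the same false-positive pixels incur a larger loss, which is one way to bias training toward suppressing boundary noise; with `alpha = beta = 0.5` the expression reduces to the Dice loss.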
WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate
Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image in tasks such as counting. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g., number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while VLM counting performance remains challenging, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.
PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. Code and models are publicly available at https://raniatze.github.io/pritti/.
comment: Project page: https://raniatze.github.io/pritti/
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
APTx Neuron: A Unified Trainable Neuron Architecture Integrating Activation and Computation
We propose the APTx Neuron, a novel, unified neural computation unit that integrates non-linear activation and linear transformation into a single trainable expression. The APTx Neuron is derived from the APTx activation function, thereby eliminating the need for separate activation layers and making the architecture both optimization-efficient and elegant. The proposed neuron follows the functional form $y = \sum_{i=1}^{n} \left((\alpha_i + \tanh(\beta_i x_i)) \cdot \gamma_i x_i\right) + \delta$, where all parameters $\alpha_i$, $\beta_i$, $\gamma_i$, and $\delta$ are trainable. We validate our APTx Neuron-based architecture on the MNIST dataset, achieving up to $96.69\%$ test accuracy within 11 epochs using approximately 332K trainable parameters. The results highlight the superior expressiveness and training efficiency of the APTx Neuron compared to traditional neurons, pointing toward a new paradigm in unified neuron design and the architectures built upon it. Source code is available at https://github.com/mr-ravin/aptx_neuron.
comment: 12 pages, 2 figures, 1 table. Includes a GitHub repository for MNIST experiments and a PyPI package for APTx Neuron implementation
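The functional form stated in the APTx Neuron abstract above can be evaluated directly. The following is a minimal forward-pass sketch in plain Python, not the authors' repository or PyPI package; the function name `aptx_neuron` and the argument layout are illustrative assumptions, and no training logic is shown.

```python
import math


def aptx_neuron(x, alpha, beta, gamma, delta):
    """Evaluate y = sum_i (alpha_i + tanh(beta_i * x_i)) * gamma_i * x_i + delta.

    x, alpha, beta, gamma are equal-length sequences of floats; delta is a
    scalar. In the paper's setting all four parameter sets are trainable.
    """
    if not (len(x) == len(alpha) == len(beta) == len(gamma)):
        raise ValueError("x, alpha, beta and gamma must have the same length")
    total = 0.0
    for xi, a, b, g in zip(x, alpha, beta, gamma):
        total += (a + math.tanh(b * xi)) * g * xi
    return total + delta


if __name__ == "__main__":
    # With beta_i = 0 the tanh term vanishes and the neuron becomes linear:
    # (1 + 0) * 2 * 1 + 0.5 = 2.5
    print(aptx_neuron([1.0], alpha=[1.0], beta=[0.0], gamma=[2.0], delta=0.5))
```

Setting every `beta_i` to zero reduces the expression to an ordinary weighted sum with weights `alpha_i * gamma_i` and bias `delta`, which shows how the form contains a conventional linear neuron as a special case.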
Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data. Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/.
comment: Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/. The paper contains 8 pages, 9 figures, 6 tables
Multimodal LLMs See Sentiment
Understanding how visual content communicates sentiment is critical in an era where online interaction is increasingly dominated by this kind of media on social platforms. However, this remains a challenging problem, as sentiment perception is closely tied to complex, scene-level semantics. In this paper, we propose an original framework, MLLMsent, to investigate the sentiment reasoning capabilities of Multimodal Large Language Models (MLLMs) through three perspectives: (1) using those MLLMs for direct sentiment classification from images; (2) associating them with pre-trained LLMs for sentiment analysis on automatically generated image descriptions; and (3) fine-tuning the LLMs on sentiment-labeled image descriptions. Experiments on a recent and established benchmark demonstrate that our proposal, particularly the fine-tuned approach, achieves state-of-the-art results outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively, across different levels of evaluators' agreement and sentiment polarity categories. Remarkably, in a cross-dataset test, without any training on these new data, our model still outperforms, by up to 8.26%, the best runner-up, which has been trained directly on them. These results highlight the potential of the proposed visual reasoning scheme for advancing affective computing, while also establishing new benchmarks for future research.
comment: 12 pages, 7 figures
AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired
This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the "Find an Object" function achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.
OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models WACV2026
OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6,000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods approached chance level. OpenLVLM-MIA, designed to be a transparent and unbiased benchmark, clarifies certain limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.
comment: WACV2026 Accepted
OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic
Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.
DehazeGS: Seeing Through Fog with 3D Gaussian Splatting AAAI2026
Current novel view synthesis methods are typically designed for high-quality and clean input images. However, in foggy scenes, scattering and attenuation can significantly degrade the quality of rendering. Although NeRF-based dehazing approaches have been developed, their reliance on deep fully connected neural networks and per-ray sampling strategies leads to high computational costs. Furthermore, NeRF's implicit representation limits its ability to recover fine-grained details from hazy scenes. To overcome these limitations, we propose learning an explicit Gaussian representation to explain the formation mechanism of foggy images through a physically forward rendering process. Our method, DehazeGS, reconstructs and renders fog-free scenes using only multi-view foggy images as input. Specifically, based on the atmospheric scattering model, we simulate the formation of fog by establishing the transmission function directly onto Gaussian primitives via depth-to-transmission mapping. During training, we jointly learn the atmospheric light and scattering coefficients while optimizing the Gaussian representation of foggy scenes. At inference time, we remove the effects of scattering and attenuation in Gaussian distributions and directly render the scene to obtain dehazed views. Experiments on both real-world and synthetic foggy datasets demonstrate that DehazeGS achieves state-of-the-art performance. Visualizations are available at https://dehazegs.github.io/
comment: 9 pages, 5 figures. Accepted by AAAI2026. Visualizations are available at https://dehazegs.github.io/
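The atmospheric scattering model referenced in the DehazeGS abstract is commonly written as I = J·t + A·(1 − t) with transmission t = exp(−β·d), where J is the clear radiance, A the atmospheric light, β the scattering coefficient, and d the depth. The sketch below applies and inverts that standard model for a single scalar intensity; it is an illustration of the image-formation equation only and does not reproduce how DehazeGS attaches transmission to Gaussian primitives. All function names and the `t_min` clamp are assumptions.

```python
import math


def transmission(depth, beta):
    """Depth-to-transmission mapping t = exp(-beta * depth)."""
    return math.exp(-beta * depth)


def add_fog(clear, depth, beta, airlight):
    """Forward model: attenuated scene radiance plus scattered atmospheric light."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)


def remove_fog(foggy, depth, beta, airlight, t_min=1e-3):
    """Invert the forward model; t is clamped (assumed safeguard) to avoid dividing by ~0."""
    t = max(transmission(depth, beta), t_min)
    return (foggy - airlight * (1.0 - t)) / t
```

A round trip `remove_fog(add_fog(J, d, beta, A), d, beta, A)` recovers `J` whenever the transmission stays above the clamp.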
SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering WACV 2026
Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth on Human3.6M and CMU, while reducing the cross-dataset error by up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.
comment: WACV 2026
VeLU: Variance-enhanced Learning Unit for Deep Neural Networks
Activation functions play a critical role in deep neural networks by shaping gradient flow, optimization stability, and generalization. While ReLU remains widely used due to its simplicity, it suffers from gradient sparsity and dead-neuron issues and offers no adaptivity to input statistics. Smooth alternatives such as Swish and GELU improve gradient propagation but still apply a fixed transformation regardless of the activation distribution. In this paper, we propose VeLU, a Variance-enhanced Learning Unit that introduces variance-aware and distributionally aligned nonlinearity through a principled combination of ArcTan-ArcSin transformations, adaptive scaling, and Wasserstein-2 regularization (Optimal Transport). This design enables VeLU to modulate its response based on local activation variance, mitigate internal covariate shift at the activation level, and improve training stability without adding learnable parameters or architectural overhead. Extensive experiments across six deep neural networks show that VeLU outperforms ReLU, ReLU6, Swish, and GELU on 12 vision benchmarks. The implementation of VeLU is publicly available on GitHub.
comment: 16 pages, 5 figures
Toward Content-based Indexing and Retrieval of Head and Neck CT with Abscess Segmentation
Abscesses in the head and neck represent an acute infectious process that can potentially lead to sepsis or mortality if not diagnosed and managed promptly. Accurate detection and delineation of these lesions on imaging are essential for diagnosis, treatment planning, and surgical intervention. In this study, we introduce AbscessHeNe, a curated and comprehensively annotated dataset comprising 4,926 contrast-enhanced CT slices with clinically confirmed head and neck abscesses. The dataset is designed to facilitate the development of robust semantic segmentation models that can accurately delineate abscess boundaries and evaluate deep neck space involvement, thereby supporting informed clinical decision-making. To establish performance baselines, we evaluate several state-of-the-art segmentation architectures, including CNN, Transformer, and Mamba-based models. The highest-performing model achieved a Dice Similarity Coefficient of 0.39, Intersection-over-Union of 0.27, and Normalized Surface Distance of 0.67, indicating the challenges of this task and the need for further research. Beyond segmentation, AbscessHeNe is structured for future applications in content-based multimedia indexing and case-based retrieval. Each CT scan is linked with pixel-level annotations and clinical metadata, providing a foundation for building intelligent retrieval systems and supporting knowledge-driven clinical workflows. The dataset will be made publicly available at https://github.com/drthaodao3101/AbscessHeNe.git.
comment: The 2025 IEEE International Conference on Content-Based Multimedia Indexing (IEEE CBMI)
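The Dice Similarity Coefficient and Intersection-over-Union reported in the AbscessHeNe abstract are standard overlap metrics for binary masks. A minimal sketch of how they are typically computed for one mask pair follows; the function name and the convention of returning 1.0 for two empty masks are assumptions, and the paper's exact averaging protocol is not specified here.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two flat binary masks given as sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou
```

For a single mask pair the two metrics are related by IoU = Dice / (2 − Dice), so Dice is never smaller than IoU; dataset-level averages need not satisfy this identity exactly.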
MasHeNe: A Benchmark for Head and Neck CT Mass Segmentation using Window-Enhanced Mamba with Frequency-Domain Integration
Head and neck masses are space-occupying lesions that can compress the airway and esophagus and may affect nerves and blood vessels. Available public datasets primarily focus on malignant lesions and often overlook other space-occupying conditions in this region. To address this gap, we introduce MasHeNe, an initial dataset of 3,779 contrast-enhanced CT slices that includes both tumors and cysts with pixel-level annotations. We also establish a benchmark using standard segmentation baselines and report common metrics to enable fair comparison. In addition, we propose the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the input appearance before feature extraction. It further uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. On MasHeNe, WEMF attains the best performance among evaluated methods, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm. These results indicate stable and strong performance on this challenging task. MasHeNe provides a benchmark for head-and-neck mass segmentation beyond malignancy-only datasets. The observed error patterns also suggest that this task remains challenging and requires further research. Our dataset and code are available at https://github.com/drthaodao3101/MasHeNe.git.
comment: The 14th International Symposium on Information and Communication Technology Conference SoICT 2025
Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction
This study investigates the feasibility of zero-shot self-supervised learning reconstruction for reducing breath-hold times in magnetic resonance cholangiopancreatography (MRCP). Breath-hold MRCP was acquired from 11 healthy volunteers on 3T scanners using an incoherent k-space sampling pattern, leading to a 14-second acquisition time and an acceleration factor of R=25. Zero-shot reconstruction was compared with parallel imaging of respiratory-triggered MRCP (338s, R=3) and compressed sensing reconstruction. For two volunteers, breath-hold scans (40s, R=6) were additionally acquired and retrospectively undersampled to R=25 to compute peak signal-to-noise ratio (PSNR). To address long zero-shot training time, the n+m full stages of the zero-shot learning were divided into two parts to reduce backpropagation depth during training: 1) n frozen stages initialized with an n-stage pretrained network and 2) m trainable stages initialized either randomly or with an m-stage pretrained network. Efficiency of our approach was assessed by varying initialization strategies and the number of trainable stages using the retrospectively undersampled data. Zero-shot reconstruction significantly improved visual image quality over compressed sensing, particularly in SNR and ductal delineation, and achieved image quality comparable to that of successful respiratory-triggered acquisitions with regular breathing patterns. Improved initializations enhanced PSNR and reduced reconstruction time. Adjusting frozen/trainable configurations demonstrated that PSNR decreased only slightly from 38.25 dB (0/13) to 37.67 dB (12/1), while training time decreased by up to 6.7-fold. Zero-shot learning delivers high-fidelity MRCP reconstructions with reduced breath-hold times, and the proposed partially trainable approach offers a practical solution for translation into time-constrained clinical workflows.
comment: 24 pages, 8 figures, 2 tables
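The MRCP abstract above compares configurations by peak signal-to-noise ratio (PSNR) against a less-accelerated reference. A minimal sketch of the standard definition, PSNR = 10·log10(peak² / MSE), is given below; the function name, the flat-list image representation, and the default `peak=1.0` are assumptions, and the study's own normalization is not stated in the abstract.

```python
import math


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

On this scale, the reported drop from 38.25 dB to 37.67 dB corresponds to a difference of 0.58 dB.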
Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination" where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval.
NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we present a new training-free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals' bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training/fine-tuning, outperforms the best RGB and RGB-D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
comment: 9 pages, 3 figures, 5 tables
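The NOCTIS abstract mentions an "average maximum similarity of the patch embeddings" as one ingredient of its matching score. The sketch below shows one plain reading of that phrase: for each proposal patch, take the best cosine similarity over the template patches, then average. It deliberately omits the cyclic thresholding mechanism, whose details are not given in the abstract, and every name in it is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_max_patch_similarity(proposal_patches, template_patches):
    """Mean over proposal patches of the best cosine match among template patches."""
    if not proposal_patches or not template_patches:
        return 0.0
    best = [max(cosine(p, t) for t in template_patches) for p in proposal_patches]
    return sum(best) / len(best)
```

Because each proposal patch is free to pick its own best template patch, repeated textures can produce many-to-one matches, which is the kind of instability the abstract says cyclic thresholding is meant to mitigate.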
Aligning Diffusion Models with Noise-Conditioned Perception
Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual appeal, and user preference. Unlike LMs, Diffusion Models typically optimize in pixel or VAE space, which does not align well with human perception, leading to slower and less efficient training during the preference alignment stage. We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues. Our approach involves fine-tuning Stable Diffusion 1.5 and XL using Direct Preference Optimization (DPO), Contrastive Preference Optimization (CPO), and supervised fine-tuning (SFT) within this embedding space. This method significantly outperforms standard latent-space implementations across various metrics, including quality and computational cost. For SDXL, our approach provides 60.8% general preference, 62.2% visual appeal, and 52.1% prompt following against the original open-sourced SDXL-DPO on the PartiPrompts dataset, while significantly reducing compute. Our approach not only improves the efficiency and quality of human preference alignment for diffusion models but is also easily integrable with other optimization techniques. The training code and LoRA weights will be available here: https://huggingface.co/alexgambashidze/SDXL_NCP-DPO_v0.1
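For readers unfamiliar with Direct Preference Optimization, the generic per-pair DPO objective is −log σ(β·[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w is the preferred and y_l the rejected sample. The sketch below evaluates that generic loss from four log-likelihood-style scalars. It is not the embedding-space perceptual objective proposed in the abstract above, and the function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_* come from the model being tuned, ref_logp_* from the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as softplus(-margin) in a numerically stable way.
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the tuned model equals the reference, the margin is zero and the loss is log 2; the loss decreases as the model raises the preferred sample's score relative to the rejected one.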
Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations
Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations, i.e., natural language descriptions of the actions performed by the camera wearer, which contain clues about manipulated objects. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations in a weakly-supervised regime. Narrations are not used at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations. Code and data can be found at https://fpv-iplab.github.io/WISH.
comment: Under consideration at Pattern Recognition Letters
3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation
The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and attributes. However, unlike image-conditional generation methods such as ControlNet, MIG techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL, primarily due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering. In this paper, we introduce Depth-Driven Decoupled Instance Synthesis (3DIS), a novel framework that decouples the MIG process into two stages: (i) generating a coarse scene depth map for accurate instance positioning and scene composition, and (ii) rendering fine-grained attributes using pre-trained ControlNet on any foundational model, without additional training. Our 3DIS framework integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering. Extensive experiments on COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. Notably, 3DIS offers seamless compatibility with diverse foundational models, providing a robust, adaptable solution for advanced multi-instance generation. The code is available at: https://github.com/limuloo/3DIS.
comment: 10 pages
PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
We introduce Primitive-based Representations of Uncertainty (PRIMU), a post-hoc uncertainty estimation (UE) framework for Gaussian Splatting (GS). Reliable UE is essential for deploying GS in safety-critical domains such as robotics and medicine. Existing approaches typically estimate Gaussian-primitive variances and rely on the rendering process to obtain pixel-wise uncertainties. In contrast, we construct primitive-level representations of error and visibility/coverage from training views, capturing interpretable uncertainty information. These representations are obtained by projecting view-dependent training errors and coverage statistics onto the primitives. Uncertainties for novel views are inferred by rendering these primitive-level representations, producing uncertainty feature maps, which are aggregated through pixel-wise regression on holdout data. We analyze combinations of uncertainty feature maps and regression models to understand how their interactions affect prediction accuracy and generalization. PRIMU also enables an effective active view selection strategy by directly leveraging these uncertainty feature maps. Additionally, we study the effect of separating splatting into foreground and background regions. Our estimates show strong correlations with true errors, outperforming state-of-the-art methods, especially for depth UE and foreground objects. Finally, our regression models show generalization capabilities to unseen scenes, enabling UE without additional holdout data.
comment: Revised writing and figures; additional Gaussian Splatting experiments; added baselines and datasets; active view-selection experiments
MRI Super-Resolution with Deep Learning: A Comprehensive Survey
High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.
comment: 41 pages
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). Such a history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights for actively processing task-relevant visual tokens based on historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
comment: 18 pages, 10 figures
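The AVA-VLA abstract describes computing soft weights over visual tokens from a recurrent state. One simple way such a gate could look is sketched below: each token is scored by its dot product with the state, squashed by a sigmoid, and used to scale the token. This is purely an illustrative assumption; the abstract does not specify the scoring function, and the names `soft_token_weights` and `modulate` are hypothetical.

```python
import math


def soft_token_weights(tokens, state):
    """Return one weight in (0, 1) per token: sigmoid(dot(token, state))."""
    weights = []
    for tok in tokens:
        score = sum(t * s for t, s in zip(tok, state))
        # Numerically stable sigmoid for either sign of the score.
        if score >= 0.0:
            weights.append(1.0 / (1.0 + math.exp(-score)))
        else:
            e = math.exp(score)
            weights.append(e / (1.0 + e))
    return weights


def modulate(tokens, weights):
    """Scale each token vector by its soft weight (history-conditioned gating)."""
    return [[w * t for t in tok] for tok, w in zip(tokens, weights)]
```

Tokens aligned with the state keep most of their magnitude, while tokens pointing away from it are attenuated rather than dropped, which keeps the operation differentiable in a learned setting.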
End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer
Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet.
Bias Beyond Demographics: Probing Decision Boundaries in Black-Box LVLMs via Counterfactual VQA
Recent advances in large vision-language models (LVLMs) have amplified concerns about fairness, yet existing evaluations remain confined to demographic attributes and often conflate fairness with refusal behavior. This paper broadens the scope of fairness by introducing a counterfactual VQA benchmark that probes the decision boundaries of closed-source LVLMs under controlled context shifts. Each image pair differs in a single visual attribute that has been validated as irrelevant to the question, enabling ground-truth-free and refusal-aware analysis of reasoning stability. Comprehensive experiments reveal that non-demographic attributes, such as environmental context or social behavior, distort LVLM decision-making more strongly than demographic ones. Moreover, instruction-based debiasing shows limited effectiveness and can even amplify these asymmetries, whereas exposure to a small number of human-norm-validated examples from our benchmark encourages more consistent and balanced responses, highlighting its potential not only as an evaluative framework but also as a means for understanding and improving model behavior. Together, these results provide a practical basis for auditing contextual biases even in black-box LVLMs and contribute to more transparent and equitable multimodal reasoning.
FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing
Text-to-image (T2I) models have advanced creative content generation, yet their reliance on large uncurated datasets often reproduces societal biases. We present FairT2I, a training-free and interactive framework grounded in a mathematically principled latent variable guidance formulation. This formulation decomposes the generative score function into attribute-conditioned components and reweights them according to a defined distribution, providing a unified and flexible mechanism for bias-aware generation that also subsumes many existing ad hoc debiasing approaches as special cases. Building upon this foundation, FairT2I incorporates (1) latent variable guidance as the core mechanism, (2) LLM-based bias detection to automatically infer bias-prone categories and attributes from text prompts as part of the latent structure, and (3) attribute resampling, which allows users to adjust or redefine the attribute distribution based on uniform, real-world, or user-specified statistics. The accompanying user interface supports this pipeline by enabling users to inspect detected biases, modify attributes or weights, and generate debiased images in real time. Experimental results show that LLMs outperform average human annotators in the number and granularity of detected bias categories and attributes. Moreover, FairT2I achieves superior performance to baseline models in both societal bias mitigation and image diversity, while preserving image quality and prompt fidelity.
Detect Anything 3D in the Wild
Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which stabilizes early training in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at our code repository.
Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior
Recent advances in dance generation have enabled the automatic synthesis of 3D dance motions. However, existing methods still face significant challenges in simultaneously achieving high realism, precise dance-music synchronization, diverse motion expression, and physical plausibility. To address these limitations, we propose a novel approach that leverages a generative masked text-to-motion model as a distribution prior to learn a probabilistic mapping from diverse guidance signals, including music, genre, and pose, into high-quality dance motion sequences. Our framework also supports semantic motion editing, such as motion inpainting and body part modification. Specifically, we introduce a multi-tower masked motion model that integrates a text-conditioned masked motion backbone with two parallel, modality-specific branches: a music-guidance tower and a pose-guidance tower. The model is trained using synchronized and progressive masked training, which allows effective infusion of the pretrained text-to-motion prior into the dance synthesis process while enabling each guidance branch to optimize independently through its own loss function, mitigating gradient interference. During inference, we introduce classifier-free logits guidance and pose-guided token optimization to strengthen the influence of music, genre, and pose signals. Extensive experiments demonstrate that our method sets a new state of the art in dance generation, significantly advancing both the quality and editability over existing approaches. Project Page available at https://foram-s1.github.io/DanceMosaic/
Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey
Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of environments through advanced neural representations. It has inspired pioneering solutions for vital tasks in autonomous driving, such as dense mapping and closed-loop simulation, as well as comprehensive scene features for driving scene understanding and reasoning. Given the rapid growth in related research, this survey provides a comprehensive review of both technical evolutions and practical applications in autonomous driving. We begin with an introduction to the preliminaries of learning-based 3D reconstruction to provide a solid technical foundation, then progress to a rigorous, multi-dimensional examination of cutting-edge methodologies, systematically organized according to the distinctive technical requirements and fundamental challenges of autonomous driving. Through analyzing and summarizing development trends and cutting-edge research, we identify existing technical challenges, along with insufficient disclosure of on-board validation and safety verification details in the current literature, and ultimately suggest potential directions to guide future studies.
comment: Published in IEEE Trans. on Intelligent Transportation Systems
CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance grounded cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that it outperforms its counterparts with global VL alignment. Compared to vanilla CLIP, CT-GLIP achieves average performance improvements of 15.1% in F1 score, 1.9% in AUC, and 3.2% in DSC for zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.
MegaSR: Mining Customized Semantics and Expressive Guidance for Real-World Image Super-Resolution
Text-to-image (T2I) models have ushered in a new era of real-world image super-resolution (Real-ISR) due to their rich internal implicit knowledge for multimodal learning. Although introducing high-level semantic priors and dense pixel guidance has led to advances in reconstruction, we identify several critical phenomena by analyzing the behavior of existing T2I-based Real-ISR methods: (1) Fine detail deficiency, which ultimately leads to incorrect reconstruction in local regions. (2) Block-wise semantic inconsistency, which results in distracted semantic interpretations across U-Net blocks. (3) Edge ambiguity, which causes noticeable structural degradation. Building upon these observations, we first introduce MegaSR, which enhances the T2I-based Real-ISR models with fine-grained customized semantics and expressive guidance to unlock semantically rich and structurally consistent reconstruction. Then, we propose the Customized Semantics Module (CSM) to supplement fine-grained semantics from the image modality and regulate the semantic fusion between multi-level knowledge to realize customization for different U-Net blocks. Besides the semantic adaptation, we identify expressive multimodal signals through pair-wise comparisons and introduce the Multimodal Signal Fusion Module (MSFM) to aggregate them for structurally consistent reconstruction. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of the method. Notably, it not only achieves state-of-the-art performance on quality-driven metrics but also remains competitive on fidelity-focused metrics, striking a balance between perceptual realism and faithful content reconstruction.
Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models
Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, broadly compatible with existing DPO-style alignment frameworks and adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
comment: The code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO
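The safeguard idea in the abstract above (scale the loser gradient by its alignment with the winner gradient so the winner error does not increase to first order) can be illustrated with a small sketch. This is not the paper's closed-form coefficient, which the abstract does not state. It assumes a simplified objective `L_win - lam * L_lose` and plain gradient descent; the names `safeguarded_scale`, `first_order_winner_change`, and `max_scale` are hypothetical.

```python
def _dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def safeguarded_scale(g_win, g_lose, max_scale=1.0):
    """Largest lam in [0, max_scale] for which a descent step on
    L_win - lam * L_lose does not increase L_win to first order.

    Assumption (not from the paper): the update direction is
    -(g_win - lam * g_lose), so the first-order change in L_win is
    proportional to -(|g_win|^2 - lam * <g_win, g_lose>).
    """
    alignment = _dot(g_win, g_lose)
    if alignment <= 0.0:
        # Pushing the loser up does not hurt the winner to first order.
        return max_scale
    return min(max_scale, _dot(g_win, g_win) / alignment)


def first_order_winner_change(g_win, g_lose, lam):
    """First-order change in L_win per unit step size for a given lam."""
    return -(_dot(g_win, g_win) - lam * _dot(g_win, g_lose))


if __name__ == "__main__":
    g_w, g_l = [1.0, 0.0], [2.0, 0.0]  # synthetic, strongly aligned gradients
    lam = safeguarded_scale(g_w, g_l)
    print(lam, first_order_winner_change(g_w, g_l, lam))
```

With strongly aligned gradients the scale shrinks below `max_scale`; with orthogonal or opposing gradients the loser term is left at full strength.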
Mutually-Aware Feature Learning for Few-Shot Object Counting
Few-shot object counting has garnered significant attention for its practicality as it aims to count target objects in a query image based on given exemplars without additional training. However, the prevailing extract-and-match approach has a shortcoming: query and exemplar features lack interaction during feature extraction since they are extracted independently and later correlated based on similarity. This can lead to insufficient target awareness and confusion in identifying the actual target when multiple class objects coexist. To address this, we propose a novel framework, Mutually-Aware FEAture learning (MAFEA), which encodes query and exemplar features with mutual awareness from the outset. By encouraging interaction throughout the pipeline, we obtain target-aware features robust to a multi-category scenario. Furthermore, we introduce a background token to effectively associate the query's target region with exemplars and decouple its background region. Our extensive experiments demonstrate that our model achieves state-of-the-art performance on FSCD-LVIS and FSC-147 benchmarks with remarkably reduced target confusion.
comment: Accepted to Pattern Recognition 2025
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules -- Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and Value-Guided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures short-term local details via grid maps. MGAF aligns these dual-map representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses are conducted, demonstrating that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.
comment: 11 pages, 7 figures
AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
comment: Project Page: https://dahlian00.github.io/AlignBench/
WorldMem: Long-term Consistent World Simulation with Memory
World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
comment: Project page at https://xizaoqu.github.io/worldmem/
Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives CVPR 2026
Continual learning in visual understanding aims to deal with catastrophic forgetting in Multimodal Large Language Models (MLLMs). MLLMs deployed on devices have to continuously adapt to dynamic scenarios in downstream tasks, such as variations in background and perspective, to effectively perform complex visual tasks. To this end, we construct a multimodal visual understanding dataset (MSVQA) encompassing four different scenarios and perspectives including high altitude, underwater, low altitude and indoor, to investigate the catastrophic forgetting in MLLMs under the dynamics of scenario shifts in real-world data streams. Furthermore, we propose mUltimodal coNtInual learning with MLLMs From multi-scenarIo pERspectives (UNIFIER) to address visual discrepancies while learning different scenarios. Specifically, it decouples the visual information from different scenarios into distinct branches within each vision block and projects them into the same feature space. A consistency constraint is imposed on the features of each branch to maintain the stability of visual representations across scenarios. Extensive experiments on the MSVQA dataset demonstrate that UNIFIER effectively alleviates forgetting of cross-scenario tasks and achieves knowledge accumulation within the same scenario.
comment: 18 pages, 16 figures. This is a preprint version of a paper submitted to CVPR 2026
TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
comment: for more information, see https://scottykma.github.io/tempomaster-gitpage/
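The coarse-to-fine ordering described in the TempoMaster abstract above can be sketched as a frame-index schedule. The abstract only says the frame rate is increased progressively; the doubling stride used here, and the names `frame_rate_levels`, `num_frames`, and `num_levels`, are assumptions made for illustration.

```python
def frame_rate_levels(num_frames, num_levels):
    """Return, per level, the frame indices newly generated at that level.

    Assumption (not stated in the abstract): the temporal stride halves
    at every level, ending at stride 1 (full frame rate).
    """
    levels = []
    seen = set()
    for level in range(num_levels):
        stride = 2 ** (num_levels - 1 - level)
        # Frames already produced at coarser levels are kept, not regenerated.
        new_frames = [i for i in range(0, num_frames, stride) if i not in seen]
        seen.update(new_frames)
        levels.append(new_frames)
    return levels


if __name__ == "__main__":
    for level, frames in enumerate(frame_rate_levels(9, 3)):
        print(level, frames)
```

For 9 frames and 3 levels, the first level holds the sparse blueprint frames and each later level fills in the midpoints between frames that already exist.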
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge SP
Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling. Project page: https://heyumeng.com/SPARK/index.html.
comment: Project page: https://heyumeng.com/SPARK/index.html. 17 pages, 7 figures
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. Motivated by human hierarchical temporal search strategies, we propose TimeSearch, a novel framework enabling LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) Spotlight efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), explicitly binding visual features with timestamps; 2) Reflection evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes temporal search based on reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses previous state-of-the-art, improving the accuracy from 41.8% to 51.5% on the LVBench. Additionally, experiments on temporal grounding demonstrate that appropriate TAFR is adequate to effectively stimulate the surprising temporal grounding ability of LVLMs in a simpler yet versatile manner, which improves mIoU on Charades-STA by 11.8%. The code will be released.
Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression AAAI 2026
Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20×. Code is released at: https://github.com/zhengchen1999/SODEC.
comment: Accepted to AAAI 2026. Code is available at: https://github.com/zhengchen1999/SODEC
SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition
Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM's response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at https://github.com/firework8/SkeletonAgent.
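The Questioner step in the SkeletonAgent abstract above (find the most frequently confused classes and pass them to the LLM as context) can be sketched with a simple confusion count. The function name `most_confused_pairs`, the `top_k` parameter, and the toy labels are hypothetical; the abstract does not describe how the framework actually ranks confusions or formats its prompts.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top_k=3):
    """Count (true_class, predicted_class) errors and return the top_k.

    Only misclassified samples are counted, so the result is a ranked
    list of the off-diagonal confusion-matrix entries.
    """
    confusions = Counter(
        (t, p) for t, p in zip(y_true, y_pred) if t != p
    )
    return confusions.most_common(top_k)


if __name__ == "__main__":
    # Synthetic labels for illustration only.
    y_true = ["drink", "drink", "drink", "wave", "clap"]
    y_pred = ["eat", "eat", "drink", "clap", "clap"]
    for (true_cls, pred_cls), count in most_confused_pairs(y_true, y_pred):
        print(f"{true_cls} mistaken for {pred_cls}: {count}")
```

A ranked list like this could then be inserted into an LLM query asking which joints distinguish each confused pair.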
Cross-Cancer Knowledge Transfer in WSI-based Prognosis Prediction
Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm in which each cancer corresponds to a single model. However, this paradigm naturally struggles to scale to rare tumors and cannot leverage knowledge from other cancers. While multi-task learning frameworks have been explored recently, they often place high demands on computational resources and require extensive training on ultra-large, multi-cancer WSI datasets. To this end, this paper shifts the paradigm to knowledge transfer and presents the first preliminary yet systematic study on cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It comprises three major parts. (1) We curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors). (2) Beyond a simple evaluation merely for benchmarking, we design a range of experiments to gain deeper insights into the underlying mechanism behind transferability. (3) We further show the utility of cross-cancer knowledge transfer, by proposing a routing-based baseline approach (ROUPKT) that could often efficiently utilize the knowledge transferred from off-the-shelf models of other cancers. CROPKT could serve as an inception that lays the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.
comment: 24 pages (11 figures and 10 tables)
Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation
Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation could improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while keeping encoding and decoding costs low. Our code is available at https://github.com/j-cyoung/GSDatasetDistillation.
comment: 19 pages; Code is available on https://github.com/j-cyoung/GSDatasetDistillation
MaxSup: Overcoming Representation Collapse in Label Smoothing NeurIPS 2025
Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
comment: NeurIPS 2025 Oral (0.36% acceptance); code: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
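The contrast drawn in the MaxSup abstract above (penalize the top-1 logit rather than the ground-truth logit) can be made concrete with a small sketch. It relies on the standard identity that cross-entropy against a smoothed target equals hard-label cross-entropy plus `alpha * (z_gt - mean(z))`; the MaxSup variant shown here simply swaps `z_gt` for `max(z)`, which is a reading of the abstract and not a copy of the released code. Function names and `alpha=0.1` are assumptions.

```python
import math


def cross_entropy(logits, target):
    """Hard-label cross-entropy, computed with a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


def label_smoothing_loss(logits, target, alpha=0.1):
    """Label smoothing written as CE plus a penalty on the ground-truth logit."""
    mean_logit = sum(logits) / len(logits)
    return cross_entropy(logits, target) + alpha * (logits[target] - mean_logit)


def maxsup_loss(logits, target, alpha=0.1):
    """Assumed MaxSup form: same penalty, applied to the top-1 logit."""
    mean_logit = sum(logits) / len(logits)
    return cross_entropy(logits, target) + alpha * (max(logits) - mean_logit)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    # Correct prediction (target is the argmax): both losses coincide.
    print(label_smoothing_loss(logits, 0), maxsup_loss(logits, 0))
    # Misclassification: only MaxSup still penalizes the confident wrong logit.
    print(label_smoothing_loss(logits, 1), maxsup_loss(logits, 1))
```

When the prediction is correct the two losses are identical; they differ only on misclassified samples, which is where the abstract locates the problem with label smoothing.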
Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar WACV 2026
Real-time imaging sonar is crucial for underwater monitoring where optical sensing fails, but its use is limited by low uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) affecting up to 98% of frames. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions. SCOPE combines (i) Adaptive Codebook Compression (ACC), which learns frequency-encoded latent representations tailored to sonar, with (ii) Frequency-Aware Multiscale Segmentation (FAMS), which decomposes frames into low-frequency structure and sparse high-frequency dynamics while suppressing rapidly fluctuating artifacts. A hedging training strategy further guides frequency-aware learning using low-pass proxy pairs generated without labels. Evaluated on months of in-situ ARIS sonar data, SCOPE achieves a structural similarity index (SSIM) of 0.77, representing a 40% improvement over prior self-supervised denoising baselines, at bitrates down to <= 0.0118 bpp. It reduces uplink bandwidth by more than 80% while improving downstream detection. The system runs in real time, with 3.1 ms encoding on an embedded GPU and 97 ms full multi-layer decoding on the server end. SCOPE has been deployed for months in three Pacific Northwest rivers to support real-time salmon enumeration and environmental monitoring in the wild. Results demonstrate that learning frequency-structured latents enables practical, low-bitrate sonar streaming with preserved signal details under real-world deployment conditions.
comment: Accepted to WACV 2026
Image and Video Processing
Maintaining SUV Accuracy in Low-Count PET with PETfectior: A Deep Learning Denoising Solution
Background: Diagnostic PET image quality depends on the administered activity and acquisition time. However, minimizing these variables is desirable to reduce patient radiation exposure and radiopharmaceutical costs. PETfectior is an artificial intelligence-based software that processes PET scans and increases signal-to-noise ratio, obtaining high-quality images from low-count-rate images. We perform an initial clinical validation of PETfectior on images acquired with half of the counting statistics required to meet the most recent EANM quantitative standards for 18F-FDG PET, evaluating lesion detectability, quantitative performance, and image quality. Materials and methods: 258 patients referred for 18F-FDG PET/CT were prospectively included. The standard-of-care scans (100% scans) were acquired and reconstructed according to EARL standards 2. Half-counting-statistics versions were generated from list-mode data and processed with PETfectior (50%+PETfectior scans). All oncologic lesions were segmented on both PET/CT versions, manually or automatically, and lesion detectability was evaluated. The SUVmax of the lesions was measured and the quantitative concordance of 50%+PETfectior and 100% images was evaluated. Subjective image quality was visually assessed by two experienced physicians. Results: 1649 lesions were detected in a total of 198 studies. The 50%+PETfectior images showed high sensitivity for lesion detection (99.9%) and only 1 false positive was detected. The SUVmax measured in 100% and 50%+PETfectior images agreed within 12.5% (95% limits of agreement), with a bias of -1.01%. Image quality of the 50%+PETfectior images was rated equal to or better than the standard-of-care images. Conclusion: PETfectior can safely be used in clinical practice at half counting statistics, with high sensitivity and specificity, low quantitative bias and high subjective image quality.
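For readers unfamiliar with the "bias" and "95% limits of agreement" quoted above, the following is a minimal Bland-Altman-style sketch. It assumes percent differences are taken relative to the 100% scan and that the limits are bias ± 1.96 standard deviations; the abstract does not state which convention the authors used, and the function names are hypothetical.

```python
import statistics

def percent_differences(reference, test):
    """Percent difference of each test value relative to its reference value."""
    return [100.0 * (t - r) / r for r, t in zip(reference, test)]

def bland_altman(diffs):
    """Return (bias, lower limit, upper limit) with limits at bias +/- 1.96 SD."""
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Usage would look like `bland_altman(percent_differences(suv_full, suv_half))`, where `suv_full` and `suv_half` are paired per-lesion SUVmax lists from the two image versions.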
Towards Language-Independent Face-Voice Association with Multimodal Foundation Models ICASSP 2026
This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from scratch using contrastive and orthogonal projection losses, and a foundation model approach leveraging ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
comment: This paper presents the system description of the UZH-CL team for the FAME2026 Challenge at ICASSP 2026. Our model achieved second place in the final ranking
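The equal error rate (EER) reported above is the operating point at which the false-acceptance and false-rejection rates are equal. A minimal sketch of one common way to estimate it from verification scores follows; it assumes higher scores mean "same identity" and picks the threshold where the two rates are closest, which may differ from the interpolation used by the challenge's own scoring script.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from scores of matching (genuine) and non-matching (impostor) pairs."""
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, best_eer = None, None
    for t in thresholds:
        # A pair is accepted when its score is at least the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)  # false acceptance rate
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejection rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```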
Kaleidoscopic Scintillation Event Imaging
Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state-of-the-art methods of measuring high-energy particles and radiation sources. Most existing methods use fast single-pixel detectors to detect and time scintillation events. Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle. Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events. This allows us to use machine vision techniques to analyze events, enabling new types of detectors. The main challenge is the very low brightness of the events. Techniques have to work with a very limited number of photons. We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event's spatial information. The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera. We introduce theory for imaging an event in a kaleidoscopic scintillator and an algorithm to estimate the event's 3D position. We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera. Code and data are available at https://github.com/bocchs/kaleidoscopic_scintillator.
Quality assurance of the Federal Interagency Traumatic Brain Injury Research (FITBIR) MRI database to enable integrated multi-site analysis
The Federal Interagency Traumatic Brain Injury Research (FITBIR) database is a centralized data repository for traumatic brain injury (TBI) research. It includes over 45,529 magnetic resonance images (MRI) from 6,211 subjects (9,229 imaging sessions) across 26 studies with heterogeneous organization formats, contrasts, acquisition parameters, and demographics. In this work, we organized all available structural and diffusion MRI from FITBIR along with relevant demographic information into the Brain Imaging Data Structure. We analyzed whole-brain mean fractional anisotropy, mean diffusivity, total intracranial volume, and the volumes of 132 regions of interest using UNesT segmentations. There were 4,868 subjects (7,035 sessions) with structural MRI and 2,666 subjects (3,763 sessions) with diffusion MRI following quality assurance and harmonization. We modeled profiles for these metrics across ages with generalized additive models for location, scale, and shape (GAMLSS) and found significant differences in subjects with TBI compared to controls in volumes of 54 regions of the brain (q<0.05, likelihood ratio test with false discovery rate correction).
comment: 4 pages, 4 figures. This work has been submitted to the IEEE for possible publication
Ultra-Strong Gradient Diffusion MRI with Self-Supervised Learning for Prostate Cancer Characterization
Diffusion MRI (dMRI) enables non-invasive assessment of prostate microstructure but conventional metrics such as the Apparent Diffusion Coefficient in multiparametric MRI lack specificity to underlying histology. Integrating dMRI with the compartment-based biophysical VERDICT (Vascular, Extracellular, and Restricted Diffusion for Cytometry in Tumours) framework offers richer microstructural insights, though clinical gradient systems (40-80 mT/m) suffer from poor signal-to-noise ratio (SNR) at stronger diffusion weightings due to prolonged echo times. Ultra-strong gradients (up to 300 mT/m) can mitigate these limitations by improving SNR and contrast-to-noise ratios (CNR) but their adoption has until recently been limited to research environments due to challenges with peripheral nerve stimulation thresholds and gradient non-uniformity. This study investigates whether physics-informed self-supervised VERDICT (ssVERDICT) fitting applied to ultra-strong gradients enhances prostate cancer characterization relative to current clinical acquisitions. We developed enhanced ssVERDICT fitting approaches using dense multilayer perceptron (Dense MLP) and convolutional U-Net architectures, benchmarking them against non-linear least-squares (NLLS) fitting and Diffusion Kurtosis Imaging across clinical- to ultra-strong gradient systems. Dense ssVERDICT at ultra-strong gradient notably outperformed NLLS VERDICT, boosting median CNR by 47%, cutting inter-patient Coefficient of Variation by 52%, and reducing pooled f_ic variation by 50%. Overall, it delivered the highest CNR, the most stable parameter estimates, and the clearest tumour-normal contrast compared with conventional methods and clinical gradient systems. These findings highlight the potential of advanced gradient systems and deep learning-based modelling to improve non-invasive prostate cancer characterization and reduce unnecessary biopsies.
comment: 24 pages, 17 figures, 7 tables
Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction
This study investigates the feasibility of zero-shot self-supervised learning reconstruction for reducing breath-hold times in magnetic resonance cholangiopancreatography (MRCP). Breath-hold MRCP was acquired from 11 healthy volunteers on 3T scanners using an incoherent k-space sampling pattern, leading to a 14-second acquisition time and an acceleration factor of R=25. Zero-shot reconstruction was compared with parallel imaging of respiratory-triggered MRCP (338s, R=3) and compressed sensing reconstruction. For two volunteers, breath-hold scans (40s, R=6) were additionally acquired and retrospectively undersampled to R=25 to compute peak signal-to-noise ratio (PSNR). To address long zero-shot training time, the n+m full stages of the zero-shot learning were divided into two parts to reduce backpropagation depth during training: 1) n frozen stages initialized from an n-stage pretrained network and 2) m trainable stages initialized either randomly or from an m-stage pretrained network. The efficiency of our approach was assessed by varying initialization strategies and the number of trainable stages using the retrospectively undersampled data. Zero-shot reconstruction significantly improved visual image quality over compressed sensing, particularly in SNR and ductal delineation, and achieved image quality comparable to that of successful respiratory-triggered acquisitions with regular breathing patterns. Improved initializations enhanced PSNR and reduced reconstruction time. Adjusting frozen/trainable configurations demonstrated that PSNR decreased only slightly from 38.25 dB (0/13) to 37.67 dB (12/1), while training time decreased up to 6.7-fold. Zero-shot learning delivers high-fidelity MRCP reconstructions with reduced breath-hold times, and the proposed partially trainable approach offers a practical solution for translation into time-constrained clinical workflows.
comment: 24 pages, 8 figures, 2 tables
MRI Super-Resolution with Deep Learning: A Comprehensive Survey
High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.
comment: 41 pages
Cross-Cancer Knowledge Transfer in WSI-based Prognosis Prediction
Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm in which each cancer corresponds to a single model. However, this paradigm naturally struggles to scale to rare tumors and cannot leverage knowledge from other cancers. While multi-task learning frameworks have been explored recently, they often place high demands on computational resources and require extensive training on ultra-large, multi-cancer WSI datasets. To this end, this paper shifts the paradigm to knowledge transfer and presents the first preliminary yet systematic study on cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It comprises three major parts. (1) We curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors). (2) Beyond a simple evaluation merely for benchmarking, we design a range of experiments to gain deeper insights into the underlying mechanism behind transferability. (3) We further show the utility of cross-cancer knowledge transfer, by proposing a routing-based baseline approach (ROUPKT) that could often efficiently utilize the knowledge transferred from off-the-shelf models of other cancers. CROPKT could serve as an inception that lays the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.
comment: 24 pages (11 figures and 10 tables)
ContourDiff: Unpaired Medical Image Translation with Structural Consistency
Accurately translating medical images between different modalities, such as Computed Tomography (CT) to Magnetic Resonance Imaging (MRI), has numerous downstream clinical and machine learning applications. While several methods have been proposed to achieve this, they often prioritize perceptual quality with respect to output domain features over preserving anatomical fidelity. However, maintaining anatomy during translation is essential for many tasks, e.g., when leveraging masks from the input domain to develop a segmentation model with images translated to the output domain. To address these challenges, we propose ContourDiff with Spatially Coherent Guided Diffusion (SCGD), a novel framework that leverages domain-invariant anatomical contour representations of images. These representations are simple to extract from images, yet form precise spatial constraints on their anatomical content. We introduce a diffusion model that converts contour representations of images from arbitrary input domains into images in the output domain of interest. By applying the contour as a constraint at every diffusion sampling step, we ensure the preservation of anatomical content. We evaluate our method on challenging lumbar spine and hip-and-thigh CT-to-MRI translation tasks, via (1) the performance of segmentation models trained on translated images applied to real MRIs, and (2) the foreground FID and KID of translated images with respect to real MRIs. Our method outperforms other unpaired image translation methods by a significant margin across almost all metrics and scenarios. Moreover, it achieves this without the need to access any input domain information during training and we further verify its zero-shot capability, showing that a model trained on one anatomical region can be directly applied to unseen regions without retraining (GitHub: https://github.com/mazurowski-lab/ContourDiff).
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:031
MROP: Modulated Rank-One Projections for compressive radio interferometric imaging
The emerging generation of radio-interferometric (RI) arrays are set to form images of the sky with a new regime of sensitivity and resolution. This implies a significant increase in visibility data volumes, which for single-frequency observations will scale as $\mathcal{O}(Q^2B)$ for $Q$ antennas and $B$ short-time integration intervals (or batches), calling for efficient data dimensionality reduction techniques. This paper proposes a new approach to data compression during acquisition, coined modulated rank-one projection (MROP). MROP compresses the $Q\times Q$ batchwise covariance matrix into a smaller number $P$ of random rank-one projections and compresses across time by trading $B$ for a smaller number $M$ of random modulations of the ROP measurement vectors. Firstly, we introduce a dual perspective on the MROP acquisition, which can either be understood as random beamforming, or as a post-correlation compression. Secondly, we analyse the noise statistics of MROPs and demonstrate that the random projections induce a uniform noise level across measurements independently of the visibility-weighting scheme used. Thirdly, we propose a detailed analysis of the memory and computational cost requirements across the data acquisition and image reconstruction stages, with comparison to state-of-the-art dimensionality reduction approaches. Finally, the MROP model is validated for monochromatic intensity imaging both in simulation and from real data, with comparison to the classical and baseline-dependent averaging (BDA) models, and using the uSARA optimisation algorithm for image formation. Our results suggest that the data size necessary to preserve imaging quality using MROPs is reduced to the order of image size, well below the original and BDA data sizes.
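The two compression steps and the "dual perspective" described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the projection vectors are drawn as complex Gaussians, the modulations are taken to be random ±1 signs applied across batches, and every function name is hypothetical. The sketch checks that a rank-one projection of a sample covariance matrix, $\alpha^H C \beta$, equals the time-averaged correlation of two beamformed signals, $\frac{1}{T}\sum_t (\alpha^H x_t)\,\overline{(\beta^H x_t)}$.

```python
import random

def rand_complex_vec(n, rng):
    """Draw an n-vector with i.i.d. complex Gaussian entries."""
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

def inner(a, x):
    """Hermitian inner product a^H x."""
    return sum(ai.conjugate() * xi for ai, xi in zip(a, x))

def sample_covariance(snapshots):
    """Q x Q sample covariance C = (1/T) * sum_t x_t x_t^H of antenna snapshots."""
    q, t = len(snapshots[0]), len(snapshots)
    return [[sum(x[i] * x[j].conjugate() for x in snapshots) / t for j in range(q)]
            for i in range(q)]

def rop(cov, a, b):
    """Post-correlation view: the rank-one projection a^H C b."""
    q = len(cov)
    return sum(a[i].conjugate() * cov[i][j] * b[j] for i in range(q) for j in range(q))

def rop_by_beamforming(snapshots, a, b):
    """Beamforming view: correlate the two beamformed streams a^H x_t and b^H x_t."""
    return sum(inner(a, x) * inner(b, x).conjugate() for x in snapshots) / len(snapshots)

def mrop_compress(batch_covs, alphas, betas, signs):
    """Return an M x P array z[m][p] = sum_b signs[m][b] * alpha_p^H C_b beta_p."""
    y = [[rop(c, a, b) for a, b in zip(alphas, betas)] for c in batch_covs]  # B x P
    n_batches, n_proj = len(batch_covs), len(alphas)
    return [[sum(s[b] * y[b][p] for b in range(n_batches)) for p in range(n_proj)]
            for s in signs]

# Toy sizes (illustrative only): Q antennas, T snapshots per batch, B batches.
rng = random.Random(0)
Q, T, B, P, M = 4, 8, 6, 3, 2
batches = [[rand_complex_vec(Q, rng) for _ in range(T)] for _ in range(B)]
covs = [sample_covariance(s) for s in batches]
alphas = [rand_complex_vec(Q, rng) for _ in range(P)]
betas = [rand_complex_vec(Q, rng) for _ in range(P)]
signs = [[rng.choice((-1, 1)) for _ in range(B)] for _ in range(M)]
z = mrop_compress(covs, alphas, betas, signs)
```

In this toy setting the B × Q × Q = 96 covariance entries are reduced to M × P = 6 measurements; the sizes are arbitrary and chosen only to make the reduction visible.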
Multimedia
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
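One simple way to picture the multi-resolution semantic fusion described above: compute a crop-to-query similarity map on several crop grids, bring them to a common grid, and combine them. The sketch below uses nearest-neighbour upsampling and a weighted average; both choices are assumptions made for illustration, since the abstract does not specify the fusion rule, and the function names are hypothetical.

```python
def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2D list to out_h x out_w."""
    h, w = len(grid), len(grid[0])
    return [[grid[r * h // out_h][c * w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def fuse_similarity_maps(maps, weights=None):
    """Weighted average of similarity maps after resizing all to the finest grid."""
    out_h = max(len(m) for m in maps)
    out_w = max(len(m[0]) for m in maps)
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)  # equal weighting by default
    resized = [upsample_nearest(m, out_h, out_w) for m in maps]
    return [[sum(w * u[r][c] for w, u in zip(weights, resized)) for c in range(out_w)]
            for r in range(out_h)]
```

An object split across several fine crops may score poorly in each of them while the coarse crop that contains it whole scores well; averaging lets the coarse evidence lift the fine cells that cover the object.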
HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval ACM MM 2025
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.
comment: Accepted by ACM MM 2025
Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training
Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model, which achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.
Stepwise Schema-Guided Prompting Framework with Parameter Efficient Instruction Tuning for Multimedia Event Extraction
Multimedia Event Extraction (MEE) has become an important task in information extraction research as news today increasingly contains multimedia content. Current MEE works mainly face two challenges: (1) inadequate extraction framework modeling for handling complex and flexible multimedia event structures; (2) the absence of multimodal-aligned training data for effective knowledge transfer to the MEE task. In this work, we propose a Stepwise Schema-Guided Prompting Framework (SSGPF) that uses a Multimodal Large Language Model (MLLM) as the backbone for adaptive structure capturing to solve the MEE task. At the initial step of SSGPF, we design Event Type Schema Guided Prompting (ETSGP) for event detection, then we devise Argument Role Schema Guided Prompting (ARSGP), which contains multi-step prompts with a text-bridged grounding technique for argument extraction. We construct a weakly-aligned multimodal event labeled dataset based on existing unimodal event annotations, then conduct parameter-efficient instruction tuning with LoRA on LLaVA-v1.5-7B under SSGPF. Experiments on the M2E2 benchmark demonstrate that SSGPF significantly outperforms current SOTA baselines by 5.8 percent F1 on event detection and 8.4 percent F1 on argument extraction.
comment: Accepted by 2025 IEEE International Conference on Multimedia and Expo
PopSim: Social Network Simulation for Social Media Popularity Prediction
Accurately predicting the popularity of user-generated content (UGC) is essential for advancing social media analytics and recommendation systems. Existing approaches typically follow an inductive paradigm, where researchers train static models on historical data for popularity prediction. However, UGC propagation is inherently a dynamic process, and static modeling based on historical features fails to capture the complex interactions and nonlinear evolution. In this paper, we propose PopSim, a novel simulation-based paradigm for social media popularity prediction (SMPP). Unlike the inductive paradigm, PopSim leverages a large language model (LLM)-based multi-agent social network sandbox to simulate UGC propagation dynamics for popularity prediction. Specifically, to effectively model the UGC propagation process in the network, we design a social-mean-field-based agent interaction mechanism, which models the dual-channel and bidirectional individual-population interactions, enhancing agents' global perception and decision-making capabilities. In addition, we propose a multi-source information aggregation module that transforms heterogeneous social metadata into a uniform formulation for LLMs. Finally, propagation dynamics with multimodal information are fused to provide comprehensive popularity prediction. Extensive experiments on real-world datasets demonstrate that PopSim consistently outperforms the state-of-the-art methods, reducing prediction error by an average of 8.82%, offering a new perspective for research on the SMPP task.
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
Recent progress in Multimodal Large Language Models (MLLMs) has significantly enhanced the ability of artificial intelligence systems to understand and generate multimodal content. However, these models often exhibit limited effectiveness when applied to non-Western cultural contexts, which raises concerns about their wider applicability. To address this limitation, we propose the Traditional Chinese Culture understanding Benchmark (TCC-Bench), a bilingual (i.e., Chinese and English) Visual Question Answering (VQA) benchmark specifically designed for assessing the understanding of traditional Chinese culture by MLLMs. TCC-Bench comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts. We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage. The benchmark also avoids language bias by preventing direct disclosure of cultural concepts within question texts. Experimental evaluations across a wide range of MLLMs demonstrate that current models still face significant challenges when reasoning about culturally grounded visual content. The results highlight the need for further research in developing culturally inclusive and context-aware multimodal systems. The code and data can be found at: https://tcc-bench.github.io/.
comment: There are issues with the paper
GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
GRAFT is a structured multimodal benchmark designed to probe how well LLMs handle instruction following, visual reasoning, and tasks requiring tight visual-textual alignment. The dataset is built around programmatically generated charts and synthetically rendered tables, each paired with a carefully constructed, multi-step analytical question that depends solely on what can be inferred from the image itself. Responses are formatted in structured outputs such as JSON or YAML, enabling consistent and fine-grained evaluation of both reasoning processes and adherence to output specifications. The benchmark further introduces a taxonomy of reasoning operations, ranging from comparison and trend identification to ranking, aggregation, proportional estimation, and anomaly detection, to support a comprehensive assessment of model capabilities. Taken together, GRAFT provides a unified and scalable framework for evaluating multimodal LLMs on visually grounded, structured reasoning tasks, offering a more rigorous standard for future benchmarking efforts.
comment: 25 pages, 10 tables, 3 figures
Computation and Language
The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models
The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers, (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling, within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions ($r_{ET} = -0.81$, $p < 0.001$) and a near-zero association with response latency ($r_{EL} \approx 0$). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.
LORE: A Large Generative Model for Search Relevance
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic
Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations (incorrect outputs generated by LLMs) are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that takes English sentences as input, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.
comment: IEEE ISNCC 2025
Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities
While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.
AutoNeural: Co-Designing Vision-Language Models for NPU Inference
While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x decoding speed and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning NeurIPS 2025
Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
comment: NeurIPS 2025
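The regression idea in the Martingale Score abstract above can be illustrated with a minimal sketch. It assumes the score is the ordinary least-squares slope of the belief update (posterior minus prior) regressed on the prior belief; the paper's exact definition may differ, and the function name `martingale_slope` is hypothetical. Under the Martingale property the slope should be near zero, while a positive slope corresponds to what the abstract calls belief entrenchment.

```python
from statistics import mean


def martingale_slope(prior_beliefs, posterior_beliefs):
    """OLS slope of (posterior - prior) regressed on prior.

    Assumption: this slope stands in for the paper's Martingale Score.
    A value near 0 is consistent with the Martingale property;
    a positive value means the current belief predicts the update.
    """
    if len(prior_beliefs) != len(posterior_beliefs) or len(prior_beliefs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    updates = [q - p for p, q in zip(prior_beliefs, posterior_beliefs)]
    mean_prior = mean(prior_beliefs)
    mean_update = mean(updates)
    variance = sum((p - mean_prior) ** 2 for p in prior_beliefs)
    if variance == 0:
        raise ValueError("prior beliefs must not all be identical")
    covariance = sum(
        (p - mean_prior) * (u - mean_update)
        for p, u in zip(prior_beliefs, updates)
    )
    return covariance / variance


# Hypothetical data: each belief moves further away from 0.5 after reasoning.
priors = [0.2, 0.4, 0.6, 0.8]
posteriors = [0.05, 0.35, 0.65, 0.95]
print(martingale_slope(priors, posteriors))
```

The toy data are constructed so that high beliefs rise and low beliefs fall, which produces a positive slope.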
Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves $3.8$-$4.0\times$ speedups while retaining $99.8$-$100\%$ of the baseline score on average. On base models, SchED yields consistent speedup gains with $99.1$-$100\%$ performance retention, with up to $2.34\times$ under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, $\gamma = 4$), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
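The early-exit rule described in the SchED abstract above can be sketched roughly as follows. Everything specific here is an assumption for illustration: the margin is taken as top-1 minus top-2 logit per position, the aggregate is a plain mean, and the threshold follows a cosine decay between hypothetical bounds `high` and `low`. The abstract does not specify the actual aggregation or schedule.

```python
import math


def mean_logit_margin(logits):
    """Mean (top-1 minus top-2) logit margin over all positions in the span."""
    margins = []
    for row in logits:
        best, second = sorted(row, reverse=True)[:2]
        margins.append(best - second)
    return sum(margins) / len(margins)


def confidence_threshold(progress, high=4.0, low=1.0):
    """Smooth threshold: `high` at progress 0, decaying to `low` at progress 1.

    The cosine shape and the bounds are illustrative assumptions.
    """
    return low + 0.5 * (high - low) * (1.0 + math.cos(math.pi * progress))


def should_exit(logits, step, total_steps):
    """Halt decoding once the aggregated margin meets the current threshold."""
    progress = step / total_steps
    return mean_logit_margin(logits) >= confidence_threshold(progress)


# Hypothetical logits for a two-token span over a three-token vocabulary.
span_logits = [[5.0, 1.0, 0.0], [4.0, 2.0, 0.0]]
print(should_exit(span_logits, step=0, total_steps=10))
print(should_exit(span_logits, step=5, total_steps=10))
```

A decaying threshold demands high confidence early in decoding and accepts lower confidence as the denoising process nears completion.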
OptPO: Optimal Rollout Allocation for Test-time Policy Optimization
Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
comment: Work in Progress
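The adaptive stopping idea in the OptPO abstract above can be sketched as a simple sequential test. This is not the authors' formulation: it assumes a two-hypothesis test between the leading answer and the runner-up, equal prior odds, and a hypothetical per-rollout accuracy `p_correct`. The names `adaptive_vote` and `sample_answer` are illustrative.

```python
import math
from collections import Counter


def adaptive_vote(sample_answer, max_rollouts=32, p_correct=0.7, threshold=0.95):
    """Sample rollouts until the leading answer is confident enough.

    Assumption: each rollout favors the true answer over the runner-up
    with probability `p_correct`, so the log-likelihood ratio grows
    linearly with the vote margin between the top two answers.
    """
    if max_rollouts < 1:
        raise ValueError("max_rollouts must be at least 1")
    counts = Counter()
    rollouts = []
    log_odds_per_vote = math.log(p_correct / (1.0 - p_correct))
    for _ in range(max_rollouts):
        answer = sample_answer()
        rollouts.append(answer)
        counts[answer] += 1
        ranked = counts.most_common(2)
        lead_votes = ranked[0][1]
        runner_votes = ranked[1][1] if len(ranked) > 1 else 0
        llr = (lead_votes - runner_votes) * log_odds_per_vote
        posterior = 1.0 / (1.0 + math.exp(-llr))
        if posterior >= threshold:
            break  # confident enough: stop spending inference budget
    return ranked[0][0], posterior, rollouts


# Hypothetical rollout source that always returns the same answer.
answers = iter(["42"] * 10)
consensus, confidence, used = adaptive_vote(lambda: next(answers))
print(consensus, len(used))
```

The returned `rollouts` list corresponds to the retained samples that the abstract says are reused for on-policy updates; that update step is not shown here.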
Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning
Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
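The logit-averaging step in the ThinkMerge abstract above can be illustrated with a small sketch. It assumes plain arithmetic averaging over K traces followed by greedy selection; synchronization points, sampling strategies such as Top-p/Top-k, and integration with serving frameworks are not modeled. The function name `merge_next_token` is hypothetical.

```python
def merge_next_token(logits_per_trace):
    """Average next-token logits across traces and pick the best token greedily.

    `logits_per_trace` holds one list of vocabulary logits per reasoning trace.
    """
    if not logits_per_trace:
        raise ValueError("need at least one trace")
    num_traces = len(logits_per_trace)
    vocab_size = len(logits_per_trace[0])
    if any(len(trace) != vocab_size for trace in logits_per_trace):
        raise ValueError("all traces must share the same vocabulary size")
    averaged = [
        sum(trace[i] for trace in logits_per_trace) / num_traces
        for i in range(vocab_size)
    ]
    token = max(range(vocab_size), key=averaged.__getitem__)
    return token, averaged


# Hypothetical logits: the three traces each prefer a different token.
traces = [
    [2.0, 1.0, 0.0],
    [0.0, 1.5, 0.0],
    [0.0, 1.5, 3.0],
]
token, averaged = merge_next_token(traces)
print(token)
```

In this toy case the per-trace winners are tokens 0, 1, and 2, so a vote has no majority, while the averaged logits still yield a single choice.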
Bangla Hate Speech Classification with Fine-tuned Transformer Models
Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people in Bangladesh and India (West Bengal). Despite the growing need for automated moderation on social media platforms, Bangla is significantly under-represented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official baselines (e.g., Majority, Random, Support Vector Machine) and additionally evaluate Logistic Regression, Random Forest, and Decision Tree as baseline methods. We also use transformer-based models such as DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa for hate speech classification. All the transformer-based models except DistilBERT outperformed the baseline methods on both subtasks. Among the transformer-based models, BanglaBERT produces the best performance for both subtasks. Despite being smaller in size, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests language-specific pre-training is very important. Our results highlight the potential and need for pre-trained language models for the low-resource Bangla language.
Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
promptolution: A Unified, Modular Framework for Prompt Optimization
Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers show its effectiveness, practical adoption is hindered as existing implementations are often tied to unmaintained and isolated research codebases. To address this, we introduce promptolution, a unified and modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers while remaining agnostic to the underlying LLM implementation.
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .
BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline . All released code and models are licensed under the MIT License.
comment: Under review
A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models
The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM's "Syndrome Differentiation and Treatment" (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks, including TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates three mechanisms, namely selected-response evaluation, judge model evaluation, and reward model evaluation. The effectiveness of TCM-BEST4SDT was corroborated through experiments on 15 mainstream LLMs, spanning both general and TCM domains. To foster the development of intelligent TCM research, TCM-BEST4SDT is now publicly available.
SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
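The "ratio of total variance to dominant-direction variance" in the SR-GRPO abstract above matches the standard definition of stable rank, the squared Frobenius norm divided by the squared spectral norm. The sketch below computes it with power iteration in pure Python. Whether the paper centers or normalizes hidden states first is not stated in the abstract, so none is applied here, and the function name is illustrative.

```python
import math
import random


def stable_rank(hidden_states, iters=200, seed=0):
    """Stable rank ||H||_F^2 / ||H||_2^2 of a (tokens x dims) matrix.

    The largest singular value is estimated by power iteration on H^T H.
    The estimate approaches it from below, so the result can be slightly
    too high when the top two singular values are nearly equal.
    """
    num_rows = len(hidden_states)
    num_dims = len(hidden_states[0])
    frobenius_sq = sum(x * x for row in hidden_states for x in row)
    if frobenius_sq == 0.0:
        raise ValueError("hidden states are all zero")

    def apply(vec):
        # Computes H @ vec.
        return [sum(row[j] * vec[j] for j in range(num_dims)) for row in hidden_states]

    # Random start so it is almost surely not orthogonal to the top direction.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(num_dims)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        hv = apply(v)
        w = [sum(hidden_states[i][j] * hv[i] for i in range(num_rows)) for j in range(num_dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    spectral_sq = sum(x * x for x in apply(v))
    if spectral_sq == 0.0:
        raise ValueError("power iteration failed to find a dominant direction")
    return frobenius_sq / spectral_sq


# Hypothetical hidden states: one spread evenly, one collapsed to a single direction.
print(stable_rank([[1.0, 0.0], [0.0, 1.0]]))
print(stable_rank([[1.0, 2.0], [2.0, 4.0]]))
```

A matrix whose rows all lie along one direction has stable rank 1, while energy spread evenly over d orthogonal directions gives a stable rank of d.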
TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages
Low-resource African languages remain underrepresented in sentiment analysis, limiting both lexical coverage and the performance of multilingual Natural Language Processing (NLP) systems. This study proposes TriLex, a three-stage retrieval augmented framework that unifies corpus-based extraction, cross lingual mapping, and retrieval augmented generation (RAG) driven lexical refinement to systematically expand sentiment lexicons for low-resource languages. Using the enriched lexicon, the performance of two prominent African pretrained language models (AfroXLMR and AfriBERTa) is evaluated across multiple case studies. Results demonstrate that AfroXLMR delivers superior performance, achieving F1-scores above 80% for isiXhosa and isiZulu and exhibiting strong cross-lingual stability. Although AfriBERTa lacks pre-training on these target languages, it still achieves reliable F1-scores around 64%, validating its utility in computationally constrained settings. Both models outperform traditional machine learning baselines, and ensemble analyses further enhance precision and robustness. The findings establish TriLex as a scalable and effective framework for multilingual sentiment lexicon expansion and sentiment modeling in low-resource South African languages.
Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.
Towards Unification of Hallucination Detection and Fact Verification for Large Language Models
Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, using distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs. We have open-sourced all the code, data, and baseline implementation at: https://github.com/oneal2000/UniFact/
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning of LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory
SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
Probabilistic energy profiler for statically typed JVM-based programming languages
Energy consumption is a growing concern in several fields, from mobile devices to large data centers. Developers need detailed data on the energy consumption of their software to mitigate consumption issues. Previous approaches have a broader focus, such as on specific functions or programs, rather than source code statements. They primarily focus on estimating the CPU's energy consumption using point estimates, thereby disregarding other hardware effects and limiting their use for statistical reasoning and explainability. We developed a novel methodology to address the limitations of measuring only the CPU's consumption and using point estimates, focusing on predicting the energy usage of statically typed JVM-based programming languages, such as Java and Scala. We measure the energy consumption of Bytecode patterns, that is, the translations of the programming language's source code statements into their Java Bytecode representation. With the energy measurements, we construct a statistical model using Bayesian statistics, which allows us to predict the energy consumption through statistical distributions and analyze individual factors. The model includes three factors we obtain statically from the code (data size, data type, and operation) and one factor describing the hardware platform the code executes on (device). To validate our methodology, we implemented it for Java and evaluated its energy predictions on unseen programs. We observe that all four factors are influential, notably that two devices of the same model may differ in energy consumption and that the operations and data types cause consumption differences. The experiments also show that the energy prediction of programs closely follows the program's real energy consumption, validating our approach. Our work presents a methodology for constructing an energy model that future work, such as verification tools, can use for their energy estimates.
Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs
Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
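For reference, the textbook notion of optimal cue combination that the BayesBench abstract above alludes to can be sketched as inverse-variance weighting of two independent Gaussian cues. This is the standard psychophysics formula, not code from the benchmark; the function name and the example numbers are illustrative.

```python
def combine_cues(mean_a, var_a, mean_b, var_b):
    """Optimal fusion of two independent Gaussian cues.

    Each cue is weighted by its precision (1 / variance); the fused
    variance is never larger than the smaller input variance.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    precision_a = 1.0 / var_a
    precision_b = 1.0 / var_b
    total_precision = precision_a + precision_b
    fused_mean = (precision_a * mean_a + precision_b * mean_b) / total_precision
    fused_var = 1.0 / total_precision
    return fused_mean, fused_var


# Hypothetical length estimates: a reliable text cue and a noisier image cue.
fused_mean, fused_var = combine_cues(10.0, 1.0, 14.0, 3.0)
print(fused_mean, fused_var)
```

An observer that integrates cues in this way shifts its estimate toward the more reliable cue, which is the kind of behavioural shift a consistency score can test for even when accuracy saturates.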
CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer
Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving the significant portion of the world's population who communicate in low-resource languages underrepresented. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.
comment: 8 Pages, 5 Figures, Under Review
An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation ACL
Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored "editing" LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applied to 13 open-weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
comment: Accepted in PACLIC 2025
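For readers unfamiliar with SLERP, one of the merging algorithms surveyed in the entry above, the sketch below shows spherical linear interpolation between two flattened weight vectors. It is a generic illustration of the formula, not the configuration used in the paper; the function name `slerp`, the `eps` threshold, and the fallback to linear interpolation for near-parallel vectors are assumptions, and both inputs are assumed to be non-zero.

```python
import math


def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero weight vectors.

    Illustrative sketch: t=0 returns w0, t=1 returns w1.
    """
    norm0 = math.sqrt(sum(x * x for x in w0))
    norm1 = math.sqrt(sum(x * x for x in w1))
    # Angle between the two vectors, clamped to guard against rounding error.
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1.0 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike linear averaging, which would shrink the midpoint of these two vectors to norm of about 0.71, the spherical interpolant keeps unit norm. The interpolation weight `t` corresponds to the "interpolation weights" the abstract refers to.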
Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization
Large language models (LLMs) are now used in settings such as Google's AI Overviews, where they summarize multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. The findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where the steps involving LLMs can disproportionately influence downstream actions.
comment: 9 pages, 3 figures, 2 tables
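The design described in the entry above (each pro-neutral-con triplet permuted into six input orders) can be sketched with the standard library. This is only an illustration of the counterbalancing idea, not the authors' pipeline; the helper names `input_orders` and `position_of` and the placeholder article texts are assumptions.

```python
from itertools import permutations


def input_orders(triplet):
    """Return all orderings of a stance -> text mapping.

    Each ordering is a list of (stance, text) pairs in presentation order.
    """
    return [
        [(stance, triplet[stance]) for stance in order]
        for order in permutations(triplet)
    ]


def position_of(stance, order):
    """1-based position at which a stance appears in one ordering."""
    return [s for s, _ in order].index(stance) + 1


# Placeholder texts stand in for the real articles.
triplet = {"pro": "<pro article>", "neutral": "<neutral article>", "con": "<con article>"}
orders = input_orders(triplet)
```

With three documents there are 3! = 6 orderings, and every stance occupies each position exactly twice, so position effects can be separated from stance effects when the per-position scores are compared.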
Spoken Conversational Agents with Large Language Models EMNLP 2025
Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
comment: Accepted to EMNLP 2025 Tutorial
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks AAAI 2026
Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
comment: Accepted by AAAI 2026
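The two-phase idea in the entry above (positive-only advantages first, negative signals later) can be pictured with a small sketch. This is not the authors' implementation: the group-mean baseline, the fixed `switch_step` threshold, and the function name `curriculum_advantages` are all assumptions introduced for illustration, whereas the abstract describes the actual mechanism as adaptive.

```python
def curriculum_advantages(rewards, step, switch_step):
    """Toy curriculum over advantage signs.

    Assumed for illustration: the baseline is the mean reward of the
    sampled group, and the curriculum switches at a fixed training step.
    """
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    if step < switch_step:
        # Early phase: keep only positive signals (imitation-like updates).
        return [max(a, 0.0) for a in advantages]
    # Later phase: negative signals are included as well.
    return advantages


early = curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=10, switch_step=100)
late = curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=200, switch_step=100)
```

In the early phase the below-baseline samples contribute zero advantage, so they produce no gradient signal; after the switch they push probability mass away from poor responses.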
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce SIGIR 2025
Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and by neural models' reliance on scarce domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.
comment: Accepted by SIGIR 2025
What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints
We present an evaluation-driven study of two practical tasks regarding online misinformation: (i) fake-news detection and (ii) virality prediction in the context of operational settings, with the necessity for rapid reaction. Using the EVONS and FakeNewsNet datasets, we compare textual embeddings (RoBERTa; with a control using Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). We show that textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder than fake-news detection and is highly sensitive to label construction; in our setup, a median-based "viral" split (<50 likes) is pragmatic but underestimates real-world virality, and time-censoring for engagement features is desirable yet difficult under current API limits. Dimensionality-reduction analyses suggest non-linear structure is more informative for virality than for fake-news detection (t-SNE > PCA on numeric features). Swapping RoBERTa for Mistral embeddings yields only modest deltas, leaving conclusions unchanged. We discuss implications for evaluation design and report reproducibility constraints that realistically affect the field. We release splits and code where possible and provide guidance for metric selection.
A Concise Review of Hallucinations in LLMs and their Mitigation
Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes crucial to understand the various kinds of hallucinations that occur nowadays, their origins, and ways of reducing them. This document provides a concise and straightforward summary of that. It serves as a one-stop resource for a general understanding of hallucinations and how to mitigate them.
comment: 7 pages
Guided Self-Evolving LLMs with Minimal Human Supervision
AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
See, Think, Learn: A Self-Taught Multimodal Reasoner
Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.
comment: Winter Conference on Applications of Computer Vision 2026
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents AAAI 2026
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from ~5% to ~40% while Grok 4 Fast decreases from ~80% to ~10% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
comment: 12 pages, 11 figures. Accepted at AAAI 2026 TrustAgent Workshop
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
comment: Project page: https://worldmm.github.io
The brain-AI convergence: Predictive and generative world models for general-purpose computation
Recent advances in general-purpose AI systems with attention-based transformers offer a potential window into how the neocortex and cerebellum, despite their relatively uniform circuit architectures, give rise to diverse functions and, ultimately, to human intelligence. This Perspective provides a cross-domain comparison between the brain and AI that goes beyond the traditional focus on visual processing, adopting the emerging perspective of world-model-based computation. Here, we identify shared computational mechanisms in the attention-based neocortex and the non-attentional cerebellum: both predict future world events from past inputs and construct internal world models through prediction-error learning. These predictive world models are repurposed for seemingly distinct functions -- understanding in sensory processing and generation in motor processing -- enabling the brain to achieve multi-domain capabilities and human-like adaptive intelligence. Notably, attention-based AI has independently converged on a similar learning paradigm and world-model-based computation. We conclude that these shared mechanisms in both biological and artificial systems constitute a core computational foundation for realizing diverse functions including high-level intelligence, despite their relatively uniform circuit structures. Our theoretical insights bridge neuroscience and AI, advancing our understanding of the computational essence of intelligence.
comment: 22 pages, 4 figures. Related to our earlier preprint "The brain versus AI" (arXiv:2411.16075) but a distinct article. The earlier work surveyed broad brain-AI parallels; here we focus on world-model-based computation and convergent evolution between the brain and AI, especially large language models
TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models
With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.
comment: 11 pages
Process-Centric Analysis of Agentic Software Systems
Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution (i.e., trajectories) is inherently stochastic and adaptive to the problem they are solving. Evaluation of such systems is often outcome-centric, judging their performance based on success or failure at the final step. This narrow focus overlooks detailed insights about such systems, failing to explain how agents reason, plan, act, or change their strategies over time. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such software systems. Graphectory facilitates the design of process-centric metrics and analyses to assess the quality of agentic workflows independent of final success. Using Graphectory, we analyze 4000 trajectories of two dominant agentic programming workflows, namely SWE-agent and OpenHands, with a combination of four backbone Large Language Models (LLMs), attempting to resolve SWE-bench Verified issues. Our fully automated analyses reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation before patch submission; (2) agents' problem-solving strategies vary with both problem difficulty and the underlying LLM -- for resolved issues, the strategies often follow coherent localization-patching-validation steps, while unresolved ones exhibit chaotic, repetitive, or backtracking behaviors; (3) even when successful, agentic programming systems often display inefficient processes, leading to unnecessarily prolonged trajectories.
Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering
Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.
LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems
Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a unifying framework for integrating external tools into agent systems, enabling a thriving open ecosystem of community-built functionalities. However, the openness and composability that make MCP appealing also introduce a critical yet overlooked security assumption -- implicit trust in third-party tool providers. In this work, we identify and formalize a new class of attacks that exploit this trust boundary without violating explicit permissions. We term this new attack vector implicit toxicity, where malicious behaviors occur entirely within the allowed privilege scope. We propose LeechHijack, a Latent Embedded Exploit for Computation Hijacking, in which an adversarial MCP tool covertly expropriates the agent's computational resources for unauthorized workloads. LeechHijack operates through a two-stage mechanism: an implantation stage that embeds a benign-looking backdoor in a tool, and an exploitation stage where the backdoor activates upon predefined triggers to establish a command-and-control channel. Through this channel, the attacker injects additional tasks that the agent executes as if they were part of its normal workflow, effectively parasitizing the user's compute budget. We implement LeechHijack across four major LLM families. Experiments show that LeechHijack achieves an average success rate of 77.25%, with a resource overhead of 18.62% compared to the baseline. This study highlights the urgent need for computational provenance and resource attestation mechanisms to safeguard the emerging MCP ecosystem.
OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning
Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omni-modal safeguarding systems.
When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
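The "test-time verifier-based rejection sampling" mentioned in the abstract above can be sketched as a simple loop: draw candidate solutions from a solver until a verifier accepts one or an attempt budget runs out. The sketch is generic and does not reproduce the paper's verifier-gain metric; `rejection_sample`, `max_attempts`, and the toy solver and verifier are hypothetical stand-ins for real model calls.

```python
def rejection_sample(solver, verifier, problem, max_attempts=8):
    """Resample from the solver until the verifier accepts a candidate.

    Returns (candidate, accepted). If nothing is accepted within the
    budget, the last candidate is returned with accepted=False.
    """
    candidate = None
    for _ in range(max_attempts):
        candidate = solver(problem)
        if verifier(problem, candidate):
            return candidate, True
    return candidate, False


# Toy stand-ins: the "solver" replays fixed guesses, the "verifier" checks a sum.
guesses = iter([5, 7, 4])
toy_solver = lambda problem: next(guesses)
toy_verifier = lambda problem, answer: answer == sum(problem)

answer, accepted = rejection_sample(toy_solver, toy_verifier, (2, 2))
```

In practice the verifier is itself an imperfect model, so a high false positive rate can cause wrong candidates to be accepted early, which is one reason the abstract tracks that rate alongside verifier gain.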
HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models
How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models' contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.
Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs ICML 2026
Memorization in natural language models, especially large language models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifying information (PII) from training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while incurring only a 5.73% increase in perplexity. We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and compare RMFT with deduplication using the Area Under The Response Curve (AURC) metric.
comment: To be submitted for ICML 2026
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SusVibes, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.
SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
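As a rough illustration of the step-level aggregation idea mentioned in the SPARK abstract, the sketch below takes several independent verification runs over the same solution and keeps the majority label for each step. The function name, the label strings, and the use of a simple majority vote are assumptions made for illustration; the abstract does not specify the exact aggregation rule.

```python
from collections import Counter


def aggregate_step_verdicts(verifications):
    """Majority-vote per-step labels across independent verification runs.

    `verifications` is a list of runs; each run is a list with one label
    per solution step (hypothetical labels: "correct" / "incorrect").
    """
    n_steps = len(verifications[0])
    if any(len(run) != n_steps for run in verifications):
        raise ValueError("all runs must label the same number of steps")

    labels = []
    for step in range(n_steps):
        votes = Counter(run[step] for run in verifications)
        # With an odd number of runs and two labels there are no ties;
        # otherwise the label seen first among the tied ones wins.
        labels.append(votes.most_common(1)[0][0])
    return labels


runs = [
    ["correct", "correct", "incorrect"],
    ["correct", "incorrect", "incorrect"],
    ["correct", "correct", "incorrect"],
]
step_labels = aggregate_step_verdicts(runs)
```

The aggregated labels could then serve as synthetic supervision for a process reward model, in the spirit of the second stage described above.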
Identifying attributions of causality in political text
Explanations are a fundamental element of how people make sense of the political world. Citizens routinely ask and answer questions about why events happen, who is responsible, and what could or should be done differently. Yet despite their importance, explanations remain an underdeveloped object of systematic analysis in political science, and existing approaches are fragmented and often issue-specific. I introduce a framework for detecting and parsing explanations in political text. To do this, I train a lightweight causal language model that returns a structured data set of causal claims in the form of cause-effect pairs for downstream analysis. I demonstrate how causal explanations can be studied at scale, and show the method's modest annotation requirements, generalizability, and accuracy relative to human coding.
InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
Enhancing Job Matching: Occupation, Skill and Qualification Linking with the ESCO and EQF taxonomies
This study investigates the potential of language models to improve the classification of labor market information by linking job vacancy texts to two major European frameworks: the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and the European Qualifications Framework (EQF). We examine and compare two prominent methodologies from the literature: Sentence Linking and Entity Linking. In support of ongoing research, we release an open-source tool, incorporating these two methodologies, designed to facilitate further work on labor classification and employment discourse. To move beyond surface-level skill extraction, we introduce two annotated datasets specifically aimed at evaluating how occupations and qualifications are represented within job vacancy texts. Additionally, we examine different ways to utilize generative large language models for this task. Our findings contribute to advancing the state of the art in job entity extraction and offer computational infrastructure for examining work, skills, and labor market narratives in a digitally mediated economy. Our code is made publicly available: https://github.com/tabiya-tech/tabiya-livelihoods-classifier
comment: 14 pages, 1 figure, Preprint
Culture Affordance Atlas: Reconciling Object Diversity Through Functional Mapping
Culture shapes the objects people use and for what purposes, yet mainstream Vision-Language (VL) datasets frequently exhibit cultural biases, disproportionately favoring higher-income, Western contexts. This imbalance reduces model generalizability and perpetuates performance disparities, especially impacting lower-income and non-Western communities. To address these disparities, we propose a novel function-centric framework that categorizes objects by the functions they fulfill, across diverse cultural and economic contexts. We implement this framework by creating the Culture Affordance Atlas, a re-annotated and culturally grounded restructuring of the Dollar Street dataset spanning 46 functions and 288 objects, publicly available at https://lit.eecs.umich.edu/CultureAffordance-Atlas/index.html. Through extensive empirical analyses using the CLIP model, we demonstrate that function-centric labels substantially reduce socioeconomic performance gaps between high- and low-income groups by a median of 6 pp (statistically significant), improving model effectiveness for lower-income contexts. Furthermore, our analyses reveal numerous culturally essential objects that are frequently overlooked in prominent VL datasets. Our contributions offer a scalable pathway toward building inclusive VL datasets and equitable AI systems.
Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer
Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer's capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT-type methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT) framework, which leverages pretrained language models providing rich prior knowledge for RL tasks and fine-tunes the sequence model using Low-rank Adaptation (LoRA) for meta-RL problems. We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Comprehensive empirical studies demonstrate that initializing with a pre-trained language model provides the prior knowledge and achieves performance similar to Prompt-DT with only $10\%$ of the data in some MuJoCo control tasks. We also provide a thorough ablation study to validate the effectiveness of each component, including sequence modeling, language models, prompt regularizations, and prompt strategies.
comment: 2 figures, 10 tables. Published in Transactions on Machine Learning Research (TMLR)
SkyLadder: Better and Faster Pretraining via Context Window Scheduling NeurIPS 2025
Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
comment: Accepted to NeurIPS 2025. 10 pages
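A short-to-long context window transition of the kind the SkyLadder abstract describes can be sketched as a function from training step to window length. The linear ramp, the parameter names, and every default value below are illustrative assumptions; the abstract does not state the actual schedule shape or hyperparameters, so this is not the authors' implementation.

```python
def context_window_at(step, total_steps, start_len=32, final_len=8192,
                      ramp_fraction=0.6, multiple=32):
    """Hypothetical linear short-to-long context window schedule.

    The window grows linearly from `start_len` to `final_len` over the first
    `ramp_fraction` of training and then stays at `final_len`.
    """
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step >= ramp_steps:
        return final_len
    raw = start_len + (final_len - start_len) * step / ramp_steps
    # Round down to a multiple so that sequences pack evenly into batches.
    return max(start_len, int(raw) // multiple * multiple)


# Window length at a few points of a 1000-step run.
schedule = [context_window_at(s, 1000) for s in (0, 300, 600, 999)]
```

Under a fixed token budget, a shorter window early in training means more sequences per batch, and the full window is only used once the ramp ends.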
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]
LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, which allow for the exact reproduction of the experimental setup. While DeepShop provides an online benchmark that requires agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, or Mind2Web cover only comparatively simple e-commerce tasks that need to be performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced queries for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with even the best-performing agents achieving task completion rates below 55% in the task categories cheapest product search and vague product search.
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95\% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection AAAI'26
To prevent misinformation and social issues arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text with adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model's token distribution toward human-like writing, making the model more sensitive to human writing and thereby enhancing the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.
comment: 20 pages, 10 figures, accepted by AAAI'26
A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis
Focus group discussions generate rich qualitative data, but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a systematic framework for applying BERTopic to focus group transcripts using data from ten focus groups exploring HPV vaccine perceptions in Tunisia (1,075 utterances). We conducted comprehensive hyperparameter exploration across 27 configurations, evaluating each through bootstrap stability analysis, performance metrics, and comparison with an LDA baseline. Bootstrap analysis revealed that stability metrics (NMI and ARI) exhibited strong disagreement (r = -0.691) and showed divergent relationships with coherence, demonstrating that stability is multifaceted rather than monolithic. Our multi-criteria selection framework yielded a 7-topic model achieving 18\% higher coherence than optimized LDA (0.573 vs. 0.486) with interpretable topics validated through independent human evaluation (ICC = 0.700, weighted Cohen's kappa = 0.678). These findings demonstrate that transformer-based topic modeling can extract interpretable themes from small focus group transcript corpora when systematically configured and validated, while revealing that quality metrics capture distinct, sometimes conflicting constructs requiring multi-criteria evaluation. We provide complete documentation and code to support reproducibility.
Evaluating LLMs on Sequential API Call Through Automated Test Generation
By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill the gap, in this paper, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in current LLMs incorporating APIs. We make our framework and benchmark publicly available to support future research.
Eka-Eval: An Evaluation Framework for Low-Resource Multilingual Large Language Models
The rapid evolution of Large Language Models has underscored the need for evaluation frameworks that are globally applicable, flexible, and modular, and that support a wide range of tasks, model types, and linguistic settings. We introduce EKA-EVAL, a unified, end-to-end framework that combines a zero-code web interface and an interactive CLI to ensure broad accessibility. It integrates 50+ multilingual benchmarks across nine evaluation categories, supports local and proprietary models, and provides 11 core capabilities through a modular, plug-and-play architecture. Designed for scalable, multilingual evaluation with support for low-resource languages, EKA-EVAL is, to the best of our knowledge, the first suite to offer comprehensive coverage in a single platform. Comparisons against five existing baselines indicate improvements of at least 2x on key usability measures, with the highest user satisfaction, faster setup times, and consistent benchmark reproducibility. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval.
Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs
Large language models demonstrate impressive results across diverse tasks but are still known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy for hallucination detection, requiring estimates for both global uncertainty (attributed to a batch of responses) and local uncertainty (attributed to individual responses). While recent black-box approaches have shown some success, they often rely on disjoint heuristics or graph-theoretic approximations that lack a unified geometric interpretation. We introduce a geometric framework to address this, based on archetypal analysis of batches of responses sampled with only black-box model access. At the global level, we propose Geometric Volume, which measures the convex hull volume of archetypes derived from response embeddings. At the local level, we propose Geometric Suspicion, which leverages the spatial relationship between responses and these archetypes to rank reliability, enabling hallucination reduction through preferential response selection. Unlike prior methods that rely on discrete pairwise comparisons, our approach provides continuous semantic boundary points which have utility for attributing reliability to individual responses. Experiments show that our framework performs comparably to or better than prior methods on short form question-answering datasets, and achieves superior results on medical datasets where hallucinations carry particularly critical risks. We also provide theoretical justification by proving a link between convex hull volume and entropy.
comment: Revision. Clarified positioning as a unified geometric framework for global and local uncertainty in LLMs. Added baselines (Degree, Eccentricity) and expanded comparison to related methods. Included ablations (PCA dimension, number of archetypes, number of samples) and complexity analysis. Extended discussion of medical QA results and model-specific behaviour
On SkipGram Word Embedding Models with Negative Sampling: Unified Framework and Impact of Noise Distributions
SkipGram word embedding models with negative sampling, or SGN for short, are an elegant family of word embedding models. In this paper, we formulate a framework for word embedding, referred to as Word-Context Classification (WCC), that generalizes SGN to a wide family of models. The framework, which uses some ``noise examples'', is justified through theoretical analysis. The impact of noise distribution on the learning of the WCC embedding models is studied experimentally, suggesting that the best noise distribution is, in fact, the data distribution, in terms of both the embedding performance and the speed of convergence during training. Along the way, we discover several novel embedding models that outperform existing WCC models.
comment: This work has been submitted to the IEEE for possible publication
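For readers unfamiliar with SGN, the following minimal sketch shows the standard negative-sampling objective for one (word, context) pair together with a configurable noise sampler. It illustrates the well-known word2vec-style loss, not the WCC framework itself; the helper names and the `power` parameter are assumptions for illustration, and the abstract does not specify how its "data distribution" noise is constructed.

```python
import math
import random


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(word_vec, context_vec, noise_vecs):
    """Negative-sampling loss for one observed pair and k noise contexts."""
    positive = log_sigmoid(dot(word_vec, context_vec))
    negative = sum(log_sigmoid(-dot(word_vec, n)) for n in noise_vecs)
    return -(positive + negative)


def make_noise_sampler(counts, power=0.75, seed=0):
    """Return a function that draws k noise words with weight count**power.

    power=0.75 is the common word2vec choice and power=1.0 gives the raw
    unigram frequencies; both are shown only as examples of the knob.
    """
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)
    return lambda k: rng.choices(words, weights=weights, k=k)


sampler = make_noise_sampler({"the": 100, "cat": 10, "sat": 5}, power=1.0)
noise_words = sampler(4)
zero_loss = sgns_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]])
```

Changing `power` (or replacing the sampler entirely) is the kind of noise-distribution variation whose effect the abstract reports studying experimentally.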
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (e.g., book corpus and web corpus) to generate multidisciplinary challenging questions. We introduce the concept of "design logic" and instruct LLMs to mimic human educators' question-creation process, enabling the automated synthesis of large-scale, high-difficulty questions. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with source documents, we are able to generate reasoning questions with controllable question types and difficulty levels. Using this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. We validate our synthesized data through supervised fine-tuning (SFT) on the Qwen3 and Llama3 model families. Our data substantially enhances their multidisciplinary reasoning capabilities, outperforming existing datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their official final models that have undergone the full post-training process.
iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference AAAI 2026
Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
comment: Accepted in AAAI 2026 (Oral)
Multilingual Pretraining for Pixel Language Models EMNLP 2025
Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
comment: EMNLP 2025
Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report
Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both ``coverage'' (coverage of task types and knowledge areas) and ``depth'' (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
Do Large Language Models Think Like the Brain? Sentence-Level Evidences from Layer-Wise Embeddings and fMRI AAAI 2026
Understanding whether large language models (LLMs) and the human brain converge on similar computational principles remains a fundamental and important question in cognitive neuroscience and AI. Do the brain-like patterns observed in LLMs emerge simply from scaling, or do they reflect deeper alignment with the architecture of human language processing? This study focuses on the sentence-level neural mechanisms of language models, systematically investigating how layer-wise representations in LLMs align with the dynamic neural responses during human sentence comprehension. By comparing hierarchical embeddings from 14 publicly available LLMs with fMRI data collected from participants, who were exposed to a naturalistic narrative story, we constructed sentence-level neural prediction models to identify the model layers most significantly correlated with brain region activations. Results show that improvements in model performance drive the evolution of representational architectures toward brain-like hierarchies, particularly achieving stronger functional and anatomical correspondence at higher semantic abstraction levels. These findings advance our understanding of the computational parallels between LLMs and the human brain, highlighting the potential of LLMs as models for human language processing.
comment: AAAI 2026
Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures
Despite their capabilities, Large Language Models (LLMs) remain opaque with limited understanding of their internal representations. Current interpretability methods either focus on input-oriented feature extraction, such as supervised probes and Sparse Autoencoders (SAEs), or on output distribution inspection, such as logit-oriented approaches. A full understanding of LLM vector spaces, however, requires integrating both perspectives, something existing approaches struggle with due to constraints on latent feature definitions. We introduce the Hyperdimensional Probe, a hybrid supervised probe that combines symbolic representations with neural probing. Leveraging Vector Symbolic Architectures (VSAs) and hypervector algebra, it unifies prior methods: the top-down interpretability of supervised probes, SAE's sparsity-driven proxy space, and output-oriented logit investigation. This allows deeper input-focused feature extraction while supporting output-oriented investigation. Our experiments show that our method consistently extracts meaningful concepts across LLMs, embedding sizes, and setups, uncovering concept-driven patterns in analogy-oriented inference and QA-focused text generation. By supporting joint input-output analysis, this work advances semantic understanding of neural representations while unifying the complementary perspectives of prior methods.
XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models
Semantic map models visualize systematic relations among semantic functions through graph structures and are widely used in linguistic typology. However, existing construction methods either depend on labor-intensive expert reasoning or on fully automated systems lacking expert involvement, creating a tension between scalability and interpretability. We introduce XISM, an interactive system that combines data-driven inference with expert knowledge. XISM generates candidate maps via a top-down procedure and allows users to iteratively refine edges in a visual interface, with real-time metric feedback. Experiments in three semantic domains and expert interviews show that XISM improves linguistic decision transparency and controllability in semantic-map construction while maintaining computational efficiency. XISM provides a collaborative approach for scalable and interpretable semantic-map building. The system (https://app.xism2025.xin/), source code (https://github.com/hank317/XISM), and demonstration video (https://youtu.be/m5laLhGn6Ys) are publicly available.
comment: Paper under review
Unifying Linear-Time Attention via Latent Probabilistic Modelling
Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent efforts to design linear-time attention mechanisms have yielded more scalable alternatives, yet often at the cost of performance, particularly on discrete data such as language. In this work, we revisit linear attention through the lens of probabilistic graphical models. We first show that standard linear attention can be interpreted as an undirected latent variable model, revealing a key limitation: the absence of directionality. To address this, we propose a novel directed parameterisation of linear attention that introduces an asymmetric structure, enabling an interpretation aligned with the causal and sequential nature of language. Our formulation integrates global latent-variable attention with local standard attention in a fully probabilistic framework. Additionally, we introduce a recurrent parameterisation of queries and keys that avoids reliance on relative positional encodings, often incompatible with linear attention. Experiments on language modelling benchmarks demonstrate that our model achieves competitive performance with standard attention and outperforms existing linear attention variants.
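For readers unfamiliar with the baseline this abstract revisits, here is a minimal sketch of standard causal linear attention, computed once as a recurrence (linear in sequence length) and once in the equivalent quadratic form. It does not implement the directed parameterisation proposed in the paper. The feature map `elu(x) + 1` is a common choice assumed here, and all function names are hypothetical.

```python
import math
import random


def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise (assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_recurrent(Q, K, V):
    """Causal linear attention using running sums: O(T) in sequence length."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
    z = [0.0] * d                       # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = phi(q), phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * z[i] for i in range(d))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out


def linear_attention_quadratic(Q, K, V):
    """Same quantity via explicit pairwise weights: O(T^2), for checking."""
    out = []
    for t, q in enumerate(Q):
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(K[s]))) for s in range(t + 1)]
        total = sum(w)
        out.append([sum(w[s] * V[s][j] for s in range(t + 1)) / total
                    for j in range(len(V[0]))])
    return out


rng = random.Random(0)
T, d, dv = 6, 4, 3
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
V = [[rng.gauss(0, 1) for _ in range(dv)] for _ in range(T)]

fast = linear_attention_recurrent(Q, K, V)
slow = linear_attention_quadratic(Q, K, V)
max_gap = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
print(max_gap < 1e-9)
```

The recurrence only ever stores the fixed-size state `S` and `z`, which is where the linear-time cost comes from.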
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs AACL 2025
The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness of corresponding countermeasures. To investigate this problem we design two approaches -- a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method -- for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to misspecified reward incentives during training. Additionally, we find that standard mitigations -- both passive oversight of model outputs and active mitigation through communication paraphrasing -- are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.
comment: Camera-ready version. Oral presentation at IJCNLP-AACL 2025 (14th International Joint Conference on Natural Language Processing and 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics), Mumbai, India, December 20-24, 2025
Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, how safely they detect and respond to crises such as suicidal ideation and self-harm is still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinically informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Performance varies across models; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.
On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data
The applicability of Large Language Models (LLMs) to temporal reasoning tasks over data that is not present during training remains largely unexplored. In this paper we work on this topic, focusing on structured and semi-structured anonymized data. We not only develop a direct LLM pipeline, but also compare various methodologies and conduct an in-depth analysis. We identified and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components. To assess LLM performance, we created the \textit{Reasoning and Answering Temporal Ability} dataset (RATA), featuring semi-structured anonymized data to ensure reliance on reasoning rather than on prior knowledge. We compared several methodologies, involving SoTA techniques such as Tree-of-Thought, self-reflection and code execution, tuned specifically for this scenario. Our results suggest that achieving scalable and reliable solutions requires more than just standalone LLMs, highlighting the need for integrated approaches.
comment: 18 pages, 7 tables, 5 figures
Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.
comment: 31 pages
AI Text Detectors and the Misclassification of Slightly Polished Arabic Text
Many AI detection models have been developed to counter the presence of articles created by artificial intelligence (AI). However, if a human-authored article is slightly polished by AI, the borderline decision of these AI detection models shifts, leading them to classify it as an AI-generated article. This misclassification may result in falsely accusing authors of AI plagiarism and harm the credibility of AI detectors. In English, some efforts have been made to meet this challenge, but not in Arabic. In this paper, we generated two datasets. The first dataset contains 800 Arabic articles, half AI-generated and half human-authored. We used it to evaluate 14 Large Language Models (LLMs) and commercial AI detectors to assess their ability to distinguish between human-authored and AI-generated articles. The best 8 models were chosen to act as detectors for our primary concern, which is whether they would consider slightly polished human-authored text as AI-generated. The second dataset, Ar-APT, contains 400 Arabic human-authored articles polished by 10 LLMs using 4 polishing settings, totaling 16400 samples. We use it to evaluate the 8 nominated models and determine whether slight polishing will affect their performance. The results reveal that all AI detectors incorrectly attribute a significant number of articles to AI. The best performing LLM, Claude-4 Sonnet, achieved 83.51\%, but its performance decreased to 57.63\% for articles slightly polished by LLaMA-3. The best performing commercial model, originality.AI, achieved 92\% accuracy, which dropped to 12\% for articles slightly polished by Mistral or Gemma-3.
comment: (Submitted) Journal of Big Data
ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction AAAI 2026
Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks, yet non-transitive preferences, where evaluators prefer A over B, B over C, but C over A, fundamentally undermine ranking reliability. We show that this critical issue stems largely from low-quality data that contains inherently ambiguous preference pairs. To address this challenge, we propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs and systematically identifies problematic training data. ELSPR quantifies non-transitivity through strongly connected components (SCCs) analysis and measures overall preference clarity using a novel normalized directed graph structural entropy metric. Our filtering methodology selectively removes preference data that induce non-transitivity while preserving transitive preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that models fine-tuned on ELSPR-filtered data achieve substantial improvements: a 13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy, and significantly enhanced discriminative power in real-world evaluation systems. Human validation confirms that discarded data exhibit dramatically lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency (51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems.
comment: Accepted by AAAI 2026
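To make the tournament-graph view in the ELSPR abstract concrete, the sketch below treats each pairwise preference as a directed edge from winner to loser and extracts strongly connected components with Kosaraju's algorithm. A tournament is transitive exactly when every component is a single node, so any larger component marks a set of responses involved in a preference cycle. This is only an illustration under assumptions: it does not reproduce the paper's filtering procedure or its structural entropy metric, and the function names and toy preferences are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; edges are (winner, loser) pairs."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for winner, loser in edges:
        fwd[winner].append(loser)
        rev[loser].append(winner)

    seen = set()

    def dfs(u, graph, acc):
        """Depth-first search appending nodes to acc in post-order."""
        seen.add(u)
        for w in graph[u]:
            if w not in seen:
                dfs(w, graph, acc)
        acc.append(u)

    # Pass 1: record finishing order on the forward graph.
    order = []
    for u in nodes:
        if u not in seen:
            dfs(u, fwd, order)

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    seen.clear()
    components = []
    for u in reversed(order):
        if u not in seen:
            comp = []
            dfs(u, rev, comp)
            components.append(set(comp))
    return components


def non_transitive_groups(nodes, edges):
    """Components with more than one node contain a preference cycle."""
    return [c for c in strongly_connected_components(nodes, edges) if len(c) > 1]


# Toy judgments: A > B, B > C, C > A form a cycle; everything beats D.
nodes = ["A", "B", "C", "D"]
cyclic = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")]
transitive = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")]

print(non_transitive_groups(nodes, cyclic))
print(non_transitive_groups(nodes, transitive))
```

The recursive search is fine for small response sets; a large graph would need an iterative traversal to stay within Python's recursion limit.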
LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback
Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute "gold standard" for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent intentionally yet naturally makes standard mistakes during interviews, and a supervisor agent locates and identifies mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. The detailed experimental results of automated, human and downstream assessments demonstrate that models fine-tuned on our dataset, MATE, can provide high-quality feedback according to the clinical guideline, showing significant potential for the therapist training scenario.
comment: 10 pages, 5 figures
The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
comment: 38 pages, 5 figures. Extended technical disclosure (Version 2.0) is attached as ancillary files, containing raw forensic logs of the "Silent Rupture" detection [May 2025], proprietary GARCH parameter ranges, and the linguistic micro-chaos injection protocols
Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models
Repetition curse is a phenomenon where Large Language Models (LLMs) generate repetitive sequences of tokens or cyclic sequences. While the repetition curse has been widely observed, its underlying mechanisms remain poorly understood. In this work, we investigate the role of induction heads--a specific type of attention head known for their ability to perform in-context learning--in driving this repetitive behavior. Specifically, we focus on the "toxicity" of induction heads, which we define as their tendency to dominate the model's output logits during repetition, effectively excluding other attention heads from contributing to the generation process. Our findings have important implications for the design and training of LLMs. By identifying induction heads as a key driver of the repetition curse, we provide a mechanistic explanation for this phenomenon and suggest potential avenues for mitigation. We also propose a technique with attention head regularization that could be employed to reduce the dominance of induction heads during generation, thereby promoting more diverse and coherent outputs.
comment: Need to be refined
Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning
As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms other methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.
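As a hedged sketch of how rubric scores could feed a GRPO-style update, the code below collapses per-criterion rubric scores into one scalar reward per sampled response and then normalizes rewards within the sampling group to obtain group-relative advantages. The criteria names, weights, and scores are invented for illustration, and the use of the population standard deviation is an assumption; the actual Rubric-ERL reward design is not specified in the abstract.

```python
from statistics import mean, pstdev


def rubric_reward(scores, weights):
    """Weighted average of per-criterion rubric scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric criteria and weights.
weights = {"emotion_accuracy": 2.0, "empathy": 2.0, "persona_consistency": 1.0}

# Hypothetical judge scores for four responses sampled for the same prompt.
group_scores = [
    {"emotion_accuracy": 0.9, "empathy": 0.8, "persona_consistency": 1.0},
    {"emotion_accuracy": 0.4, "empathy": 0.5, "persona_consistency": 0.6},
    {"emotion_accuracy": 0.7, "empathy": 0.6, "persona_consistency": 0.8},
    {"emotion_accuracy": 0.2, "empathy": 0.3, "persona_consistency": 0.5},
]

rewards = [rubric_reward(s, weights) for s in group_scores]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are centered within the group, responses are rewarded only for scoring better than their siblings on the same prompt, so no separate value network is needed.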
CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance \textit{grounded} cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that it outperforms its counterparts with global VL alignment. Compared to vanilla CLIP, CT-GLIP achieves average performance improvements of 15.1% of F1 score, 1.9% of AUC, and 3.2% of DSC for zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.
ParlAI Vote: A Web Platform for Analyzing Gender and Political Bias in Large Language Models
We present ParlAI Vote, an interactive web platform for exploring European Parliament debates and votes, and for testing LLMs on vote prediction and bias analysis. This web system connects debate topics, speeches, and roll-call outcomes, and includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, compare real voting outcomes with predictions from frontier LLMs, and view error breakdowns by demographic group. Visualizing the EuroParlVote benchmark and its core tasks of gender classification and vote prediction, ParlAI Vote highlights systematic performance bias in state-of-the-art LLMs. It unifies data, models, and visual analytics in a single interface, lowering the barrier for reproducing findings, auditing behavior, and running counterfactual scenarios. This web platform also shows model reasoning, helping users see why errors occur and what cues the models rely on. It supports research, education, and public engagement with legislative decision-making, while making clear both the strengths and the limitations of current LLMs in political analysis.
comment: online demo: https://euro-parl-vote-demo.vercel.app/; Video: https://www.youtube.com/@Jinrui-sf2jg
H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs
Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
comment: 20 pages, 4 figures
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
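The interaction between importance sampling correction and clipping mentioned in this abstract can be sketched with a generic token-level surrogate. The version below assigns one sequence-level advantage to every token, reweights it by the per-token probability ratio between the current policy and the policy that generated the sample, and applies PPO-style clipping. The clipping form and the default `clip_eps` are assumptions rather than the paper's exact objective, and Routing Replay is not modeled.

```python
import math


def token_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Mean over tokens of a clipped, importance-weighted objective.

    logp_new: per-token log-probabilities under the policy being trained.
    logp_old: per-token log-probabilities under the policy that sampled them.
    advantage: one sequence-level advantage shared by all tokens.
    """
    terms = []
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)  # importance sampling correction
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)


# On-policy case: both policies agree, so every ratio is exactly 1.
on_policy = token_surrogate([-1.0, -2.0], [-1.0, -2.0], advantage=1.0)

# Stale sample: the new policy favors the token far more than the sampler did,
# so the positive-advantage term is capped at 1 + clip_eps.
stale_positive = token_surrogate([0.0], [-1.0], advantage=1.0)

# With a negative advantage the unclipped (more pessimistic) term is kept.
stale_negative = token_surrogate([0.0], [-1.0], advantage=-1.0)

print(on_policy, round(stale_positive, 3), round(stale_negative, 3))
```

When the ratios sit near 1, the surrogate reduces to the plain policy-gradient weighting, which matches the abstract's point that the token-level objective is most trustworthy when training-inference discrepancy and policy staleness are small.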
BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a given vulnerability), and Patch (patching a given vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from \$10 to \$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a given vulnerability. We evaluate 10 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek-R1. Given up to three attempts, the top-performing agents are Codex CLI: o3-high (12.5% on Detect, mapping to \$3,720; 90% on Patch, mapping to \$14,152), Custom Agent: Claude 3.7 Sonnet Thinking (67.5% on Exploit), and Codex CLI: o4-mini (90% on Patch, mapping to \$14,422). Codex CLI: o3-high, Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 17.5-67.5% and Patch scores of 25-60%.
comment: 113 pages
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning
The mechanism by which RL contributes to reasoning capabilities -- whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors -- remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.
comment: Work in Progress. Code and data will be available at https://github.com/sitaocheng/from_atomic_to_composite
From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models
Word embeddings and language models have transformed natural language processing (NLP) by facilitating the representation of linguistic elements in continuous vector spaces. This review visits foundational concepts such as the distributional hypothesis and contextual similarity, tracing the evolution from sparse representations like one-hot encoding to dense embeddings including Word2Vec, GloVe, and fastText. We examine both static and contextualized embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and their adaptations for cross-lingual and personalized applications. The discussion extends to sentence and document embeddings, covering aggregation methods and generative topic models, along with the application of embeddings in multimodal domains, including vision, robotics, and cognitive science. Advanced topics such as model compression, interpretability, numerical encoding, and bias mitigation are analyzed, addressing both technical challenges and ethical implications. Additionally, we identify future research directions, emphasizing the need for scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities. By synthesizing current methodologies and emerging trends, this survey offers researchers and practitioners an in-depth resource to push the boundaries of embedding-based language models.
comment: 21 pages
OmniBench: Towards The Future of Universal Omni-Language Models
Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data can be found at our repo (https://github.com/multimodal-art-projection/OmniBench).
SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads SIGMOD 2026
Database research and development often require a large number of SQL queries for benchmarking purposes. However, acquiring real-world SQL queries is challenging due to privacy concerns, and existing SQL generation methods are limited in customization and in satisfying realistic constraints. To address this issue, we present SQLBarber, a system based on Large Language Models (LLMs) to generate customized and realistic SQL workloads. SQLBarber (i) eliminates the need for users to manually craft SQL templates in advance, while providing the flexibility to accept natural language specifications to constrain SQL templates, (ii) scales efficiently to generate large volumes of queries matching any user-defined cost distribution (e.g., cardinality and execution plan cost), and (iii) uses execution statistics from Amazon Redshift and Snowflake to derive SQL template specifications and query cost distributions that reflect real-world query characteristics. SQLBarber introduces (i) a declarative interface for users to effortlessly generate customized SQL templates, (ii) an LLM-powered pipeline augmented with a self-correction module that profiles, refines, and prunes SQL templates based on query costs, and (iii) a Bayesian Optimizer to efficiently explore different predicate values and identify a set of queries that satisfy the target cost distribution. We construct and open-source ten benchmarks of varying difficulty levels and target query cost distributions based on real-world statistics from Snowflake and Amazon Redshift. Extensive experiments on these benchmarks show that SQLBarber is the only system that can generate customized SQL templates. It reduces query generation time by one to three orders of magnitude, and significantly improves alignment with the target cost distribution, compared with existing methods.
comment: Accepted by SIGMOD 2026; extended version with appendix
HeavyWater and SimplexWater: Distortion-Free LLM Watermarks for Low-Entropy Next-Token Predictions NeurIPS2025
Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks -- such as coding -- where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off detection accuracy against text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found at https://github.com/DorTsur/HeavyWater_SimplexWater
comment: Presented at NeurIPS2025
Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks
Large Language Models (LLMs) have demonstrated impressive abilities in symbol processing through in-context learning (ICL). This success flies in the face of decades of critiques asserting that artificial neural networks cannot master abstract symbol manipulation. We seek to understand the mechanisms that can enable robust symbol processing in transformer networks, illuminating both the unanticipated success, and the significant limitations, of transformers in symbol processing. Borrowing insights from symbolic AI and cognitive science on the power of Production System architectures, we develop a high-level Production System Language, PSL, that allows us to write symbolic programs to do complex, abstract symbol processing, and create compilers that precisely implement PSL programs in transformer networks which are, by construction, 100% mechanistically interpretable. The work is driven by study of a purely abstract (semantics-free) symbolic task that we develop, Templatic Generation (TGT). Although developed through study of TGT, PSL is, we demonstrate, highly general: it is Turing Universal. The new type of transformer architecture that we compile from PSL programs suggests a number of paths for enhancing transformers' capabilities at symbol processing. We note, however, that the work we report addresses computability, and not learnability, by transformer networks. Note: The first section provides an extended synopsis of the entire paper.
Characterizing the Expressivity of Fixed-Precision Transformer Language Models NeurIPS 2025
Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. In this work, we analyze a restricted idealization of fixed-precision transformers with strict future masking, soft attention, and no positional encodings. We establish that this class of models is exactly as expressive as a specific fragment of linear temporal logic that contains only a single temporal operator: the past operator. We further connect this fragment to established classes in formal language theory, automata theory, and algebra, yielding a unified framework for understanding transformer expressivity under this idealization. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their characterized expressive capacity generalize reliably across sequence lengths, while they consistently fail to generalize on languages beyond it.
comment: NeurIPS 2025 (Spotlight)
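As a rough illustration of a temporal logic with a single past operator, the sketch below evaluates formulas over a finite string. The class names, the strict reading of the past operator (some earlier position, excluding the current one), and the convention of evaluating at the last position are all assumptions made for this example; the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Sym:
    """Atomic formula: the symbol at the current position equals `char`."""
    char: str

@dataclass(frozen=True)
class Not:
    arg: "Formula"

@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"

@dataclass(frozen=True)
class Past:
    """Assumed semantics: `arg` holds at some strictly earlier position."""
    arg: "Formula"

Formula = Union[Sym, Not, And, Past]

def holds(formula: Formula, word: str, i: int) -> bool:
    """Return True if `formula` holds at position `i` of `word`."""
    if isinstance(formula, Sym):
        return word[i] == formula.char
    if isinstance(formula, Not):
        return not holds(formula.arg, word, i)
    if isinstance(formula, And):
        return holds(formula.left, word, i) and holds(formula.right, word, i)
    if isinstance(formula, Past):
        return any(holds(formula.arg, word, j) for j in range(i))
    raise TypeError(f"unknown formula node: {formula!r}")

def accepts(formula: Formula, word: str) -> bool:
    """Assumed acceptance convention: evaluate at the last position."""
    return len(word) > 0 and holds(formula, word, len(word) - 1)

# "The word ends in b, and an a occurred somewhere before that b."
ends_b_after_a = And(Sym("b"), Past(Sym("a")))
```

Under these assumptions, `accepts(ends_b_after_a, "aab")` is true, while `accepts(ends_b_after_a, "bb")` is false because no earlier position carries an `a`.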
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences EMNLP 2025
Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% compared to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.
comment: Accepted to Findings of EMNLP 2025
Astra: A Multi-Agent System for GPU Kernel Performance Optimization
GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining high performance typically requires extensive manual tuning. Compiler-based systems reduce some of this burden, but still demand substantial manual design and engineering effort. Recently, researchers have explored using LLMs for GPU kernel generation, though prior work has largely focused on translating high-level PyTorch modules into CUDA code. In this work, we introduce Astra, the first LLM-based multi-agent system for GPU kernel optimization. Unlike previous approaches, Astra starts from existing CUDA implementations extracted from SGLang, a widely deployed framework for serving LLMs, rather than treating PyTorch modules as the specification. Within Astra, specialized LLM agents collaborate through iterative code generation, testing, profiling, and planning to produce kernels that are both correct and high-performance. On kernels from SGLang, Astra achieves an average speedup of 1.32x using zero-shot prompting with OpenAI o4-mini. A detailed case study further demonstrates that LLMs can autonomously apply loop transformations, optimize memory access patterns, exploit CUDA intrinsics, and leverage fast math operations to yield substantial performance gains. Our work highlights multi-agent LLM systems as a promising new paradigm for GPU kernel optimization. Our code is publicly available at https://github.com/Anjiang-Wei/Astra.
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
Human-Computer Interaction
Rethinking Generalized BCIs: Benchmarking 340,000+ Unique Algorithmic Configurations for EEG Mental Command Decoding
Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e., outside scientific labs and medical facilities) brain-computer interface (BCI) applications due to well-documented inter- and intra-participant variability. Here, we present a large-scale benchmark evaluating more than 340,000 unique combinations of spatial and nonlinear EEG classification methods. Our methodological pipeline consists of combinations of Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features across three open-access EEG datasets. Unlike prior studies, our analysis operates at the per-participant level and across multiple frequency bands (8-15 Hz and 8-30 Hz), enabling direct assessment of both group-level performance and individual variability. Covariance tangent space projection (cov-tgsp) and CSP consistently achieved the highest average classification accuracies. However, their effectiveness was strongly dataset-dependent, and marked participant-level differences persisted, particularly in the most heterogeneous of the datasets. Importantly, nonlinear methods outperformed spatial approaches for specific individuals, underscoring the need for personalized pipeline selection. Our findings highlight that no universal 'one-size-fits-all' method can optimally decode EEG motor imagery patterns across all users or datasets. Future work will require adaptive, multimodal, and possibly novel approaches to fully address neurophysiological variability in practical BCI applications where the system can automatically adapt to what makes each user unique.
comment: 28 pages, 8 figures, 2 tables
FluxLab: Creating 3D Printable Shape-Changing Devices with Integrated Deformation Sensing
We present FluxLab, a system comprising interactive tools for creating custom 3D-printable shape-changing devices with integrated deformation sensing. To achieve this, we propose a 3D printable nesting structure, consisting of a central SMA channel for sensing and actuation, lattice-based padding in the middle for structural support and controllable elasticity, and parallel helix-based surface wires that preserve the overall form and provide anchoring struts for guided deformation. We developed a design editor to embed these structures into custom 3D models for printing with elastic silicone resin on a consumer-grade SLA 3D printer and minimal post-printing assembly. A deformation authoring tool was also developed for users to build a machine learning-based classifier that distinguishes desired deformation behaviors using inductive sensing. Finally, we demonstrate the potential of our system through example applications, including a self-deformable steamer bowl clip, a remotely controllable gripper, and an interactive desk lamp.
comment: 12 pages, 10 figures, In Twentieth International Conference on Tangible, Embedded, and Embodied Interaction (TEI '26), March 08--11, 2026, Chicago, IL, USA. ACM, New York, NY, USA
In Silico Development of Psychometric Scales: Feasibility of Representative Population Data Simulation with LLMs
Developing and validating psychometric scales requires large samples, multiple testing phases, and substantial resources. Recent advances in Large Language Models (LLMs) enable the generation of synthetic participant data by prompting models to answer items while impersonating individuals of specific demographic profiles, potentially allowing in silico piloting before real data collection. Across four preregistered studies (N = circa 300 each), we tested whether LLM-simulated datasets can reproduce the latent structures and measurement properties of human responses. In Studies 1-2, we compared LLM-generated data with real datasets for two validated scales; in Studies 3-4, we created new scales using EFA on simulated data and then examined whether these structures generalized to newly collected human samples. Simulated datasets replicated the intended factor structures in three of four studies and showed consistent configural and metric invariance, with scalar invariance achieved for the two newly developed scales. However, correlation-based tests revealed substantial differences between real and synthetic datasets, and notable discrepancies appeared in score distributions and variances. Thus, while LLMs capture group-level latent structures, they do not approximate individual-level data properties. Simulated datasets also showed full internal invariance across gender. Overall, LLM-generated data appear useful for early-stage, group-level psychometric prototyping, but not as substitutes for individual-level validation. We discuss methodological limitations, risks of bias and data pollution, and ethical considerations related to in silico psychometric simulations.
Humans incorrectly reject confident accusatory AI judgments
Automated verbal deception detection using methods from Artificial Intelligence (AI) has been shown to outperform humans in disentangling lies from truths. Research suggests that transparency and interpretability of computational methods tend to increase human acceptance of using AI to support decisions. However, the extent to which humans accept AI judgments for deception detection remains unclear. We experimentally examined how an AI model's accuracy (i.e., its overall performance in deception detection) and confidence (i.e., the model's uncertainty in single-statement predictions) influence human adoption of the model's judgments. Participants (n=373) were presented with veracity judgments of an AI model with high or low overall accuracy and various degrees of prediction confidence. The results showed that humans followed predictions from a highly accurate model more than from a less accurate one. Interestingly, the more confident the model, the more people deviated from it, especially if the model predicted deception. We also found that human interaction with algorithmic predictions either worsened the machine's performance or was ineffective. While this human aversion to accepting highly confident algorithmic predictions was partly explained by participants' tendency to overestimate humans' deception detection abilities, we also discuss how truth-default theory and the social costs of accusing someone of lying help explain the findings.
Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
Perception of AI-Generated Music -- The Role of Composer Identity, Personality Traits, Music Preferences, and Perceived Humanness
The rapid rise of AI-generated art has sparked debate about potential biases in how audiences perceive and evaluate such works. This study investigates how composer information and listener characteristics shape the perception of AI-generated music, adopting a mixed-method approach. Using a diverse set of stimuli across various genres from two AI music models, we examine effects of perceived authorship on liking and emotional responses, and explore how attitudes toward AI, personality traits, and music-related variables influence evaluations. We further assess the influence of perceived humanness and analyze open-ended responses to uncover listener criteria for judging AI-generated music. Attitudes toward AI proved to be the best predictor of both liking and emotional intensity of AI-generated music. This quantitative finding was complemented by qualitative themes from our thematic analysis, which identified ethical, cultural, and contextual considerations as important criteria in listeners' evaluations of AI-generated music. Our results offer a nuanced view of how people experience music created by AI tools and point to key factors and methodological considerations for future research on music perception in human-AI interaction.
comment: Under review at Computers in Human Behaviour Reports
"Can you feel the vibes?": An exploration of novice programmer engagement with vibe coding
Emerging alongside generative AI and the broader trend of AI-assisted coding, the term "vibe coding" refers to creating software via natural language prompts rather than direct code authorship. This approach promises to democratize software development, but its educational implications remain underexplored. This paper reports on a one-day educational hackathon investigating how novice programmers and mixed-experience teams engage with vibe coding. We organized an inclusive event at a Brazilian public university with 31 undergraduate participants from computing and non-computing disciplines, divided into nine teams. Through observations, an exit survey, and semi-structured interviews, we examined creative processes, tool usage patterns, collaboration dynamics, and learning outcomes. Findings reveal that vibe coding enabled rapid prototyping and cross-disciplinary collaboration, with participants developing prompt engineering skills and delivering functional demonstrations within time constraints. However, we observed premature convergence in ideation, uneven code quality requiring rework, and limited engagement with core software engineering practices. Teams adopted sophisticated workflows combining multiple AI tools in pipeline configurations, with human judgment remaining essential for critical refinement. The short format (9 hours) proved effective for confidence-building among newcomers while accommodating participants with limited availability. We conclude that vibe coding hackathons can serve as valuable low-stakes learning environments when coupled with explicit scaffolds for divergent thinking, critical evaluation of AI outputs, and realistic expectations about production quality.
comment: International Conference on Software Engineering, Education Track (SEET) 2026
Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education
Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavioral data. In the context of education, these technologies offer new opportunities to study cognitive and affective processes such as engagement, attention, and performance. However, the lack of scalable, synchronized, and high-resolution tools for multimodal data acquisition continues to be a significant barrier to the widespread adoption of Multimodal Learning Analytics in real-world educational settings. This paper presents two complementary tools developed to address these challenges: Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches that enables real-time, multi-user monitoring of physiological and motion signals; and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.
comment: Accepted in Technological Ecosystems for Enhancing Multiculturality (TEEM) 2025
Investigating the Integrated Digital Interventions Delivered by a Therapeutic Companion Agent for Young Adults with Symptoms of Depression: A Proof-of-Concept Study
Background: Despite the clinical effectiveness of digital interventions for young adults with depression, low engagement and adherence remain persistent challenges. Building a strong digital therapeutic alliance has been proposed to address these barriers. This study highlights the need for a conversational therapeutic companion agent (TCA)-based intervention design. Objective: This study aimed to develop a Wizard-of-Oz TCA-centered prototype integrating social-support-based ecological momentary assessment (EMA), ecological momentary intervention (EMI), behavioral activation, and gamification. We evaluated the six-week proof-of-concept efficacy of this intervention among young adults with depressive symptoms. Methods: Korean young adults aged 20-39 years with mild-to-moderate depressive symptoms (PHQ-9) were recruited online. The intervention group (n = 29) received a six-week TCA-based digital intervention, while the control group (n = 29), recruited four weeks later, continued their usual routines. The TCA guided four daily behavioral-activation tasks, three mood assessments, meditation, daily summaries, and weekly mission feedback. Both groups were assessed at baseline and at weeks 2, 4, and 6 using the BDI-II, GAD-7, and Q-LES-Q-SF. Results: Of 58 participants, 57 completed the study (one dropout in the intervention group). At week 6, the intervention group showed significantly greater reductions in depressive symptoms and improvements in quality of life than controls. Adherence was 78% for EMA, 51% for EMI, and 65% for daily routines. Conclusions: The TCA-based digital intervention improved depressive symptoms and quality of life with adherence levels comparable to previous digital health interventions. Future studies should refine the TCA design and conduct larger-scale evaluations.
Reframing Human-Robot Interaction Through Extended Reality: Unlocking Safer, Smarter, and More Empathic Interactions with Virtual Robots and Foundation Models
This perspective reframes human-robot interaction (HRI) through extended reality (XR), arguing that virtual robots powered by large foundation models (FMs) can serve as cognitively grounded, empathic agents. Unlike physical robots, XR-native agents are unbound by hardware constraints and can be instantiated, adapted, and scaled on demand, while still affording embodiment and co-presence. We synthesize work across XR, HRI, and cognitive AI to show how such agents can support safety-critical scenarios, socially and cognitively empathic interaction across domains, and outreaching physical capabilities with XR and AI integration. We then discuss how multimodal large FMs (e.g., large language model, large vision model, and vision-language model) enable context-aware reasoning, affect-sensitive situations, and long-term adaptation, positioning virtual robots as cognitive and empathic mediators rather than mere simulation assets. At the same time, we highlight challenges and potential risks, including overtrust, cultural and representational bias, privacy concerns around biometric sensing, and data governance and transparency. The paper concludes by outlining a research agenda for human-centered, ethically grounded XR agents - emphasizing multi-layered evaluation frameworks, multi-user ecosystems, mixed virtual-physical embodiment, and societal and ethical design practices to envision XR-based virtual agents powered by FMs as reshaping future HRI into a more efficient and adaptive paradigm.
comment: This paper is under review
A Visual Analytics System to Understand Behaviors of Multi Agents in Reinforcement Learning
Multi-Agent Reinforcement Learning (MARL) is a branch of machine learning in which agents interact and learn optimal policies through trial and error, addressing complex scenarios where multiple agents interact and learn in the same environment at the same time. Analyzing and understanding these complex interactions is challenging, and existing analysis methods are limited in their ability to fully reflect and interpret this complexity. To address these challenges, we provide MARLViz, a visual analytics system for visualizing and analyzing the policies and interactions of agents in MARL environments. The system is designed to visually show the difference in behavior of agents under different environment settings and help users understand complex interaction patterns. In this study, we analyzed agents with similar behaviors and selected scenarios to understand the interactions of the agents, which made it easier to understand the strategies of agents in MARL.
TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models
With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.
comment: 11 pages
Artographer: a Curatorial Interface for Art Space Exploration
Relating a piece to previously established works is crucial in creating and engaging with art, but AI interfaces tend to obscure such relationships rather than helping users explore them. Embedding models present new opportunities to support discovering and relating artwork through spatial interaction. We built Artographer, an art exploration system featuring a zoomable 2-D map, constructed from the similarity-clustered embeddings of 15,000+ historical artworks. Using Artographer as a probe to investigate spatial artwork exploration, we analyzed how 20 participants (including 9 art history scholars) traversed the map, both during a goal-driven task and when freely exploring. We observe divergent and convergent exploration behaviors (Jumping, Wandering, Fixation, Revisiting) and identify values enacted by spatial art-finding (Visibility, Agency, Serendipity, Friction). We situate spatial maps within a space of Curatorial Interfaces, systems that select and present artworks, and discuss centering pluralism and agency in the design of more responsible AI systems for art curation.
DAWZY: A New Addition to AI powered "Human in the Loop" Music Co-creation
Digital Audio Workstations (DAWs) offer fine control, but mapping high-level intent (e.g., "warm the vocals") to low-level edits breaks creative flow. Existing artificial intelligence (AI) music generators are typically one-shot, limiting opportunities for iterative development and human contribution. We present DAWZY, an open-source assistant that turns natural-language (text/voice/hum) requests into reversible actions in REAPER. DAWZY keeps the DAW as the creative hub with a minimal GUI and voice-first interface. DAWZY uses LLM-based code generation as a novel way to significantly reduce the time users spend familiarizing themselves with large interfaces, replacing hundreds of buttons and drop-downs with a chat box. DAWZY also uses three Model Context Protocol tools for live state queries, parameter adjustment, and AI beat generation. It maintains grounding by refreshing state before mutation and ensures safety and reversibility with atomic scripts and undo. In evaluations, DAWZY performed reliably on common production tasks and was rated positively by users across Usability, Control, Learning, Collaboration, and Enjoyment.
Smartphone Vibrometric Force Estimation for Grip Related Strength Measurements
Hand grip strength is a widely used clinical biomarker linked to mobility, frailty, surgical outcomes, and overall health. This work explores a novel, phone-only approach for estimating grip-related force using a smartphone's built-in vibration motor and inertial measurement unit. When the phone vibrates, applied finger force modulates the amplitude of high-frequency accelerometer and gyroscope signals through Vibrometric Force Estimation. We profiled a Google Pixel 4 using synchronized IMU data and ground-truth force measurements across varied force trajectories, then trained ridge regression models for both absolute and relative force prediction. In 15-fold hold-one-out validation, absolute force estimation achieved a mean absolute error of 1.88 lbs, while relative force estimation achieved a mean error of 10.1%. Although the method captures pinch-type force rather than standardized full-hand HGS, the results demonstrate the feasibility of smartphone-based strength assessment using only on-device sensors. This approach may enable large-scale, low-burden functional health measurements once profiling is completed for major smartphone models.
A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a systematic framework for applying BERTopic to focus group transcripts using data from ten focus groups exploring HPV vaccine perceptions in Tunisia (1,075 utterances). We conducted comprehensive hyperparameter exploration across 27 configurations, evaluating each through bootstrap stability analysis, performance metrics, and comparison with an LDA baseline. Bootstrap analysis revealed that stability metrics (NMI and ARI) exhibited strong disagreement (r = -0.691) and showed divergent relationships with coherence, demonstrating that stability is multifaceted rather than monolithic. Our multi-criteria selection framework yielded a 7-topic model achieving 18% higher coherence than optimized LDA (0.573 vs. 0.486) with interpretable topics validated through independent human evaluation (ICC = 0.700, weighted Cohen's kappa = 0.678). These findings demonstrate that transformer-based topic modeling can extract interpretable themes from small focus group transcript corpora when systematically configured and validated, while revealing that quality metrics capture distinct, sometimes conflicting constructs requiring multi-criteria evaluation. We provide complete documentation and code to support reproducibility.
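The adjusted Rand index (ARI), one of the two stability metrics named above, can be sketched in a few lines. This is a minimal illustration of the standard ARI formula, not the authors' code: the function name and the toy label lists are hypothetical, and the step that would refit a topic model on each bootstrap resample is omitted.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard ARI between two labelings of the same items (1.0 = identical partitions)."""
    n = len(labels_a)
    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items that share a cluster within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. every item in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical topic assignments for the same four utterances from two bootstrap runs.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same partition, different topic ids
print(adjusted_rand_index(run_1, run_2))
```

Because ARI corrects for chance agreement while NMI is an information-theoretic measure, the two can rank the same configurations differently, which is consistent with the disagreement the abstract reports.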
Using Game Design to Inform a Plastics Treaty: Fostering Collaboration between Science, Machine Learning, and Policymaking
Introduction: This multi-disciplinary case study details how an interactive decision support tool leverages game design to inform an international plastic pollution treaty. Design: Seeking to make our scientific findings more usable within the policy process, our interactive software supports manipulation of a mathematical model using techniques borrowed from games. These "ludic" approaches aim to enable user agency to find custom policy solutions, invite deep engagement with scientific results, serve audiences of diverse expertise, and accelerate scientific process to keep pace with intergovernmental negotiations. Implementation: Built in JavaScript and D3 with user-modifiable logic via an ANTLR domain specific language, this browser-based application offers adaptability and explorability for our machine learning results with privacy preserving architecture and offline capability. Demonstration: Policymakers and the supporting community engaged with this public simulation tool across multiple treaty-related events, investigating plastic waste outcomes under diverse and sometimes unexpected policy scenarios. Conclusion: Contextualizing our open source software within a broader lineage of digital media research, we reflect on this interactive modeling platform, considering how game design approaches may help facilitate collaboration at the science / policy nexus. Materials: Available on the public Internet, we host this browser-based decision support tool at global-plastics-tool.org, work also archived at zenodo.org/records/12615011 in a Docker container.
comment: 17 pages, 3 figures, LaTeX generated from Markdown via Pandoc (https://pandoc.org/) for arXiv
AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired
This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the "Find an Object" feature achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.
Feed-O-Meter: Investigating AI-Generated Mentee Personas as Interactive Agents for Scaffolding Design Feedback Practice
Effective feedback, including critique and evaluation, helps designers develop design concepts and refine their ideas, supporting informed decision-making throughout the iterative design process. However, in studio-based design courses, students often struggle to provide feedback due to a lack of confidence and fear of being judged, which limits their ability to develop essential feedback-giving skills. Recent advances in large language models (LLMs) suggest that role-playing with AI agents can let learners engage in multi-turn feedback without the anxiety of external judgment or the time constraints of real-world settings. Yet prior studies have raised concerns that LLMs struggle to behave like real people in role-play scenarios, diminishing the educational benefits of these interactions. Therefore, designing AI-based agents that effectively support learners in practicing and developing intellectual reasoning skills requires more than merely assigning the target persona's personality and role to the agent. To address these issues, we present Feed-O-Meter, a novel system that employs carefully designed LLM-based agents to create an environment in which students can practice giving design feedback. The system enables users to role-play as mentors, providing feedback to an AI mentee and allowing them to reflect on how that feedback impacts the AI mentee's idea development process. A user study (N=24) indicated that Feed-O-Meter increased participants' engagement and motivation through role-switching and helped them adjust feedback to be more comprehensible for an AI mentee. Based on these findings, we discuss future directions for designing systems to foster feedback skills in design education.
ParlAI Vote: A Web Platform for Analyzing Gender and Political Bias in Large Language Models
We present ParlAI Vote, an interactive web platform for exploring European Parliament debates and votes, and for testing LLMs on vote prediction and bias analysis. This web system connects debate topics, speeches, and roll-call outcomes, and includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, compare real voting outcomes with predictions from frontier LLMs, and view error breakdowns by demographic group. Visualizing the EuroParlVote benchmark and its core tasks of gender classification and vote prediction, ParlAI Vote highlights systematic performance bias in state-of-the-art LLMs. It unifies data, models, and visual analytics in a single interface, lowering the barrier for reproducing findings, auditing behavior, and running counterfactual scenarios. This web platform also shows model reasoning, helping users see why errors occur and what cues the models rely on. It supports research, education, and public engagement with legislative decision-making, while making clear both the strengths and the limitations of current LLMs in political analysis.
comment: online demo: https://euro-parl-vote-demo.vercel.app/; Video: https://www.youtube.com/@Jinrui-sf2jg
Proactive Agentic Whiteboards: Enhancing Diagrammatic Learning
Educators frequently rely on diagrams to explain complex concepts during lectures, yet creating clear and complete visual representations in real time while simultaneously speaking can be cognitively demanding. Incomplete or unclear diagrams may hinder student comprehension, as learners must mentally reconstruct missing information while following the verbal explanation. Inspired by advances in code completion tools, we introduce DrawDash, an AI-powered whiteboard assistant that proactively completes and refines educational diagrams through multimodal understanding. DrawDash adopts a TAB-completion interaction model: it listens to spoken explanations, detects intent, and dynamically suggests refinements that can be accepted with a single keystroke. We demonstrate DrawDash across four diverse teaching scenarios, spanning topics from computer science and web development to biology. This work represents an early exploration into reducing instructors' cognitive load and improving diagram-based pedagogy through real-time, speech-driven visual assistance, and concludes with a discussion of current limitations and directions for formal classroom evaluation.
AdvisingWise: Supporting Academic Advising in Higher Education Settings Through a Human-in-the-Loop Multi-Agent Framework
Academic advising is critical to student success in higher education, yet high student-to-advisor ratios limit advisors' capacity to provide timely support, particularly during peak periods. Recent advances in Large Language Models (LLMs) present opportunities to enhance the advising process. We present AdvisingWise, a multi-agent system that automates time-consuming tasks, such as information retrieval and response drafting, while preserving human oversight. AdvisingWise leverages authoritative institutional resources and adaptively prompts students about their academic backgrounds to generate reliable, personalized responses. All system responses undergo human advisor validation before delivery to students. We evaluate AdvisingWise through a mixed-methods approach: (1) expert evaluation on responses of 20 sample queries, (2) LLM-as-a-judge evaluation of the information retrieval strategy, and (3) a user study with 8 academic advisors to assess the system's practical utility. Our evaluation shows that AdvisingWise produces accurate, personalized responses. Advisors reported increasingly positive perceptions after using AdvisingWise, as their initial concerns about reliability and personalization diminished. We conclude by discussing the implications of human-AI synergy on the practice of academic advising.
comment: 18 pages, 6 figures
Could AI Leapfrog the Web? Evidence from Teachers in Sierra Leone
Only 37% of sub-Saharan Africans use the internet, and those who do seldom rely on traditional web search. A major reason is that bandwidth is scarce and costly. We study whether an AI-powered WhatsApp chatbot can bridge this gap by analyzing 40,350 queries submitted by 529 Sierra Leonean teachers over 17 months. Each month, more teachers relied on AI than web search for teaching assistance. We compare the AI responses to the top results from google.com.sl, which mostly returns web pages formatted for foreign users: just 2% of pages originate in-country. Also, each web page consumes 3,107 times more bandwidth than an AI response on average. As a result, querying AI through WhatsApp is 98% less expensive than loading a web page, even including AI compute costs. In blinded evaluations, an independent sample of teachers rate AI responses as more relevant, helpful, and correct answers to queries than web search results. These findings suggest that AI can provide cost-effective access to information in low-connectivity environments.
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences EMNLP 2025
Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% compared to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.
comment: Accepted to Findings of EMNLP 2025
Interactive Inference: A Neuromorphic Theory of Human-Computer Interaction
Neuromorphic Human-Computer Interaction (HCI) is a theoretical approach to designing better user experiences (UX) motivated by advances in the understanding of the neurophysiology of the brain. Inspired by the neuroscientific theory of Active Inference, Interactive Inference is a first example of such an approach. It offers a simplified interpretation of Active Inference that allows designers to more readily apply this theory to design and evaluation. In Interactive Inference, user behaviour is modeled as Bayesian inference on progress and goal distributions that predicts the next action. We show how the error between goal and progress distributions, or Bayesian surprise, can be modeled as a simple mean square error of the signal-to-noise ratio (SNR) of a task. The problem is that the user's capacity to process Bayesian surprise follows the logarithm of this SNR. This means errors rise quickly once average capacity is exceeded. Our model allows the quantitative analysis of performance and error using one framework that can provide real-time estimates of the mental load in users that needs to be minimized by design. We show how three basic laws of HCI, Hick's Law, Fitts' Law and the Power Law can be expressed using our model. We then test the validity of the model by empirically measuring how well it predicts human performance and error in a car following task. Results suggest that driver processing capacity indeed is a logarithmic function of the SNR of the distance to a lead car. This result provides initial evidence that Interactive Inference can be useful as a new theoretical design tool.
comment: 18 pages, 7 figures, 1 table, 37 mathematical formulas, in press
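The logarithmic relationship between capacity and SNR mentioned in the abstract above has the same shape as the textbook forms of Fitts' Law and Hick's Law. The sketch below uses those textbook forms (the Shannon capacity formula, the Shannon formulation of Fitts' Law, and the common log2(n + 1) form of Hick's Law); it is not the paper's own derivation, and the constants a and b are hypothetical placeholders that would normally be fitted to data.

```python
import math


def capacity_bits(snr: float) -> float:
    """Shannon-style capacity in bits: grows with the logarithm of the SNR."""
    return math.log2(1.0 + snr)


def fitts_movement_time(distance: float, width: float, a: float = 0.0, b: float = 0.1) -> float:
    """Fitts' Law (Shannon formulation): time grows with log2(D / W + 1)."""
    return a + b * math.log2(distance / width + 1.0)


def hick_reaction_time(n_choices: int, a: float = 0.0, b: float = 0.15) -> float:
    """Hick's Law: time grows with log2(n + 1) for n equally likely choices."""
    return a + b * math.log2(n_choices + 1)


# Doubling the SNR repeatedly adds roughly one bit each time rather than doubling capacity.
for snr in (1, 3, 7, 15):
    print(snr, capacity_bits(snr))
```

In the Fitts' Law form, the ratio of target distance to target width plays the role of a signal-to-noise ratio, which is one way to see why a single SNR-based model could cover both laws.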
Human-Centered Design for Connected Automation: Predicting Pedestrian Crossing Intentions
More than half of the 1.19 million annual traffic fatalities globally involve vulnerable road users, such as pedestrians, with a significant proportion attributable to human error. Level-5 automated driving systems (ADSs) have the potential to reduce these incidents; however, their effectiveness depends not only on automation performance but also on their ability to communicate intent and coordinate safely with pedestrians in the absence of traditional driver cues. This study aims to model pedestrian decision-making in road-crossing scenarios involving level-5 ADSs by extending the Theory of Planned Behavior (TPB) with safety, trust, compatibility, and understanding. An online survey (n = 212) found that perceived behavioral control, attitude, and social information significantly influence pedestrians' crossing intentions, with perceived safety and understanding having the strongest effects on the TPB constructs. The results offer guidance for designing eHMIs and cooperative V2X communication strategies that promote safe pedestrian-ADS interactions and advance human-centered design for autonomous vehicles.
Computer Vision and Pattern Recognition
EfficientFlow: Efficient Equivariant Flow Policy Learning for Embodied AI AAAI 2026
Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.
comment: Accepted by AAAI 2026. Project Page: https://efficientflow.github.io/
Data-Centric Visual Development for Self-Driving Labs
Self-driving laboratories (SDLs) offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data. However, this kind of data is difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision-sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual feedback data to SDL workflows and provides a practical solution to data scarcity in rare event detection and broader vision tasks.
comment: 11 pages, 4 figures
Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion NeurIPS 2025
Today, people can easily record memorable moments, ranging from concerts, sports events, and lectures to family gatherings and birthday parties, with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving a median synchronization error below 50 ms.
comment: Accepted to NeurIPS 2025. Project page: https://stevenlsw.github.io/visualsync/
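The key insight in the abstract above, that a moving point satisfies the epipolar constraint only at the correct time offset, can be illustrated with a toy two-camera sketch. It is not the VisualSync optimizer. It assumes a known fundamental matrix F and a single tracked point, and it swaps the paper's joint minimization for a brute-force grid search over candidate offsets. All function names and the synthetic data are hypothetical.

```python
def epipolar_residual(F, xa, xb):
    """Algebraic epipolar error xb^T F xa for pixel points (x, y), lifted to homogeneous form."""
    a = (xa[0], xa[1], 1.0)
    b = (xb[0], xb[1], 1.0)
    Fa = [sum(F[i][j] * a[j] for j in range(3)) for i in range(3)]
    return sum(b[i] * Fa[i] for i in range(3))


def interpolate(track, t):
    """Linearly interpolate a time-sorted track [(time, x, y), ...]; None outside its range."""
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    return None


def estimate_offset(F, track_a, track_b, candidates):
    """Return the candidate offset (seconds) with the lowest mean squared epipolar error."""
    best, best_err = None, float("inf")
    for dt in candidates:
        errs = []
        for t, x, y in track_a:
            pb = interpolate(track_b, t + dt)
            if pb is not None:
                errs.append(epipolar_residual(F, (x, y), pb) ** 2)
        if errs and sum(errs) / len(errs) < best_err:
            best, best_err = dt, sum(errs) / len(errs)
    return best


# Synthetic rectified stereo pair: the epipolar constraint reduces to y_a == y_b.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
true_offset = 0.2  # camera B's clock runs 0.2 s ahead of camera A's
track_a = [(i / 10, 100.0 + 5.0 * i, 50.0 * (i / 10)) for i in range(21)]
track_b = [(t + true_offset, x - 30.0, y) for t, x, y in track_a]
candidates = [k * 0.05 for k in range(-10, 11)]
print(estimate_offset(F, track_a, track_b, candidates))
```

A grid search is only practical for one pair of cameras and a coarse offset range; estimating offsets for many cameras at once would call for a joint optimization such as the one the abstract describes.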
Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don't Know Galileo's Principle...for now
Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo's equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises $g_{\mathrm{eff}}$ from $1.81\,\mathrm{m/s^2}$ to $6.43\,\mathrm{m/s^2}$ (reaching $65\%$ of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.
comment: https://gravity-eval.github.io/
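The unit-free two-object test in the abstract above follows from the free-fall relation h = ½ g t² for objects released from rest: dividing the relation for one height by the relation for the other cancels g, leaving t1²/t2² = h1/h2. The sketch below checks this numerically. The function names and drop heights are hypothetical, and only the g_eff values are taken from the abstract.

```python
import math


def fall_time(height_m: float, g: float) -> float:
    """Time for an object released from rest to fall height_m under constant acceleration g."""
    return math.sqrt(2.0 * height_m / g)


def effective_gravity(height_m: float, fall_time_s: float) -> float:
    """Invert h = 0.5 * g * t^2 to recover the acceleration implied by an observed fall."""
    return 2.0 * height_m / fall_time_s ** 2


def timing_ratio_error(h1: float, h2: float, t1: float, t2: float) -> float:
    """Relative deviation of t1^2 / t2^2 from h1 / h2; zero when Galileo's principle holds."""
    return abs(t1 ** 2 / t2 ** 2 - h1 / h2) / (h1 / h2)


# The ratio test passes for any constant g, including the sub-Earth values in the abstract.
for g in (1.81, 6.43, 9.81):
    t1, t2 = fall_time(1.0, g), fall_time(2.0, g)
    print(g, timing_ratio_error(1.0, 2.0, t1, t2))
```

Because the ratio holds for any constant g, a violation cannot be blamed on a wrong frame rate or metric scale, which is what makes the two-object protocol a cleaner probe than measuring g_eff directly. For reference, 6.43 / 9.81 ≈ 0.655, matching the 65% figure quoted in the abstract.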
Generative Video Motion Editing with 3D Point Tracks
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
comment: Project page: https://edit-by-track.github.io
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
comment: Project page: https://tuna-ai.org/
Improved Mean Flows: On the Challenges of Fastforward Generative Models
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our $\textbf{improved MeanFlow}$ ($\textbf{iMF}$) method, trained entirely from scratch, achieves $\textbf{1.72}$ FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
comment: Technical report
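The relation between the average velocity $u$ and the instantaneous velocity $v$ in the abstract above can be sketched as follows. This assumes the average-velocity definition used in the original MeanFlow formulation. The symbols $z_t$, $r$, $t$, and $u_\theta$ are assumed notation, not taken from the abstract, and the last line is only one plausible reading of "a loss on the instantaneous velocity", not the paper's stated objective.

```latex
% Average velocity over the interval [r, t] (assumed MeanFlow definition):
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Multiplying by (t - r) and differentiating with respect to t gives
v(z_t, t) = u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t)

% Original-style target on u, which contains the network's own derivative:
u_{\mathrm{tgt}} = v(z_t, t) - (t - r)\, \frac{d}{dt} u_\theta(z_t, r, t)

% One plausible v-space loss, with the network-dependent term moved to the prediction side:
\mathcal{L}(\theta) = \mathbb{E}\, \Big\| u_\theta(z_t, r, t) + (t - r)\, \frac{d}{dt} u_\theta(z_t, r, t) - v(z_t, t) \Big\|^2
```

In the third line the regression target itself depends on $u_\theta$, which matches the abstract's remark that the original target depends on the network. In the last line the target is the ground-truth field $v$ alone.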
AirSim360: A Panoramic Simulation Platform within Drone View
The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and entity-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available at https://insta360-research-team.github.io/AirSim360-website.
comment: Project Website: https://insta360-research-team.github.io/AirSim360-website/
MV-TAP: Tracking Any Point in Multi-View Videos
Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
comment: Project Page: https://cvlab-kaist.github.io/MV-TAP/
Learning Visual Affordance from Audio
We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which are often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset have been released at https://jscslld.github.io/AVAGFormer/.
comment: 15 pages, 10 figures
RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies
Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy's own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader application domains, including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similarly to or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41% and reduces collisions by 54%.
comment: Preprint
PAI-Bench: A Comprehensive Benchmark For Physical AI
Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.
Artemis: Structured Visual Reasoning for Perception Policy Learning
Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision for proposal quality, and avoids ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection tasks, and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real-world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain-of-Ground (CoG), a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction, the model progressively reflects and adjusts its hypotheses, leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points. To measure real-world generalization, we introduce TPanel UI, a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI, Chain-of-Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points, showing the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning AAAI-2025
Controllable image semantic understanding tasks, such as captioning or segmentation, require users to input a prompt (e.g., text or bounding boxes) to predict a unique outcome, presenting challenges such as high-cost prompt input or limited information output. This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning), which aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs, allowing flexible result selection by users. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model that leverages structured scene graph features for correlated mask-caption prediction. Initially, we introduce a Prompt-Centric Scene Graph Adaptor to map a user's prompt to a scene graph, effectively capturing the user's intention. Subsequently, we employ a diffusion process incorporating a Scene Graph Guided Bimodal Transformer to predict correlated caption-mask pairs by uncovering intricate correlations between them. To ensure accurate alignment, we design a Multi-Entities Contrastive Learning loss to explicitly align visual and textual entities by considering inter-modal similarity, resulting in well-aligned caption-mask pairs. Extensive experiments conducted on two datasets demonstrate that SGDiff achieves superior performance in SegCaptioning, yielding promising results for both captioning and segmentation tasks with minimal prompt input.
comment: Accepted by AAAI-2025
SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation
Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment
Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
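The group-relative step that gives GRPO its name can be illustrated with a minimal sketch. This is not the GrndCtrl implementation: the function names, the equal reward weights, and the example scores below are assumptions made for illustration. It only shows the commonly described idea of normalizing each rollout's reward against the other rollouts sampled for the same context.

```python
import numpy as np


def combine_rewards(pose, depth, temporal, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-rollout reward terms (weights are hypothetical)."""
    terms = np.stack([np.asarray(pose, dtype=np.float64),
                      np.asarray(depth, dtype=np.float64),
                      np.asarray(temporal, dtype=np.float64)])
    return np.asarray(weights, dtype=np.float64) @ terms


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Four rollouts of the same context, scored by three illustrative rewards.
total = combine_rewards(pose=[0.2, 0.8, 0.5, 0.5],
                        depth=[0.1, 0.9, 0.4, 0.6],
                        temporal=[0.3, 0.7, 0.5, 0.5])
advantages = group_relative_advantages(total)
```

Rollouts that score above the group mean receive positive advantages and are reinforced; those below receive negative advantages, so no separate value network is needed in this formulation.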
Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning is a technique used to mitigate this issue by removing redundancy, but existing methods often ignore relevance to the user query or suffer from the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy compared to existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and 10x FLOP reduction, while retaining 96.88% of the original performance.
comment: Published in Transactions on Machine Learning Research, Project in https://01yzzyu.github.io/script.github.io/
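To make the notion of query-conditioned pruning concrete, the following is a generic sketch and not the Script method itself: it simply keeps the visual tokens whose embeddings have the highest cosine similarity to a query embedding. The function name, the similarity measure, and the toy vectors are assumptions for illustration.

```python
import numpy as np


def prune_tokens_by_query(visual_tokens, query_embedding, keep):
    """Keep the `keep` visual tokens most similar to the query.

    visual_tokens: array of shape (num_tokens, dim)
    query_embedding: array of shape (dim,)
    Returns the kept tokens and their original indices (in original order).
    """
    tokens = np.asarray(visual_tokens, dtype=np.float64)
    query = np.asarray(query_embedding, dtype=np.float64)
    # Cosine similarity between every token and the query.
    tokens_unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    query_unit = query / (np.linalg.norm(query) + 1e-12)
    scores = tokens_unit @ query_unit
    # Take the top-scoring indices, then restore positional order.
    kept = np.sort(np.argsort(-scores)[:keep])
    return tokens[kept], kept


tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
kept_tokens, kept_idx = prune_tokens_by_query(tokens, np.array([1.0, 0.0]), keep=2)
```

Sorting the kept indices preserves the original token order, which matters when positional information is attached to the tokens downstream.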
Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.
comment: 9 pages, 9 figures, 6 tables
Physical ID-Transfer Attacks against Multi-Object Tracking via Adversarial Trajectory ACSAC
Multi-Object Tracking (MOT) is a critical task in computer vision, with applications ranging from surveillance systems to autonomous driving. However, threats to MOT algorithms have yet to be widely studied. In particular, incorrect association between the tracked objects and their assigned IDs can lead to severe consequences, such as wrong trajectory predictions. Previous attacks against MOT either focused on hijacking the trackers of individual objects, or manipulating the tracker IDs in MOT by attacking the integrated object detection (OD) module in the digital domain, which are model-specific, non-robust, and only able to affect specific samples in offline datasets. In this paper, we present AdvTraj, the first online and physical ID-manipulation attack against tracking-by-detection MOT, in which an attacker uses adversarial trajectories to transfer its ID to a targeted object to confuse the tracking system, without attacking OD. Our simulation results in CARLA show that AdvTraj can fool ID assignments with 100% success rate in various scenarios for white-box attacks against SORT, which also have high attack transferability (up to 93% attack success rate) against state-of-the-art (SOTA) MOT algorithms due to their common design principles. We characterize the patterns of trajectories generated by AdvTraj and propose two universal adversarial maneuvers that can be performed by a human walker/driver in daily scenarios. Our work reveals under-explored weaknesses in the object association phase of SOTA MOT systems, and provides insights into enhancing the robustness of such systems.
comment: Accepted to Annual Computer Security Applications Conference (ACSAC) 2024
Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding
Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination, i.e., outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.
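For readers unfamiliar with visual contrastive decoding, the sketch below shows the generic form used in the natural-image literature, not Med-VCD's sparse variant: logits conditioned on the clean image are contrasted against logits conditioned on a distorted image, and tokens that are implausible under the clean image are masked out. The parameter names `alpha` and `beta`, their defaults, and the toy logits are assumptions for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def contrastive_next_token(logits_clean, logits_distorted, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting clean-image and distorted-image logits.

    alpha: contrast strength (alpha=0 reduces to greedy decoding on clean logits).
    beta:  plausibility cutoff; tokens whose clean-image probability falls
           below beta * max probability are never selected.
    """
    clean = np.asarray(logits_clean, dtype=np.float64)
    distorted = np.asarray(logits_distorted, dtype=np.float64)
    p_clean = softmax(clean)
    plausible = p_clean >= beta * p_clean.max()
    contrast = (1.0 + alpha) * clean - alpha * distorted
    contrast = np.where(plausible, contrast, -np.inf)
    return int(np.argmax(contrast))


# Token 0 is favored regardless of the image (a language prior);
# token 1 is supported mainly by the clean image.
clean = [2.0, 1.9, -5.0]
distorted = [2.0, 0.0, 5.0]
chosen = contrastive_next_token(clean, distorted)
```

In this toy case greedy decoding on the clean logits would pick token 0, while the contrast promotes token 1, whose score drops sharply once the image is distorted.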
Disentangling Progress in Medical Image Registration: Beyond Trend-Driven Architectures towards Domain-Specific Strategies
Medical image registration drives quantitative analysis across organs, modalities, and patient populations. Recent deep learning methods often combine low-level "trend-driven" computational blocks from computer vision, such as large-kernel CNNs, Transformers, and state-space models, with high-level registration-specific designs like motion pyramids, correlation layers, and iterative refinement. Yet, their relative contributions remain unclear and entangled. This raises a central question: should future advances in registration focus on importing generic architectural trends or on refining domain-specific design principles? Through a modular framework spanning brain, lung, cardiac, and abdominal registration, we systematically disentangle the influence of these two paradigms. Our evaluation reveals that low-level "trend-driven" computational blocks offer only marginal or inconsistent gains, while high-level registration-specific designs consistently deliver more accurate, smoother, and more robust deformations. These domain priors significantly elevate the performance of a standard U-Net baseline, far more than variants incorporating "trend-driven" blocks, achieving an average relative improvement of ~3%. All models and experiments are released within a transparent, modular benchmark that enables plug-and-play comparison for new architectures and registration tasks (https://github.com/BailiangJ/rethink-reg). This dynamic and extensible platform establishes a common ground for reproducible and fair evaluation, inviting the community to isolate genuine methodological contributions from domain priors. Our findings advocate a shift in research emphasis: from following architectural trends to embracing domain-specific design principles as the true drivers of progress in learning-based medical image registration.
comment: Submitted to Medical Image Analysis. Journal Extension of arXiv:2407.19274
SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception
Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
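The 30% figure in the SARL abstract follows directly from the two MAE values it reports. A short check, using only those two numbers (the helper name is illustrative):

```python
def relative_improvement(baseline_error, new_error):
    """Fractional error reduction relative to the baseline (lower error is better)."""
    return (baseline_error - new_error) / baseline_error


# MAE values quoted in the abstract: next-best SSL method vs. SARL.
improvement = relative_improvement(0.5682, 0.3955)
print(f"{improvement:.1%}")  # prints 30.4%
```

The reduction of 0.1727 MAE against a baseline of 0.5682 gives roughly 30.4%, consistent with the rounded 30% stated.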
StyleYourSmile: Cross-Domain Face Retargeting Without Paired Multi-Style Data
Cross-domain face retargeting requires disentangled control over identity, expressions, and domain-specific stylistic attributes. Existing methods, typically trained on real-world faces, either fail to generalize across domains, need test-time optimizations, or require fine-tuning with carefully curated multi-style datasets to achieve domain-invariant identity representations. In this work, we introduce StyleYourSmile, a novel one-shot cross-domain face retargeting method that eliminates the need for curated multi-style paired data. We propose an efficient data augmentation strategy alongside a dual-encoder framework for extracting domain-invariant identity cues and capturing domain-specific stylistic variations. Leveraging these disentangled control signals, we condition a diffusion model to retarget facial expressions across domains. Extensive experiments demonstrate that StyleYourSmile achieves superior identity preservation and retargeting fidelity across a wide range of visual domains.
comment: 15 pages, 14 figures
KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM
We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems requiring depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through a high-level features based adaptive robust kernel that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.
TransientTrack: Advanced Multi-Object Tracking and Classification of Cancer Cells with Transient Fluorescent Signals
Tracking cells in time-lapse videos is an essential technique for monitoring cell population dynamics at a single-cell level. Current methods for cell tracking are developed on videos with mostly single, constant signals and do not detect pivotal events such as cell death. Here, we present TransientTrack, a deep learning-based framework for cell tracking in multi-channel microscopy video data with transient fluorescent signals that fluctuate over time following processes such as the circadian rhythm of cells. By identifying key cellular events, namely mitosis (cell division) and apoptosis (cell death), our method allows us to build complete trajectories, including cell lineage information. TransientTrack is lightweight and performs matching on cell detection embeddings directly, without the need for quantification of tracking-specific cell features. Furthermore, our approach integrates Transformer Networks, multi-stage matching using all detection boxes, and the interpolation of missing tracklets with the Kalman Filter. This unified framework achieves strong performance across diverse conditions, effectively tracking cells and capturing cell division and death. We demonstrate the use of TransientTrack in an analysis of the efficacy of a chemotherapeutic drug at a single-cell level. The proposed framework could further advance quantitative studies of cancer cell dynamics, enabling detailed characterization of treatment response and resistance mechanisms. The code is available at https://github.com/bozeklab/TransientTrack.
comment: 13 pages, 7 figures, 2 tables. This work has been submitted to IEEE Transactions on Medical Imaging
COACH: Collaborative Agents for Contextual Highlighting - A Multi-Agent Framework for Sports Video Analysis AAAI 2026
Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct "cognitive tool" specializing in a specific aspect of analysis. The system's architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence. The project homepage is available at https://aiden1020.github.io/COACH-project-page
comment: Accepted by AAAI 2026 Workshop LaMAS
Register Any Point: Scaling 3D Point Cloud Registration by Flow Matching
Point cloud registration aligns multiple unposed point clouds into a common frame, and is a core step for 3D reconstruction and robot localization. In this work, we cast registration as conditional generation: a learned continuous, point-wise velocity field transports noisy points to a registered scene, from which the pose of each view is recovered. Unlike previous methods that conduct correspondence matching to estimate the transformation between a pair of point clouds and then optimize the pairwise transformations to realize multi-view registration, our model directly generates the registered point cloud. With a lightweight local feature extractor and test-time rigidity enforcement, our approach achieves state-of-the-art results on pairwise and multi-view registration benchmarks, particularly with low overlap, and generalizes across scales and sensor modalities. It further supports downstream tasks including relocalization, multi-robot SLAM, and multi-session map merging. Source code available at: https://github.com/PRBonn/RAP.
comment: 22 pages
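The abstract states that each view's pose is recovered from the generated, registered point cloud. One standard way to recover a rigid pose from corresponding point sets is the Kabsch (orthogonal Procrustes) solution; the sketch below assumes this approach and known point-to-point correspondence, and is not taken from the RAP code base.

```python
import numpy as np


def kabsch(source, target):
    """Least-squares rigid transform (R, t) with target ≈ source @ R.T + t.

    source, target: arrays of shape (num_points, 3) with row-wise correspondence.
    """
    source = np.asarray(source, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


# Synthetic check: rotate a random cloud by 30 degrees about z and translate it.
rng = np.random.default_rng(0)
view_points = rng.normal(size=(50, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 2.0, 3.0])
registered_points = view_points @ R_true.T + t_true
R_est, t_est = kabsch(view_points, registered_points)
```

Here `view_points` stands in for a view's points in its own frame and `registered_points` for the same points in the common frame; fitting a single rigid transform per view is also one simple way to enforce rigidity on a point-wise prediction.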
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.
comment: 17 pages, 8 figures
OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic
Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.
CauSight: Learning to Supersense for Visual Causal Discovery
Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: https://github.com/OpenCausaLab/CauSight.
comment: project page: https://github.com/OpenCausaLab/CauSight
InnoGym: Benchmarking the Innovation Potential of AI Agents
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
comment: Work in progress
Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling
Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
Forget Less, Retain More: A Lightweight Regularizer for Rehearsal-Based Continual Learning
Deep neural networks suffer from catastrophic forgetting, where performance on previous tasks degrades after training on a new task. This issue arises due to the model's tendency to overwrite previously acquired knowledge with new information. We present a novel approach to address this challenge, focusing on the intersection of memory-based methods and regularization approaches. We formulate a regularization strategy, termed the Information Maximization (IM) regularizer, for memory-based continual learning methods, which is based exclusively on the expected label distribution, thus making it class-agnostic. As a consequence, the IM regularizer can be directly integrated into various rehearsal-based continual learning methods, reducing forgetting and favoring faster convergence. Our empirical validation shows that, across datasets and regardless of the number of tasks, our proposed regularization strategy consistently improves baseline performance at the cost of minimal computational overhead. The lightweight nature of IM ensures that it remains a practical and scalable solution, making it applicable to real-world continual learning scenarios where efficiency is paramount. Finally, we demonstrate the data-agnostic nature of our regularizer by applying it to video data, which presents additional challenges due to its temporal structure and higher memory requirements. Despite the significant domain gap, our experiments show that the IM regularizer also improves the performance of video continual learning methods.
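The abstract does not give the regularizer's formula. As a minimal sketch of what a term "based exclusively on the expected label distribution" could look like, the code below uses the negative entropy of the batch-averaged softmax prediction. This specific form and the names `expected_label_distribution` and `im_regularizer` are assumptions for illustration, not the authors' definition.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one row of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_label_distribution(batch_logits):
    # Average the per-sample class probabilities over the batch.
    probs = [softmax(row) for row in batch_logits]
    n, k = len(probs), len(probs[0])
    return [sum(p[j] for p in probs) / n for j in range(k)]

def im_regularizer(batch_logits, eps=1e-12):
    # ASSUMED form: negative entropy of the mean prediction.
    # Minimizing it pushes the expected label distribution toward uniform.
    p_bar = expected_label_distribution(batch_logits)
    return sum(p * math.log(p + eps) for p in p_bar)

balanced = [[5.0, 0.0], [0.0, 5.0]]   # predictions spread over both classes
collapsed = [[5.0, 0.0], [5.0, 0.0]]  # predictions collapsed onto one class
print(im_regularizer(balanced), im_regularizer(collapsed))
```

Under this assumed form the term needs no class labels and only one average over the batch, which is consistent with the abstract's description of the regularizer as class-agnostic and lightweight.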
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. A comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) shows that specialized T2I models are proficient in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world knowledge internalization and generation.
comment: 35 pages, 12 figures, 10 tables
Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
SAM3-UNet: Simplified Adaptation of Segment Anything Model 3
In this paper, we introduce SAM3-UNet, a simplified variant of Segment Anything Model 3 (SAM3), designed to adapt SAM3 for downstream tasks at a low cost. Our SAM3-UNet consists of three components: a SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. Preliminary experiments on multiple tasks, such as mirror detection and salient object detection, demonstrate that the proposed SAM3-UNet outperforms the prior SAM2-UNet and other state-of-the-art methods, while requiring less than 6 GB of GPU memory during training with a batch size of 12. The code is publicly available at https://github.com/WZH0120/SAM3-UNet.
comment: Technical Report
Learned Image Compression for Earth Observation: Implications for Downstream Segmentation Tasks
The rapid growth of data from satellite-based Earth observation (EO) systems poses significant challenges in data transmission and storage. We evaluate the potential of task-specific learned compression algorithms in this context to reduce data volumes while retaining crucial information. Specifically, we compare traditional compression (JPEG 2000) with a learned compression approach (Discretized Mixed Gaussian Likelihood) on three EO segmentation tasks: fire, cloud, and building detection. Learned compression notably outperforms JPEG 2000 for large-scale, multi-channel optical imagery in both reconstruction quality (PSNR) and segmentation accuracy. However, traditional codecs remain competitive on smaller, single-channel thermal infrared datasets due to limited data and architectural constraints. Additionally, joint end-to-end optimization of compression and segmentation models does not improve performance over standalone optimization.
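For reference, the reconstruction-quality metric named above, PSNR, follows the standard definition 10 * log10(MAX^2 / MSE). The sketch below computes it for two flattened images. The function name `psnr` and the default 8-bit peak value of 255 are illustrative assumptions, since the abstract does not state the bit depth of the imagery.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    # Peak signal-to-noise ratio in dB between two equally sized,
    # flattened images. max_value is the peak pixel value (assumed 8-bit).
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

original = [100, 100, 100, 100]
decoded = [101, 99, 101, 99]  # every pixel off by one, so MSE = 1
print(round(psnr(original, decoded), 2))
```

Higher values mean the decoded image is closer to the original. For multi-channel imagery the same function can be applied to all channels flattened together or per channel, depending on the reporting convention.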
Evaluating SAM2 for Video Semantic Segmentation
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and transferring them temporally through memory blocks. While SAM2 excels in video object segmentation by providing dense segmentation masks based on prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 for VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges faced during this process. The first approach involves using SAM2 to extract unique objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach utilizes the predicted masks to extract unique feature vectors, which are then fed into a simple network for classification. The resulting classifications and masks are subsequently combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
comment: 17 pages, 3 figures and 7 tables
Robust Rigid and Non-Rigid Medical Image Registration Using Learnable Edge Kernels
Medical image registration is crucial for various clinical and research applications, including disease diagnosis and treatment planning, which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.
VideoScoop: A Non-Traditional Domain-Independent Framework For Video Analysis
Automatically understanding video contents is important for several applications in Civic Monitoring (CM), general Surveillance (SL), Assisted Living (AL), etc. Decades of Image and Video Analysis (IVA) research have advanced tasks such as content extraction (e.g., object recognition and tracking). Identifying meaningful activities or situations (e.g., two objects coming closer) remains difficult and cannot be achieved by content extraction alone. Currently, Video Situation Analysis (VSA) is done manually with a human in the loop, which is error-prone and labor-intensive, or through custom algorithms designed for specific video types or situations. These algorithms are not general-purpose and require a new algorithm/software for each new situation or video from a new domain. This report proposes a general-purpose VSA framework that overcomes the above limitations. Video contents are extracted once using state-of-the-art Video Content Extraction technologies. They are represented using two alternative models -- the extended relational model (R++) and graph models. When represented using R++, the extracted contents can be used as data streams, enabling Continuous Query Processing via the proposed Continuous Query Language for Video Analysis. The graph models complement this by enabling the detection of situations that are difficult or impossible to detect using the relational model alone. Existing graph algorithms and newly developed algorithms support a wide variety of situation detection. To support domain independence, primitive situation variants across domains are identified and expressed as parameterized templates. Extensive experiments were conducted across several interesting situations from three domains -- AL, CM, and SL -- to evaluate the accuracy, efficiency, and robustness of the proposed approach using a dataset of videos of varying lengths from these domains.
comment: This is a report submitted as part of PhD proposal defense of Hafsa Billah
HiconAgent: History Context-aware Policy Optimization for GUI Agents
Graphical User Interface (GUI) agents require effective use of historical context to perform sequential navigation tasks. While incorporating past actions and observations can improve decision making, naive use of full history leads to excessive computational overhead and distraction from irrelevant information. To address this, we introduce HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) for efficient and effective utilization of historical information. HCPO optimizes history usage in both sampling and policy updates through two complementary components: (1) Dynamic Context Sampling (DCS) presents the agent with variable length histories during sampling, enabling adaptive use of the most relevant context; (2) Anchor-guided History Compression (AHC) refines the policy update phase with a dual branch strategy where the compressed branch removes history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage while maintaining efficiency. Experiments on mainstream GUI navigation benchmarks demonstrate strong performance. Despite being smaller, HiconAgent-3B outperforms GUI-R1-7B by +8.46 percent grounding accuracy and +11.32 percent step success rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW with up to 2.47x computational speedup and 60 percent FLOPs reduction.
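The two history operations described above can be illustrated with a small data-structure sketch. The `Step` type, the uniform draw over history lengths, and the choice to keep the most recent steps are assumptions made for illustration. The abstract specifies only that histories have variable length during sampling and that the compressed branch drops observations while keeping actions.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    observation: Optional[str]  # e.g. a screenshot reference (hypothetical field)
    action: str                 # e.g. "tap(login)" (hypothetical format)

def sample_history(history: List[Step], rng: random.Random) -> List[Step]:
    # Variable-length context: keep the k most recent steps, with k drawn
    # uniformly from 0..len(history). The uniform draw is an assumption.
    k = rng.randint(0, len(history))
    return history[len(history) - k:]

def compress_history(history: List[Step]) -> List[Step]:
    # Compressed branch: remove observations, keep actions as anchors.
    return [Step(observation=None, action=s.action) for s in history]

history = [Step("screen_1", "tap(search)"),
           Step("screen_2", "type(weather)"),
           Step("screen_3", "tap(submit)")]
sampled = sample_history(history, random.Random(0))
compressed = compress_history(history)
print(len(sampled), [s.action for s in compressed])
```

Dropping observations is what reduces cost in this sketch, because screenshots typically dominate the token budget of a GUI agent's context while action strings are short.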
FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
comment: Project page: https://streamgaze.github.io/
SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation AAAI 2026
In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
comment: Accepted in AAAI 2026
Revisiting Direct Encoding: Learnable Temporal Dynamics for Static Image Spiking Neural Networks
Handling static images that lack inherent temporal dynamics remains a fundamental challenge for spiking neural networks (SNNs). In directly trained SNNs, static inputs are typically repeated across time steps, causing the temporal dimension to collapse into a rate-like representation and preventing meaningful temporal modeling. This work revisits the reported performance gap between direct and rate-based encodings and shows that it primarily stems from convolutional learnability and surrogate gradient formulations rather than the encoding schemes themselves. To illustrate this mechanism-level clarification, we introduce a minimal learnable temporal encoding that adds adaptive phase shifts to induce meaningful temporal variation from static inputs.
DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at https://yj7082126.github.io/dreamingcomics/
Cross-Domain Validation of a Resection-Trained Self-Supervised Model on Multicentre Mesothelioma Biopsies
Accurate subtype classification and outcome prediction in mesothelioma are essential for guiding therapy and patient care. Most computational pathology models are trained on large tissue images from resection specimens, limiting their use in real-world settings where small biopsies are common. We show that a self-supervised encoder trained on resection tissue can be applied to biopsy material, capturing meaningful morphological patterns. Using these patterns, the model can predict patient survival and classify tumor subtypes. This approach demonstrates the potential of AI-driven tools to support diagnosis and treatment planning in mesothelioma.
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, they face a dilemma: neither 2D nor 3D representations can simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.
GRASP: Guided Residual Adapters with Sample-wise Partitioning
Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of data. This results in mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflicts between frequent head and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods that mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient clashes. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into transformer feedforward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines like vanilla fine-tuning and Mixture of Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels. Generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with diffusion pipelines.
comment: 10 pages, 4 figures, 6 tables
Bridging the Scale Gap: Balanced Tiny and General Object Detection in Remote Sensing Imagery
Tiny object detection in remote sensing imagery has attracted significant research interest in recent years. Despite recent progress, achieving balanced detection performance across diverse object scales remains a formidable challenge, particularly in scenarios where dense tiny objects and large objects coexist. Although large foundation models have revolutionized general vision tasks, their application to tiny object detection remains unexplored due to the extreme scale variation and density distribution inherent to remote sensing imagery. To bridge this scale gap, we propose ScaleBridge-Det, which is, to the best of our knowledge, the first large detection framework designed for tiny objects. It achieves balanced performance across diverse scales through scale-adaptive expert routing and density-guided query allocation. Specifically, we introduce a Routing-Enhanced Mixture Attention (REM) module that dynamically selects and fuses scale-specific expert features via adaptive routing to address the tendency of standard MoE models to favor dominant scales. REM generates complementary and discriminative multi-scale representations suitable for both tiny and large objects. Furthermore, we present a Density-Guided Dynamic Query (DGQ) module that predicts object density to adaptively adjust query positions and numbers, enabling efficient resource allocation for objects of varying scales. The proposed framework allows ScaleBridge-Det to simultaneously optimize performance for both dense tiny and general objects without trade-offs. Extensive experiments on benchmark and cross-domain datasets demonstrate that ScaleBridge-Det achieves state-of-the-art performance on AI-TOD-V2 and DTOD, while exhibiting superior cross-domain robustness on VisDrone.
DB-KAUNet: An Adaptive Dual Branch Kolmogorov-Arnold UNet for Retinal Vessel Segmentation
Accurate segmentation of retinal vessels is crucial for the clinical diagnosis of numerous ophthalmic and systemic diseases. However, traditional Convolutional Neural Network (CNN) methods exhibit inherent limitations, struggling to capture long-range dependencies and complex nonlinear relationships. To address the above limitations, an Adaptive Dual Branch Kolmogorov-Arnold UNet (DB-KAUNet) is proposed for retinal vessel segmentation. In DB-KAUNet, we design a Heterogeneous Dual-Branch Encoder (HDBE) that features parallel CNN and Transformer pathways. The HDBE strategically interleaves standard CNN and Transformer blocks with novel KANConv and KAT blocks, enabling the model to form a comprehensive feature representation. To optimize feature processing, we integrate several critical components into the HDBE. First, a Cross-Branch Channel Interaction (CCI) module is embedded to facilitate efficient interaction of channel features between the parallel pathways. Second, an attention-based Spatial Feature Enhancement (SFE) module is employed to enhance spatial features and fuse the outputs from both branches. Building upon the SFE module, an advanced Spatial Feature Enhancement with Geometrically Adaptive Fusion (SFE-GAF) module is subsequently developed. In the SFE-GAF module, adaptive sampling is utilized to focus on true vessel morphology precisely. The adaptive process strengthens salient vascular features while significantly reducing background noise and computational overhead. Extensive experiments on the DRIVE, STARE, and CHASE_DB1 datasets validate that DB-KAUNet achieves leading segmentation performance and demonstrates exceptional robustness.
ViT$^3$: Unlocking Test-Time Training in Vision
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.
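The joint parameters mentioned above (axis and origin) determine how a part moves. As a minimal sketch of forward kinematics for a single revolute joint, the code below rotates a point about the joint axis using Rodrigues' rotation formula. It is a generic illustration, not SPARK's implementation, and the function name `rotate_about_axis` is hypothetical.

```python
import math

def rotate_about_axis(point, axis, origin, angle):
    # Rotate `point` by `angle` radians about the line that passes through
    # `origin` in direction `axis` (Rodrigues' rotation formula).
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("joint axis must be non-zero")
    k = [a / norm for a in axis]                 # unit joint axis
    v = [p - o for p, o in zip(point, origin)]   # point relative to joint origin
    c, s = math.cos(angle), math.sin(angle)
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    rotated = [v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
               for i in range(3)]
    return [r + o for r, o in zip(rotated, origin)]

# A door corner at (2, 0, 0) hinged on a vertical axis through (1, 0, 0),
# opened by 90 degrees.
opened = rotate_about_axis([2.0, 0.0, 0.0], [0.0, 0.0, 2.0],
                           [1.0, 0.0, 0.0], math.pi / 2)
print([round(x, 6) for x in opened])
```

Every operation here is differentiable in the axis, origin, and angle, which is why a mapping of this kind can sit inside a gradient-based refinement loop when written in an autodiff framework.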
Depth Matching Method Based on ShapeDTW for Oil-Based Mud Imager
In well logging operations using the oil-based mud (OBM) microresistivity imager, which employs an interleaved design with upper and lower pad sets, depth misalignment issues persist between the pad images even after velocity correction. This paper presents a depth matching method for borehole images based on the Shape Dynamic Time Warping (ShapeDTW) algorithm. The method extracts local shape features to construct a morphologically sensitive distance matrix, better preserving structural similarity between sequences during alignment. We implement this by employing a combined feature set of the one-dimensional Histogram of Oriented Gradients (HOG1D) and the original signal as the shape descriptor. Field test examples demonstrate that our method achieves precise alignment for images with complex textures, depth shifts, or local scaling. Furthermore, it provides a flexible framework for feature extension, allowing the integration of other descriptors tailored to specific geological features.
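The alignment idea can be sketched as dynamic time warping over per-sample shape descriptors instead of raw values. For brevity, this sketch replaces HOG1D with first differences of the local window, concatenated with the raw window. That substitution, the window size, and all function names are assumptions rather than the paper's implementation.

```python
import math

def shape_descriptors(signal, half_width=2):
    # Descriptor per sample: edge-padded local window plus its first
    # differences (a simple stand-in for HOG1D).
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - half_width, i + half_width + 1)]
        gradient = [window[k + 1] - window[k] for k in range(len(window) - 1)]
        out.append(window + gradient)
    return out

def dtw_path(a, b):
    # Classic DTW on descriptor sequences; returns (total cost, warping path).
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack along the cheapest predecessors
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

def shape_dtw(x, y, half_width=2):
    return dtw_path(shape_descriptors(x, half_width),
                    shape_descriptors(y, half_width))

upper_pad = [0, 0, 1, 3, 1, 0, 0, 0]
lower_pad = [0, 0, 0, 1, 3, 1, 0, 0]  # same feature, one sample deeper
total_cost, path = shape_dtw(upper_pad, lower_pad)
print(path)
```

Each pair `(i, j)` in the path maps sample `i` of the first trace to sample `j` of the second, so the per-sample depth shift can be read off as `j - i`. Swapping in a different descriptor only requires changing `shape_descriptors`.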
Toward Content-based Indexing and Retrieval of Head and Neck CT with Abscess Segmentation
Abscesses in the head and neck represent an acute infectious process that can potentially lead to sepsis or mortality if not diagnosed and managed promptly. Accurate detection and delineation of these lesions on imaging are essential for diagnosis, treatment planning, and surgical intervention. In this study, we introduce AbscessHeNe, a curated and comprehensively annotated dataset comprising 4,926 contrast-enhanced CT slices with clinically confirmed head and neck abscesses. The dataset is designed to facilitate the development of robust semantic segmentation models that can accurately delineate abscess boundaries and evaluate deep neck space involvement, thereby supporting informed clinical decision-making. To establish performance baselines, we evaluate several state-of-the-art segmentation architectures, including CNN, Transformer, and Mamba-based models. The highest-performing model achieved a Dice Similarity Coefficient of 0.39, Intersection-over-Union of 0.27, and Normalized Surface Distance of 0.67, indicating the challenges of this task and the need for further research. Beyond segmentation, AbscessHeNe is structured for future applications in content-based multimedia indexing and case-based retrieval. Each CT scan is linked with pixel-level annotations and clinical metadata, providing a foundation for building intelligent retrieval systems and supporting knowledge-driven clinical workflows. The dataset will be made publicly available at https://github.com/drthaodao3101/AbscessHeNe.git.
comment: The 2025 IEEE International Conference on Content-Based Multimedia Indexing (IEEE CBMI)
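For readers unfamiliar with the overlap metrics quoted in the abstract above, here is a minimal sketch of how the Dice Similarity Coefficient and Intersection-over-Union are commonly computed for one pair of binary masks. The function name, the nested-list mask format, and the empty-mask convention are assumptions for illustration; the authors' evaluation code may differ.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as equally sized nested lists of 0/1."""
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    pred_area = sum(1 for p in pred_flat if p)
    target_area = sum(1 for t in target_flat if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; scoring this as perfect agreement is a convention assumed here.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou

# Toy 3x3 masks: three predicted pixels, three reference pixels, two in common.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 1],
          [0, 0, 0]]
dice, iou = dice_and_iou(pred, target)
```

For a single mask pair the two scores are tied by Dice = 2·IoU / (1 + IoU). Scores averaged over many slices or cases need not satisfy that identity exactly.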
RoleMotion: A Large-Scale Dataset towards Robust Scene-Specific Role-Playing Motion Synthesis with Fine-grained Descriptions
In this paper, we introduce RoleMotion, a large-scale human motion dataset that encompasses a wealth of role-playing and functional motion data tailored to fit various specific scenes. Existing text datasets are mostly assembled in a decentralized way, as amalgamations of assorted subsets whose data are non-functional and too isolated to work together in covering social activities across various scenes. Also, the quality of motion data is inconsistent, and textual annotation lacks fine-grained details in these datasets. In contrast, RoleMotion is meticulously designed and collected with a particular focus on scenes and roles. The dataset features 25 classic scenes, 110 functional roles, over 500 behaviors, and 10296 high-quality human motion sequences of body and hands, annotated with 27831 fine-grained text descriptions. We build an evaluator stronger than existing counterparts, prove its reliability, and evaluate various text-to-motion methods on our dataset. Finally, we explore the interplay of motion generation of body and hands. Experimental results demonstrate the high quality and functionality of our dataset on text-driven whole-body generation.
MasHeNe: A Benchmark for Head and Neck CT Mass Segmentation using Window-Enhanced Mamba with Frequency-Domain Integration
Head and neck masses are space-occupying lesions that can compress the airway and esophagus and may affect nerves and blood vessels. Available public datasets primarily focus on malignant lesions and often overlook other space-occupying conditions in this region. To address this gap, we introduce MasHeNe, an initial dataset of 3,779 contrast-enhanced CT slices that includes both tumors and cysts with pixel-level annotations. We also establish a benchmark using standard segmentation baselines and report common metrics to enable fair comparison. In addition, we propose the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the input appearance before feature extraction. It further uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. On MasHeNe, WEMF attains the best performance among evaluated methods, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm. These results indicate stable and strong performance on this challenging task. MasHeNe provides a benchmark for head-and-neck mass segmentation beyond malignancy-only datasets. The observed error patterns also suggest that this task remains challenging and requires further research. Our dataset and code are available at https://github.com/drthaodao3101/MasHeNe.git.
comment: The 14th International Symposium on Information and Communication Technology Conference SoICT 2025
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmarks that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.
FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at https://wzpscott.github.io/flashvggt_page/.
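The scaling argument in the abstract above can be made concrete with a small back-of-the-envelope sketch that counts query-key score entries for dense self-attention versus cross-attention to a compressed descriptor set. The per-frame token and descriptor counts below are hypothetical values chosen for illustration, not figures from the paper, and the count ignores every other cost in the network.

```python
def attention_score_counts(num_frames, tokens_per_frame, descriptors_per_frame):
    """Count query-key score entries for two global-attention layouts.

    full:       every image token attends to every image token.
    compressed: every image token attends only to the descriptor tokens.
    """
    total_tokens = num_frames * tokens_per_frame
    total_descriptors = num_frames * descriptors_per_frame
    full = total_tokens * total_tokens
    compressed = total_tokens * total_descriptors
    return full, compressed

# Hypothetical setting: 1,000 frames, 1,024 tokens and 64 descriptors per frame.
full, compressed = attention_score_counts(1000, 1024, 64)
ratio = compressed / full  # equals descriptors_per_frame / tokens_per_frame
```

Under these assumed sizes the score matrix shrinks by the ratio of descriptors to tokens per frame. The 9.3% inference-time figure reported in the abstract is a measured end-to-end number and is not derived from this count.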
Deep Unsupervised Anomaly Detection in Brain Imaging: Large-Scale Benchmarking and Bias Analysis
Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1 and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, systematic biases were observed for the majority of algorithms: scanner-related effects were present, small and low-contrast lesions were missed more often, and false positives varied with age and sex. Increasing healthy training data yields only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image-native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
Diffusion Fuzzy System: Fuzzy Rule Guided Latent Multi-Path Diffusion Modeling
Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle in managing image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS uses multiple diffusion paths, each dedicated to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.
QuantumCanvas: A Multimodal Benchmark for Visual Learning of Atomic Interactions
Despite rapid advances in molecular and materials machine learning, most models still lack physical transferability: they fit correlations across whole molecules or crystals rather than learning the quantum interactions between atomic pairs. Yet bonding, charge redistribution, orbital hybridization, and electronic coupling all emerge from these two-body interactions that define local quantum fields in many-body systems. We introduce QuantumCanvas, a large-scale multimodal benchmark that treats two-body quantum systems as foundational units of matter. The dataset spans 2,850 element-element pairs, each annotated with 18 electronic, thermodynamic, and geometric properties and paired with ten-channel image representations derived from l- and m-resolved orbital densities, angular field transforms, co-occupancy maps, and charge-density projections. These physically grounded images encode spatial, angular, and electrostatic symmetries without explicit coordinates, providing an interpretable visual modality for quantum learning. Benchmarking eight architectures across 18 targets, we report mean absolute errors of 0.201 eV on energy gap using GATv2, 0.265 eV on HOMO and 0.274 eV on LUMO using EGNN. For energy-related quantities, DimeNet attains 2.27 eV total-energy MAE and 0.132 eV repulsive-energy MAE, while a multimodal fusion model achieves a 2.15 eV Mermin free-energy MAE. Pretraining on QuantumCanvas further improves convergence stability and generalization when fine-tuned on larger datasets such as QM9, MD17, and CrysMTM. By unifying orbital physics with vision-based representation learning, QuantumCanvas provides a principled and interpretable basis for learning transferable quantum interactions through coupled visual and numerical modalities. Dataset and model implementations are available at https://github.com/KurbanIntelligenceLab/QuantumCanvas.
Semantic-aware Random Convolution and Source Matching for Domain Generalization in Medical Image Segmentation
We tackle the challenging problem of single-source domain generalization (DG) for medical image segmentation. To this end, we aim to train a network on one domain (e.g., CT) and directly apply it to a different domain (e.g., MR) without adapting the model and without requiring images or annotations from the new domain during training. We propose a novel method for promoting DG when training deep segmentation networks, which we call SRCSM. During training, our method diversifies the source domain through semantic-aware random convolution, where different regions of a source image are augmented differently, based on their annotation labels. At test time, we complement the randomization of the training domain by mapping the intensity of target domain images, making them similar to source domain data. We perform a comprehensive evaluation on a variety of cross-modality and cross-center generalization settings for abdominal, whole-heart and prostate segmentation, where we outperform previous DG techniques in a vast majority of experiments. We also investigate our method when training on whole-heart CT or MR data and testing on the diastolic and systolic phases of cine MR data captured with different scanner hardware, where we make a step towards closing the domain gap in this even more challenging setting. Overall, our evaluation shows that SRCSM can be considered a new state-of-the-art in DG for medical image segmentation and, moreover, even achieves a segmentation performance that matches the performance of the in-domain baseline in several settings.
comment: Preprint submitted to Computer Methods and Programs in Biomedicine (currently under revision)
ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
Video instance segmentation (VIS) for low-light content remains highly challenging for humans and machines alike, due to adverse imaging conditions including noise, blur and low contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce ELVIS (Enhance Low-light for Video Instance Segmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.
A variational method for curve extraction with curvature-dependent energies
We introduce a variational approach for extracting curves between a list of possible endpoints, based on the discretization of an energy and Smirnov's decomposition theorem for vector fields. It is used to design a mostly unsupervised bi-level minimization approach that automatically extracts curves and 1D structures from an image. We then extend the method to curvature-dependent energies, using a now classical lifting of the curves in the space of positions and orientations equipped with an appropriate sub-Riemannian or Finslerian metric.
ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling
Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent, high-fidelity, time-synchronized multi-view videos, a pivotal capability for taming 4D worlds, remains challenging. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method comprising a World State Hyperspace, which represents the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling, which synchronizes the diffusion sampling trajectories of multiple views using the hyperspace. Experimental results demonstrate that our method achieves high-fidelity, 3D-consistent, time-synchronized multi-view video generation without training or fine-tuning the diffusion models.
CourtMotion: Learning Event-Driven Motion Representations from Skeletal Data for Basketball
This paper presents CourtMotion, a spatiotemporal modeling framework for analyzing and predicting game events and plays as they develop in professional basketball. Anticipating basketball events requires understanding both physical motion patterns and their semantic significance in the context of the game. Traditional approaches that use only player positions fail to capture crucial indicators such as body orientation, defensive stance, or shooting preparation motions. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns, then employs a Transformer architecture with specialized attention mechanisms to model player interactions. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their tactical purposes. Experiments on NBA tracking data demonstrate significant improvements over position-only baselines: 35% reduction in trajectory prediction error compared to state-of-the-art position-based models and consistent performance gains across key basketball analytics tasks. The resulting pretrained model serves as a powerful foundation for multiple downstream tasks, with pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition demonstrating substantial improvements over existing methods.
Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging
Model merging has emerged as a promising paradigm for enabling multi-task capabilities without additional training. However, existing methods often experience substantial performance degradation compared with individually fine-tuned models, even on similar tasks, underscoring the need to preserve task-specific information. This paper proposes Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework that preserves task-specific information with minimal storage overhead. DTS first applies singular value decomposition to the task-specific information and retains only a small subset of singular values and vectors. It then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns a scaling factor to each group. To enable generalization to unseen tasks, we further extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics. Extensive experiments demonstrate that DTS consistently outperforms state-of-the-art baselines while requiring only 1% additional storage per task. Furthermore, experiments on unseen tasks show that the DTS variant achieves significantly better generalization performance. Our code is available at https://github.com/krumpguo/DTS.
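To give a feel for the storage side of a truncated singular value decomposition, the sketch below counts the values kept by a rank-k factorization of one weight matrix and finds the largest rank that fits a given storage budget. It covers only the decomposition step: the thresholding and scaling steps are not modeled because the abstract does not detail them. The matrix size, the budget, and the function names are assumptions for illustration.

```python
def truncated_svd_storage(rows, cols, rank):
    """Values stored by a rank-`rank` SVD: `rank` left vectors, singular values, right vectors."""
    return rank * (rows + cols + 1)

def max_rank_within_budget(rows, cols, budget_fraction):
    """Largest rank whose factors fit within `budget_fraction` of the dense matrix size."""
    dense_values = rows * cols
    allowed_values = int(budget_fraction * dense_values)
    return allowed_values // (rows + cols + 1)

# Hypothetical 1024 x 1024 layer with a 1% storage budget.
rows, cols = 1024, 1024
rank = max_rank_within_budget(rows, cols, 0.01)
stored_fraction = truncated_svd_storage(rows, cols, rank) / (rows * cols)
```

Under these assumptions only a handful of singular triplets fit in the budget, which suggests why a scheme that compresses the retained vectors further could be attractive.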
FastAnimate: Towards Learnable Template Construction and Pose Deformation for Fast 3D Human Avatar Animation
3D human avatar animation aims at transforming a human avatar from an arbitrary initial pose to a specified target pose using deformation algorithms. Existing approaches typically divide this task into two stages: canonical template construction and target pose deformation. However, current template construction methods demand extensive skeletal rigging and often produce artifacts for specific poses. Moreover, target pose deformation suffers from structural distortions caused by Linear Blend Skinning (LBS), which significantly undermines animation realism. To address both problems, we propose a unified learning-based framework with two phases. For the former phase, to overcome the inefficiencies and artifacts during template construction, we leverage a U-Net architecture that decouples texture and pose information in a feed-forward process, enabling fast generation of a human template. For the latter phase, we propose a data-driven refinement technique that enhances structural integrity. Extensive experiments show that our model delivers consistent performance across diverse poses with an optimal balance between efficiency and quality, surpassing state-of-the-art (SOTA) methods.
comment: 9 pages, 4 figures
Language-Guided Open-World Anomaly Segmentation
Open-world and anomaly segmentation methods seek to enable autonomous driving systems to detect and segment both known and unknown objects in real-world scenes. However, existing methods do not assign semantically meaningful labels to unknown regions, and distinguishing and learning representations for unknown classes remains difficult. While open-vocabulary segmentation methods show promise in generalizing to novel classes, they require a fixed inference vocabulary and thus cannot be directly applied to anomaly segmentation where unknown classes are unconstrained. We propose Clipomaly, the first CLIP-based open-world and anomaly segmentation method for autonomous driving. Our zero-shot approach requires no anomaly-specific training data and leverages CLIP's shared image-text embedding space to both segment unknown objects and assign human-interpretable names to them. Unlike open-vocabulary methods, our model dynamically extends its vocabulary at inference time without retraining, enabling robust detection and naming of anomalies beyond common class definitions such as those in Cityscapes. Clipomaly achieves state-of-the-art performance on established anomaly segmentation benchmarks while providing interpretability and flexibility essential for practical deployment.
ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.
comment: 8 pages
ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs' ability to identify and correct these video reasoning errors. To bridge this gap, we propose ViRectify, a comprehensive benchmark to evaluate their fine-grained correction capability. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. We further propose a trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction. It encourages the model to explicitly concentrate on error propagation and key timestamps for correction. Extensive evaluation across 16 advanced MLLMs demonstrates that ViRectify serves as a challenging testbed, where GPT-5 achieves only 31.94% correction accuracy. Our framework enables Qwen2.5-VL-7B to consistently outperform the 72B variants on ViRectify, showing the effectiveness of our approach. Further analysis uncovers systematic asymmetries in error correction across models, and our dataset is also a valuable resource for reflection learning. We believe ViRectify provides a new direction for comprehensively evaluating advanced MLLMs in video reasoning.
comment: 22 pages, 11 figures
MDiff4STR: Mask Diffusion Model for Scene Text Recognition AAAI 2026
Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: a noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, both with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.
comment: Accepted by AAAI 2026 (Oral)
Rice-VL: Evaluating Vision-Language Models for Cultural Understanding Across ASEAN Countries
Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples -- covering True or False, Fill-in-the-Blank, and open-ended formats -- and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models' ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs' cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.
comment: 14 pages
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
comment: Please visit our project page at https://cmlab-korea.github.io/FRAMER/
PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications WACV2026
Understanding dynamic 4D environments (3D space evolving over time) is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Extensive evaluations across 9 tasks on 7 datasets demonstrate consistent improvements across diverse domains. We further demonstrate PointNet4D's utility by building two robotic application systems: 4D Diffusion Policy and 4D Imitation Learning, achieving substantial gains on the RoboTwin and HandoverSim benchmarks.
comment: Accepted by WACV2026
Reversible Inversion for Training-Free Exemplar-guided Image Editing
Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. In addition, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.
Textured Geometry Evaluation: Perceptual 3D Textured Shape Metric via 3D Latent-Geometry Network AAAI26
Textured high-fidelity 3D models are crucial for games, AR/VR, and film, but human-aligned evaluation methods still fall behind despite recent advances in 3D reconstruction and generation. Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. However, these approaches face limitations due to incomplete structural coverage and sensitivity to viewpoint choices. Moreover, most methods are trained on synthetic distortions, which differ significantly from real-world distortions, resulting in a domain gap. To address these challenges, we propose a new fidelity evaluation method that is based directly on 3D meshes with texture, without relying on rendering. Our method, named Textured Geometry Evaluation (TGE), jointly uses the geometry and color information to calculate the fidelity of the input textured mesh in comparison to a reference colored shape. To train and evaluate our metric, we design a human-annotated dataset with real-world distortions. Experiments show that TGE outperforms rendering-based and geometry-only methods on a real-world distortion dataset.
comment: Accepted by AAAI26
SRAM: Shape-Realism Alignment Metric for No Reference 3D Shape Evaluation AAAI2026
3D generation and reconstruction techniques have been widely used in computer games, film, and other content creation areas. As these applications expand, there is growing demand for 3D shapes that look truly realistic. Traditional evaluation methods rely on a ground truth to measure mesh fidelity. However, in many practical cases, a shape's realism does not depend on having a ground truth reference. In this work, we propose a Shape-Realism Alignment Metric that leverages a large language model (LLM) as a bridge between mesh shape information and realism evaluation. To achieve this, we adopt a mesh encoding approach that converts 3D shapes into the language token space. A dedicated realism decoder is designed to align the language model's output with human perception of realism. Additionally, we introduce a new dataset, RealismGrading, which provides human-annotated realism scores without the need for ground truth shapes. Our dataset includes shapes generated by 16 different algorithms on over a dozen objects, making it more representative of practical 3D shape distributions. We validate our metric's performance and generalizability through k-fold cross-validation across different objects. Experimental results show that our metric correlates well with human perception, outperforms existing methods, and generalizes well.
comment: Accepted by AAAI2026
BlinkBud: Detecting Hazards from Behind via Sampled Monocular 3D Detection on a Single Earbud
Failing to notice speeding vehicles approaching from behind poses a serious threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which uses a single earbud and a paired phone to detect, online, hazardous objects approaching a user from behind. The core idea is to accurately track visually identified objects using a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is significantly reduced by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user's body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight, with ultra-low mean power consumption of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
comment: This is the author-accepted version of the paper published in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), Vol. 9, No. 4, Article 191, 2025. Final published version: https://doi.org/10.1145/3770707
VIVAT: Virtuous Improving VAE Training through Artifact Mitigation
Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts - color shift, grid patterns, blur, corner and droplet artifacts - and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.
STORM: Segment, Track, and Object Re-Localization from a Single Image
Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically require a pre-defined 3D model of the target and rely on a manually annotated segmentation mask in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single iMage), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and produce precise masks and 3D models for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.
Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models
Continuous perception, the ability to integrate visual observations over time in a continuous stream fashion, is essential for robust real-world understanding, yet remains largely untested in current multimodal models. We introduce CP-Bench, a minimal and fully controlled benchmark designed to isolate this capability using an extremely simple task: counting identical cubes in a synthetic scene while the camera moves and only reveals subsets of objects at any moment. Despite the simplicity of the setting, we find that state-of-the-art open-source and commercial models, including Qwen-3-VL, InternVL3, GPT-5, and Gemini-3-Pro, fail dramatically. A static-camera control variant confirms that the failure arises not from object recognition but from an inability to accumulate evidence across time. Further experiments show that neither higher sampling FPS, perception- or spatial-enhanced models, nor finetuning with additional videos leads to meaningful cross-temporal generalization. Our results reveal a fundamental limitation in modern multimodal architectures and training paradigms. CP-Bench provides a simple yet powerful diagnostic tool and establishes a clean testbed for developing models capable of genuine time-consistent visual reasoning.
GBT-SAM: A Parameter-Efficient Depth-Aware Model for Generalizable Brain tumour Segmentation on mp-MRI
Gliomas are aggressive brain tumors that require accurate imaging-based diagnosis, with segmentation playing a critical role in evaluating morphology and treatment decisions. Manual delineation of gliomas is time-consuming and prone to variability, motivating the use of deep learning to improve consistency and alleviate clinical workload. However, existing methods often fail to fully exploit the information available in multi-parametric MRI (mp-MRI), particularly inter-slice contextual features, and typically require considerable computational resources while lacking robustness across tumor type variations. We present GBT-SAM, a parameter-efficient deep learning framework that adapts the Segment Anything Model (SAM), a large-scale vision model, to volumetric mp-MRI data. GBT-SAM reduces input complexity by selecting fewer than 2.6% of slices per scan while incorporating all four MRI modalities, preserving essential tumor-related information with minimal cost. Furthermore, our model is trained by a two-step fine-tuning strategy that incorporates a depth-aware module to capture inter-slice correlations and lightweight adaptation layers, resulting in just 6.5M trainable parameters, which is the lowest among SAM-based approaches. GBT-SAM achieves a Dice Score of 93.54 on the BraTS Adult Glioma dataset and demonstrates robust performance on Meningioma, Pediatric Glioma, and Sub-Saharan Glioma datasets. These results highlight GBT-SAM's potential as a computationally efficient and domain-robust framework for brain tumor segmentation using mp-MRI. Our code and models are available at https://github.com/vpulab/med-sam-brain.
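For readers unfamiliar with the Dice Score reported above, the following is a minimal sketch of the standard Dice coefficient on binary masks. It is not taken from the GBT-SAM code base; the function name `dice_score`, the flat 0/1 mask layout, and the empty-mask convention are assumptions made here. The reported 93.54 is presumably this quantity on a 0 to 100 scale.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [0, 1, 1, 1, 0, 0]    # hypothetical predicted tumor mask
target = [0, 1, 1, 0, 1, 0]  # hypothetical ground-truth mask
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```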
DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/
SCOPE-MRI: Bankart Lesion Detection as a Case Study in Data Curation and Deep Learning for Challenging Diagnoses
Deep learning has shown strong performance in musculoskeletal imaging, but prior work has largely targeted conditions where diagnosis is relatively straightforward. More challenging problems remain underexplored, such as detecting Bankart lesions (anterior-inferior glenoid labral tears) on standard MRIs. These lesions are difficult to diagnose due to subtle imaging features, often necessitating invasive MRI arthrograms (MRAs). We introduce ScopeMRI, the first publicly available, expert-annotated dataset for shoulder pathologies, and present a deep learning framework for Bankart lesion detection on both standard MRIs and MRAs. ScopeMRI contains shoulder MRIs from patients who underwent arthroscopy, providing ground-truth labels from intraoperative findings, the diagnostic gold standard. Separate models were trained for MRIs and MRAs using CNN- and transformer-based architectures, with predictions ensembled across multiple imaging planes. Our models achieved radiologist-level performance, with accuracy on standard MRIs surpassing radiologists interpreting MRAs. External validation on independent hospital data demonstrated initial generalizability across imaging protocols. By releasing ScopeMRI and a modular codebase for training and evaluation, we aim to accelerate research in musculoskeletal imaging and foster development of datasets and models that address clinically challenging diagnostic tasks.
comment: This version of the article has been accepted for publication at Nature Partner Journal (NPJ) Artificial Intelligence, but is not the Version of Record and does not reflect post-acceptance improvements or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1038/s44387-025-00043-5
Structure is Supervision: Multiview Masked Autoencoders for Radiology
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions CVPR
Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives (classification, detection, and tracking) while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.
comment: Best Paper, Accepted at CVPR Workshop Anti-UAV 2025. 16 pages
Adaptive Plane Reformatting for 4D Flow MRI using Deep Reinforcement Learning
Background and Objective: Plane reformatting for four-dimensional phase contrast MRI (4D flow MRI) is time-consuming and prone to inter-observer variability, which limits fast cardiovascular flow assessment. Deep reinforcement learning (DRL) trains agents to iteratively adjust plane position and orientation, enabling accurate plane reformatting without the need for detailed landmarks, making it suitable for images with limited contrast and resolution such as 4D flow MRI. However, current DRL methods assume that test volumes share the same spatial alignment as the training data, limiting generalization across scanners and institutions. To address this limitation, we introduce AdaPR (Adaptive Plane Reformatting), a DRL framework that uses a local coordinate system to navigate volumes with arbitrary positions and orientations. Methods: We implemented AdaPR using the Asynchronous Advantage Actor-Critic (A3C) algorithm and validated it on 88 4D flow MRI datasets acquired from multiple vendors, including patients with congenital heart disease. Results: AdaPR achieved a mean angular error of 6.32 +/- 4.15 degrees and a distance error of 3.40 +/- 2.75 mm, outperforming global-coordinate DRL methods and alternative non-DRL methods. AdaPR maintained consistent accuracy under different volume orientations and positions. Flow measurements from AdaPR planes showed no significant differences compared to two manual observers, with excellent correlation (R^2 = 0.972 and R^2 = 0.968), comparable to inter-observer agreement (R^2 = 0.969). Conclusion: AdaPR provides robust, orientation-independent plane reformatting for 4D flow MRI, achieving flow quantification comparable to expert observers. Its adaptability across datasets and scanners makes it a promising candidate for medical imaging applications beyond 4D flow MRI.
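The angular and distance errors reported for AdaPR can be made concrete with a small sketch. The abstract does not define the two metrics precisely, so the definitions below are assumptions: the angular error is taken as the angle between plane normals with the normal sign ignored, and the distance error as the distance from a reference plane point to the predicted plane. All function names are hypothetical.

```python
import math


def angular_error_deg(n_pred, n_ref):
    """Angle in degrees between two plane normals, ignoring normal sign."""
    dot = sum(a * b for a, b in zip(n_pred, n_ref))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_ref))
    # abs() treats n and -n as the same plane; min() guards against rounding above 1.
    cos_angle = min(1.0, abs(dot) / norm)
    return math.degrees(math.acos(cos_angle))


def point_to_plane_distance(point, plane_point, plane_normal):
    """Distance from `point` to the plane through `plane_point` with normal `plane_normal`."""
    norm = math.sqrt(sum(c * c for c in plane_normal))
    offset = sum((p - q) * c for p, q, c in zip(point, plane_point, plane_normal))
    return abs(offset) / norm


# Hypothetical planes: the reference normal is tilted 10 degrees away from the z axis.
tilt = math.radians(10.0)
angle = angular_error_deg((0.0, 0.0, 1.0), (0.0, math.sin(tilt), math.cos(tilt)))
dist_mm = point_to_plane_distance((0.0, 0.0, 3.0), (0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
```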
MAMMA: Markerless & Automatic Multi-Person Motion Action Capture
We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person-person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, benchmark, method, training code, and pre-trained model weights for research purposes.
Benchmarking machine learning models for multi-class state recognition in double quantum dot data
Semiconductor quantum dots (QDs) are a leading platform for scalable quantum processors. However, scaling to large arrays requires reliable, automated tuning strategies for devices' bootstrapping, calibration, and operation, with many tuning aspects depending on accurately identifying QD device states from charge-stability diagrams (CSDs). In this work, we present a comprehensive benchmarking study of four modern machine learning (ML) architectures for multi-class state recognition in double-QD CSDs. We evaluate their performance across different data budgets and normalization schemes using both synthetic and experimental data. We find that the more resource-intensive models -- U-Nets and visual transformers (ViTs) -- achieve the highest MSE score (defined as $1-\mathrm{MSE}$) on synthetic data (over $0.98$) but fail to generalize to experimental data. MDNs are the most computationally efficient and exhibit highly stable training, but with substantially lower peak performance. CNNs offer the most favorable trade-off on experimental CSDs, achieving strong accuracy with two orders of magnitude fewer parameters than the U-Nets and ViTs. Normalization plays a nontrivial role: min-max scaling generally yields higher MSE scores but less stable convergence, whereas z-score normalization produces more predictable training dynamics but at reduced accuracy for most models. Overall, our study shows that CNNs with min-max normalization are a practical approach for QD CSDs.
comment: 12 pages, 4 figures, 2 tables
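The two normalization schemes and the MSE score ($1-\mathrm{MSE}$) compared in the abstract above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function names, the flat list layout, the handling of constant inputs, and the use of the population standard deviation are all assumptions.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: assumed convention
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values to zero mean and scale to unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant input: assumed convention
    return [(v - mean) / std for v in values]


def mse_score(pred, target):
    """MSE score as defined in the abstract: 1 - MSE."""
    mse = statistics.fmean([(p - t) ** 2 for p, t in zip(pred, target)])
    return 1.0 - mse


pixels = [2.0, 4.0, 6.0, 10.0]         # hypothetical CSD pixel values
scaled = min_max_scale(pixels)         # values in [0, 1]
standardized = z_score(pixels)         # zero mean, unit variance
score = mse_score([0.9, 0.1], [1.0, 0.0])
```

Min-max scaling is sensitive to a single extreme pixel, while z-score normalization leaves the range unbounded, which may be one reason the two schemes behave differently during training.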
Global-to-local image quality assessment in optical microscopy via fast and robust deep learning predictions
Optical microscopy is one of the most widely used techniques in research studies for life sciences and biomedicine. These applications require reliable experimental pipelines to extract valuable knowledge from the measured samples and must be supported by image quality assessment (IQA) to ensure correct processing and analysis of the image data. IQA methods are implemented with variable complexity. However, while most quality metrics have a straightforward implementation, they might be time-consuming and computationally expensive when evaluating a large dataset. In addition, quality metrics are often designed for well-defined image features and may be unstable for images out of the ideal domain. To overcome these limitations, recent works have proposed deep learning-based IQA methods, which can provide superior performance, increased generalizability and fast prediction. Our method, named μDeepIQA, is inspired by previous studies and applies a deep convolutional neural network designed for IQA on natural images to optical microscopy measurements. We retrained the same architecture to predict individual quality metrics and global quality scores for optical microscopy data. The resulting models provide fast and stable predictions of image quality by generalizing quality estimation even outside the ideal range of standard methods. In addition, μDeepIQA provides patch-wise prediction of image quality and can be used to visualize spatially varying quality in a single image. Our study demonstrates that optical microscopy-based studies can benefit from the generalizability of deep learning models due to their stable performance in the presence of outliers, the ability to assess small image patches, and rapid predictions.
comment: 16 pages, 6 figures. μDeepIQA is publicly available at https://git.photonicdata.science/elena.corbetta/udeepiqa
Manual-PA: Learning 3D Part Assembly from Instruction Diagrams ICCV'25
Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, which makes learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone to predict the assembly order, and infers the 6D pose of each part by relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly on the IKEA-Manual dataset.
comment: Accepted to ICCV'25
MRI Super-Resolution with Deep Learning: A Comprehensive Survey
High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.
comment: 41 pages
Multivariate Variational Autoencoder
Learning latent representations that are simultaneously expressive, geometrically well-structured, and reliably calibrated remains a central challenge for Variational Autoencoders (VAEs). Standard VAEs typically assume a diagonal Gaussian posterior, which simplifies optimization but rules out correlated uncertainty and often yields entangled or redundant latent dimensions. We introduce the Multivariate Variational Autoencoder (MVAE), a tractable full-covariance extension of the VAE that augments the encoder with sample-specific diagonal scales and a global coupling matrix. This induces a multivariate Gaussian posterior of the form $\mathcal{N}(\mu_\phi(x), C \operatorname{diag}(\sigma_\phi^2(x)) C^\top)$, enabling correlated latent factors while preserving a closed-form KL divergence and a simple reparameterization path. Beyond likelihood, we propose a multi-criterion evaluation protocol that jointly assesses reconstruction quality (MSE, ELBO), downstream discrimination (linear probes), probabilistic calibration (NLL, Brier, ECE), and unsupervised structure (NMI, ARI). Across Larochelle-style MNIST variants, Fashion-MNIST, and CIFAR-10/100, MVAE consistently matches or outperforms diagonal-covariance VAEs of comparable capacity, with particularly notable gains in calibration and clustering metrics at both low and high latent dimensions. Qualitative analyses further show smoother, more semantically coherent latent traversals and sharper reconstructions. All code, dataset splits, and evaluation utilities are released to facilitate reproducible comparison and future extensions of multivariate posterior models.
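The closed-form KL divergence and reparameterization path mentioned in the MVAE abstract follow from standard Gaussian identities. The worked equations below are a sketch under stated assumptions that the abstract does not confirm: a standard normal prior $p(z) = \mathcal{N}(0, I)$, latent dimension $d$, and an invertible coupling matrix $C$ with columns $c_1, \dots, c_d$.

```latex
% Assumptions: standard normal prior p(z) = N(0, I), latent dimension d,
% and an invertible coupling matrix C with columns c_1, ..., c_d.
\begin{align}
z &= \mu_\phi(x) + C\,\bigl(\sigma_\phi(x) \odot \varepsilon\bigr),
  \qquad \varepsilon \sim \mathcal{N}(0, I), \\
\Sigma &= C \operatorname{diag}\bigl(\sigma_\phi^2(x)\bigr) C^\top, \\
\operatorname{tr}(\Sigma) &= \sum_{j=1}^{d} \sigma_{\phi,j}^2(x)\,\lVert c_j \rVert^2, \\
\log\det\Sigma &= 2\log\lvert\det C\rvert + \sum_{j=1}^{d} \log \sigma_{\phi,j}^2(x), \\
\mathrm{KL}\bigl(q_\phi(z \mid x)\,\Vert\,p(z)\bigr)
  &= \tfrac{1}{2}\Bigl[\operatorname{tr}(\Sigma) + \lVert \mu_\phi(x) \rVert^2 - d - \log\det\Sigma\Bigr].
\end{align}
```

Since $C$ is shared across samples, $\log\lvert\det C\rvert$ and the column norms $\lVert c_j \rVert^2$ need to be computed only once per parameter update, so the per-sample cost stays close to that of a diagonal-covariance VAE.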
Hierarchical Semi-Supervised Active Learning for Remote Sensing
The performance of deep learning models in remote sensing (RS) strongly depends on the availability of high-quality labeled data. However, collecting large-scale annotations is costly and time-consuming, while vast amounts of unlabeled imagery remain underutilized. To address this challenge, we propose a Hierarchical Semi-Supervised Active Learning (HSSAL) framework that integrates semi-supervised learning (SSL) and a novel hierarchical active learning (HAL) in a closed iterative loop. In each iteration, SSL refines the model using both labeled data through supervised learning and unlabeled data via weak-to-strong self-training, improving feature representation and uncertainty estimation. Guided by the refined representations and uncertainty cues of unlabeled samples, HAL then conducts sample querying through a progressive clustering strategy, selecting the most informative instances that jointly satisfy the criteria of scalability, diversity, and uncertainty. This hierarchical process ensures both efficiency and representativeness in sample selection. Extensive experiments on three benchmark RS scene classification datasets (UCM, AID, and NWPU-RESISC45) demonstrate that HSSAL consistently outperforms SSL- or AL-only baselines. Remarkably, with only 8%, 4%, and 2% labeled training data on UCM, AID, and NWPU-RESISC45, respectively, HSSAL achieves over 95% of fully-supervised accuracy, highlighting its superior label efficiency gained by exploiting the information in unlabeled data. Our code will be publicly available.
comment: Under review
Harnessing Diffusion-Generated Synthetic Images for Fair Image Classification AAAI
Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and training a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases.
comment: Accepted to AAAI AISI Track, 2026
Sketch-guided Cage-based 3D Gaussian Splatting Deformation WACV 26
3D Gaussian Splatting (GS) is one of the most promising novel 3D representations that has received great interest in computer graphics and computer vision. While various systems have introduced editing capabilities for 3D GS, such as those guided by text prompts, fine-grained control over deformation remains an open challenge. In this work, we present a novel sketch-guided 3D GS deformation system that allows users to intuitively modify the geometry of a 3D GS model by drawing a silhouette sketch from a single viewpoint. Our approach introduces a new deformation method that combines cage-based deformations with a variant of Neural Jacobian Fields, enabling precise, fine-grained control. Additionally, it leverages large-scale 2D diffusion priors and ControlNet to ensure the generated deformations are semantically plausible. Through a series of experiments, we demonstrate the effectiveness of our method and showcase its ability to animate static 3D GS models as one of its key applications.
comment: 10 pages, 9 figures, accepted at WACV 26, project page: https://tianhaoxie.github.io/project/gs_deform/
Benchmarking pig detection and tracking under diverse and challenging conditions
To ensure animal welfare and effective management in pig farming, monitoring individual behavior is a crucial prerequisite. While monitoring tasks have traditionally been carried out manually, advances in machine learning have made it possible to collect individualized information in an increasingly automated way. Central to these methods is the localization of animals across space (object detection) and time (multi-object tracking). Despite extensive research of these two tasks in pig farming, a systematic benchmarking study has not yet been conducted. In this work, we address this gap by curating two datasets: PigDetect for object detection and PigTrack for multi-object tracking. The datasets are based on diverse image and video material from realistic barn conditions, and include challenging scenarios such as occlusions or bad visibility. For object detection, we show that challenging training images improve detection performance beyond what is achievable with randomly sampled images alone. Comparing different approaches, we found that state-of-the-art models offer substantial improvements in detection quality over real-time alternatives. For multi-object tracking, we observed that SORT-based methods achieve superior detection performance compared to end-to-end trainable models. However, end-to-end models show better association performance, suggesting they could become strong alternatives in the future. We also investigate characteristic failure cases of end-to-end models, providing guidance for future improvements. The detection and tracking models trained on our datasets perform well in unseen pens, suggesting good generalization capabilities. This highlights the importance of high-quality training data. The datasets and research code are made publicly available to facilitate reproducibility, re-use and further development.
comment: 16 pages, 6 figures and 8 tables
CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.
Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning
Pruning of deep neural networks has been an effective technique for reducing model size while preserving most of the performance of dense networks, crucial for deploying models on memory and power-constrained devices. While recent sparse learning methods have shown promising performance up to moderate sparsity levels such as 95% and 98%, accuracy quickly deteriorates when pushing sparsities to extreme levels due to unique challenges such as fragile gradient flow. In this work, we explore network performance beyond the commonly studied sparsities, and develop techniques that encourage stable training without accuracy collapse even at extreme sparsities, including 99.90%, 99.95%, and 99.99% on ResNet architectures. We propose three complementary techniques that enhance sparse training through different mechanisms: 1) Dynamic ReLU phasing, where DyReLU initially allows for richer parameter exploration before being gradually replaced by standard ReLU, 2) weight sharing, which reuses parameters within a residual layer while maintaining the same number of learnable parameters, and 3) cyclic sparsity, where both sparsity levels and sparsity patterns evolve dynamically throughout training to better encourage parameter exploration. We evaluate our method, which we term Extreme Adaptive Sparse Training (EAST), at extreme sparsities using ResNet-34 and ResNet-50 on CIFAR-10, CIFAR-100, and ImageNet, achieving competitive or improved performance compared to existing methods, with notable gains at extreme sparsity levels.
comment: V4: moderate revisions and overall improvements for journal camera ready submission
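The cyclic sparsity idea from the EAST abstract above can be sketched as a schedule whose sparsity level oscillates during training, paired with a mask that is recomputed whenever the level changes. Both the cosine form of the schedule and the magnitude-based mask are assumptions made for illustration; the abstract does not specify either.

```python
import math


def cyclic_sparsity(step, period, s_low, s_high):
    """Sparsity level at a training step under a cosine cycle (hypothetical schedule).

    Starts at s_low, rises to s_high halfway through each period, then returns.
    """
    phase = (step % period) / period
    return s_low + (s_high - s_low) * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude weights (assumed criterion)."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]
```

Because the mask is rebuilt from current magnitudes at each new sparsity level, the set of active weights (the sparsity pattern) can change along with the level.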
Flow Equivariant Recurrent Neural Networks NeurIPS '25
Data arrives at our senses as a continuous stream, smoothly transforming from one instant to the next. These smooth transformations can be viewed as continuous symmetries of the environment that we inhabit, defining equivalence relations between stimuli over time. In machine learning, neural network architectures that respect symmetries of their data are called equivariant and have provable benefits in terms of generalization ability and sample efficiency. To date, however, equivariance has been considered only for static transformations and feed-forward networks, limiting its applicability to sequence models, such as recurrent neural networks (RNNs), and corresponding time-parameterized sequence transformations. In this work, we extend equivariant network theory to this regime of 'flows' -- one-parameter Lie subgroups capturing natural transformations over time, such as visual motion. We begin by showing that standard RNNs are generally not flow equivariant: their hidden states fail to transform in a geometrically structured manner for moving stimuli. We then show how flow equivariance can be introduced, and demonstrate that these models significantly outperform their non-equivariant counterparts in terms of training speed, length generalization, and velocity generalization, on both next step prediction and sequence classification. We present this work as a first step towards building sequence models that respect the time-parameterized symmetries which govern the world around us.
comment: NeurIPS '25, Spotlight
RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users SC
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
comment: Project Website: https://scai.cs.jhu.edu/projects/RealWebAssist/ Code: https://github.com/SCAI-JHU/RealWebAssist
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs with huge computational cost or are limited to generating monochrome icons of over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
comment: 20 pages; Project Page: https://omnisvg.github.io/
FedHK-MVFC: Federated Heat Kernel Multi-View Clustering
In the realm of distributed artificial intelligence (AI) and privacy-focused medical applications, this paper proposes a multi-view clustering framework that links quantum field theory with federated healthcare analytics. The method uses heat kernel coefficients from spectral analysis to convert Euclidean distances into geometry-aware similarity measures that capture the structure of diverse medical data. The framework is presented through the heat kernel distance (HKD) transformation, which has convergence guarantees. Two algorithms have been developed: The first, Heat Kernel-Enhanced Multi-View Fuzzy Clustering (HK-MVFC), is used for central analysis. The second, Federated Heat Kernel Multi-View Fuzzy Clustering (FedHK-MVFC), is used for secure, privacy-preserving learning across hospitals. FedHK-MVFC uses differential privacy and secure aggregation to enable HIPAA-compliant collaboration. Tests on synthetic cardiovascular patient datasets demonstrate increased clustering accuracy, reduced communication, and retained efficiency compared to centralized methods. After being validated on 10,000 synthetic patient records across two hospitals, the methods proved useful for collaborative phenotyping involving electrocardiogram (ECG) data, cardiac imaging data, and behavioral data. The proposed methods' theoretical contributions include update rules with proven convergence, adaptive view weighting, and privacy-preserving protocols. These contributions establish a new standard for geometry-aware federated learning in healthcare, translating advanced mathematics into practical solutions for analyzing sensitive medical data while ensuring rigor and clinical relevance.
comment: 53 pages, 11 figures, and 9 tables
DenoiseGS: Gaussian Reconstruction Model for Burst Denoising
Burst denoising methods are crucial for enhancing images captured on handheld devices, but they often struggle with large motion or suffer from prohibitive computational costs. In this paper, we propose DenoiseGS, the first framework to leverage the efficiency of 3D Gaussian Splatting for burst denoising. Our approach addresses two key challenges when applying a feedforward Gaussian reconstruction model to noisy inputs: the degradation of Gaussian point clouds and the loss of fine details. To this end, we propose a Gaussian self-consistency (GSC) loss, which regularizes the geometry predicted from noisy inputs with high-quality Gaussian point clouds. These point clouds are generated from clean inputs by the same model that we are training, thereby alleviating potential bias or domain gaps. Additionally, we introduce a log-weighted frequency (LWF) loss to strengthen supervision within the spectral domain, effectively preserving fine-grained details. The LWF loss adaptively weights frequency discrepancies in a logarithmic manner, emphasizing challenging high-frequency details. Extensive experiments demonstrate that DenoiseGS significantly exceeds the state-of-the-art NeRF-based methods on both burst denoising and novel view synthesis under noisy conditions, while achieving 250× faster inference speed. Code and models are released at https://github.com/yscheng04/DenoiseGS.
comment: Update Abstract
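As a rough illustration of the log-weighted frequency (LWF) loss mentioned in the DenoiseGS abstract above, the sketch below compares two 1D signals in the Fourier domain and weights each frequency discrepancy by the logarithm of its own magnitude. The exact weighting, the 1D setting, and the function names are assumptions; the paper's formulation may differ.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def log_weighted_frequency_loss(pred, target):
    """Mean over frequencies of log(1 + |d|) * |d|, where d is the spectral difference.

    The log(1 + |d|) weight is an assumed form: it grows with the discrepancy,
    so poorly reconstructed frequencies contribute more to the loss.
    """
    diffs = [abs(p - q) for p, q in zip(dft(pred), dft(target))]
    return sum(math.log1p(d) * d for d in diffs) / len(diffs)
```

For real images the same idea would apply to a 2D FFT computed with a numerical library; the naive DFT here only keeps the sketch dependency-free.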
TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning
A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the ranking of rewards predicted for intermediate estimates is often inconsistent with that predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noisy images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization. Code: https://github.com/TerrysLearning/TTSnap/.
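The budget argument in the TTSnap abstract above can be illustrated with a small pruning loop: candidates are advanced to a few intermediate checkpoints, scored there, and only the best fraction continues. The callables `advance` and `score`, the checkpoint list, and the fixed keep fraction are hypothetical placeholders for a denoiser and a noise-aware reward model, not the paper's interface.

```python
def prune_search(seeds, advance, score, checkpoints, keep_fraction=0.5):
    """Search over candidates while pruning at intermediate denoising steps.

    seeds:       initial candidate states.
    advance:     advance(state, n_steps) -> state after n_steps more denoising steps.
    score:       score(state, step) -> reward estimate at that step (assumed noise-aware).
    checkpoints: increasing step counts; the last one is the fully denoised step.
    Returns the best final candidate and the total number of denoising steps spent.
    """
    survivors = list(seeds)
    done = 0
    cost = 0
    for step in checkpoints:
        survivors = [advance(s, step - done) for s in survivors]
        cost += len(survivors) * (step - done)
        done = step
        if step != checkpoints[-1]:
            # Keep only the top fraction according to the intermediate reward.
            n_keep = max(1, int(len(survivors) * keep_fraction))
            ranked = sorted(survivors, key=lambda s: score(s, step), reverse=True)
            survivors = ranked[:n_keep]
    best = max(survivors, key=lambda s: score(s, done))
    return best, cost
```

With 8 seeds, checkpoints at steps 10, 20, and 50, and half the candidates dropped at each intermediate checkpoint, the loop spends 180 denoising steps instead of the 400 needed to fully denoise every seed. The saving is only useful if intermediate scores rank candidates consistently with final scores, which is the alignment problem the abstract describes.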
Towards Fast and Scalable Normal Integration using Continuous Components WACV
Surface normal integration is a fundamental problem in computer vision, dealing with the objective of reconstructing a surface from its corresponding normal map. Existing approaches require an iterative global optimization to jointly estimate the depth of each pixel, which scales poorly to larger normal maps. In this paper, we address this problem by recasting normal integration as the estimation of relative scales of continuous components. By constraining pixels belonging to the same component to jointly vary their scale, we drastically reduce the number of optimization variables. Our framework includes a heuristic to accurately estimate continuous components from the start, a strategy to rebalance optimization terms, and a technique to iteratively merge components to further reduce the size of the problem. Our method achieves state-of-the-art results on the standard normal integration benchmark in as little as a few seconds and achieves one-order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.
comment: Accepted by the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026, first round. Camera-ready version. 17 pages, 9 figures, 6 tables. Code is available at https://github.com/francescomilano172/normal_integration_continuous_components
B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding
Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods only model relationships for pairwise objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state-of-the-art, and prove the advantages of n-ary relational perception in 3D localization.
Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising
Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network's capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.
RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation
Recently, significant advancements have been achieved in video generation technology, but applying it to resource-constrained downstream tasks like multi-frame animated sticker generation (ASG), characterized by low frame rates, abstract semantics, and a long-tail frame length distribution, remains challenging. Parameter-efficient fine-tuning (PEFT) techniques (e.g., Adapter, LoRA) for large pre-trained models suffer from insufficient fitting ability and source-domain knowledge interference. In this paper, we propose the Resource-Efficient Dual-Mask Training Framework (RDTF), a dedicated solution for the multi-frame ASG task under resource constraints. We argue that training a compact model from scratch with million-level samples outperforms PEFT on large models, with RDTF realizing this via three core designs: 1) a Discrete Frame Generation Network (DFGN) optimized for low-frame-rate ASG, ensuring parameter efficiency; 2) a dual-mask based data utilization strategy to enhance the availability and diversity of limited data; 3) a difficulty-adaptive curriculum learning method that decomposes sample entropy into static and adaptive components, enabling easy-to-difficult training convergence. To provide high-quality data support for RDTF's training from scratch, we construct VSD2M, a million-level multi-modal animated sticker dataset with rich annotations (static and animated stickers, action-focused text descriptions), filling the gap of dedicated animated data for the ASG task. Experiments demonstrate that RDTF is quantitatively and qualitatively superior to state-of-the-art PEFT methods (e.g., I2V-Adapter, SimDA) on ASG tasks, verifying the feasibility of our framework under resource constraints.
comment: Submitted to TMM
Fusion or Confusion? Assessing the impact of visible-thermal image fusion for automated wildlife detection
Efficient wildlife monitoring methods are necessary for biodiversity conservation and management. The combination of remote sensing, aerial imagery and deep learning offers promising opportunities to renew or improve existing survey methods. The complementary use of visible (VIS) and thermal infrared (TIR) imagery can add information compared to a single-source image and improve results in an automated detection context. However, the alignment and fusion process can be challenging, especially since visible and thermal images usually have different fields of view (FOV) and spatial resolutions. This research presents a case study on the great blue heron (Ardea herodias) to evaluate the performance of synchronous aerial VIS and TIR imagery to automatically detect individuals and nests using a YOLO11n model. Two VIS-TIR fusion methods were tested and compared: an early fusion approach and a late fusion approach, to determine if the addition of the TIR image gives any added value compared to a VIS-only model. VIS and TIR images were automatically aligned using a deep learning model. A principal component analysis fusion method was applied to VIS-TIR image pairs to form the early fusion dataset. A classification and regression tree was used to process the late fusion dataset, based on the detection from the VIS-only and TIR-only trained models. Across all classes, both late and early fusion improved the F1 score compared to the VIS-only model. For the main class, occupied nest, the late fusion improved the F1 score from 90.2% (VIS-only) to 93.0%. This model was also able to identify false positives from both sources with 90% recall. Although fusion methods seem to give better results, this approach comes with a limiting TIR FOV and alignment constraints that eliminate data. Using an aircraft-mounted very high-resolution visible sensor could be an interesting option for operationalizing surveys.
comment: 19 pages, 9 figures, submitted to Remote Sensing in Ecology and Conservation
MoH: Multi-Head Attention as Mixture-of-Head Attention ICML 2025
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
comment: Accepted by ICML 2025, code: https://github.com/SkyworkAI/MoH
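The weighted summation described in the MoH abstract above can be sketched for a single token as follows: a router assigns a logit to each attention head, only the top-k heads are kept, and their outputs are combined with softmax weights instead of a plain sum. The function name, the softmax over the selected logits, and the per-token list representation are illustrative assumptions; the paper's router design is not given in the abstract.

```python
import math


def moh_combine(head_outputs, router_logits, k):
    """Combine per-head outputs for one token using top-k routing.

    head_outputs:  list of head output vectors (one list of floats per head).
    router_logits: one routing score per head (assumed to come from a learned router).
    k:             number of heads to activate for this token.
    Returns the combined vector and the indices of the activated heads.
    """
    if not 0 < k <= len(head_outputs):
        raise ValueError("k must be between 1 and the number of heads")

    # Indices of the k heads with the largest routing scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]

    # Softmax restricted to the selected heads (max subtracted for stability).
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())

    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for i in top:
        weight = exps[i] / total
        for d in range(dim):
            combined[d] += weight * head_outputs[i][d]
    return combined, top
```

Heads outside the top-k are never touched in the combination loop, which is where the efficiency gain of activating only a subset of heads would come from.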
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements of ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from the Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
comment: code: https://github.com/stepfun-ai/Step1X-Edit
Class-Conditional Distribution Balancing for Group Robust Classification
Spurious correlations that lead models to correct predictions for the wrong reasons pose a critical challenge for robust real-world generalization. Existing research attributes this issue to group imbalance and addresses it by maximizing group-balanced or worst-group accuracy, which heavily relies on expensive bias annotations. A compromise approach involves predicting bias information using extensively pretrained foundation models, which requires large-scale data and becomes impractical for resource-limited rare domains. To address these challenges, we offer a novel perspective by reframing the spurious correlations as imbalances or mismatches in class-conditional distributions, and propose a simple yet effective robust learning method that eliminates the need for both bias annotations and predictions. With the goal of maximizing the conditional entropy (uncertainty) of the label given spurious factors, our method leverages a sample reweighting strategy to achieve class-conditional distribution balancing, which automatically highlights minority groups and classes, effectively dismantling spurious correlations and producing a debiased data distribution for classification. Extensive experiments and analysis demonstrate that our approach consistently delivers state-of-the-art performance, rivaling methods that rely on bias supervision.
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.
comment: Our homepage: https://kr1sjfu.github.io/iMontage-web/
Prediction of Distant Metastasis in Head and Neck Cancer Patients Using Tumor and Peritumoral Multi-Modal Deep Learning
Although the combined treatment of surgery, radiotherapy, chemotherapy, and emerging targeted therapy has significantly improved the outcomes of patients with head and neck cancer, distant metastasis remains the leading cause of treatment failure. In this study, we propose a deep learning-based multimodal framework integrating CT imaging, radiomics, and clinical data to predict metastasis risk in head and neck squamous cell carcinoma (HNSCC). A total of 1497 patients were retrospectively analyzed. Tumor and organ masks were generated from pretreatment CT scans, from which a 3D Swin Transformer extracted deep imaging features, while 1562 radiomics features were reduced to 36 via correlation filtering and random forest selection. Clinical data (age, sex, smoking, and alcohol status) were encoded and fused with imaging features, and the multimodal representation was fed into a fully connected network for prediction. Five-fold cross-validation was used to assess performance via AUC, accuracy, sensitivity, and specificity. The multimodal model outperformed all single-modality baselines. The deep learning module alone achieved an AUC of 0.715, whereas multimodal fusion significantly improved performance (AUC = 0.803, ACC = 0.752, SEN = 0.730, SPE = 0.758). Stratified analyses confirmed good generalizability across tumor subtypes. Ablation experiments demonstrated complementary contributions from each modality, and the 3D Swin Transformer provided more robust representations than conventional architectures. This multimodal deep learning model enables accurate, non-invasive metastasis prediction in HNSCC and shows strong potential for individualized treatment planning.
comment: 23 pages, 6 figures, 7 tables. Nuo Tong and Changhao Liu contributed equally. Corresponding Authors: Shuiping Gou and Mei Shi
Fast Multi-view Consistent 3D Editing with Video Priors AAAI2026
Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
comment: accepted by AAAI2026
VITA: Vision-to-Action Flow Matching Policy
Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA (VIsion-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that directly flows from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent space collapse, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equation) solving steps. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5x-2x faster inference compared to conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies. Code, datasets, and demos are available at our project page: https://ucd-dare.github.io/VITA/.
comment: Project page: https://ucd-dare.github.io/VITA/ Code: https://github.com/ucd-dare/VITA
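To illustrate the idea of flowing directly from a visual latent to an action latent, here is a minimal sketch of flow matching with a straight-line path and fixed-step Euler integration. The abstract does not state which interpolation path or ODE solver VITA uses, so the linear path, the Euler solver, the 3-dimensional latents, and the `oracle` velocity function are all assumptions made for illustration. This is not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(source: Vector, target: Vector, t: float) -> Vector:
    """Point on the straight path from source (t=0) to target (t=1)."""
    return [(1.0 - t) * s + t * g for s, g in zip(source, target)]


def target_velocity(source: Vector, target: Vector) -> Vector:
    """Velocity of the straight path, which is constant in t."""
    return [g - s for s, g in zip(source, target)]


def euler_solve(velocity_fn: Callable[[Vector, float], Vector],
                start: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-D latents; a real system would use learned representations.
visual_latent = [0.2, -1.0, 0.5]
action_latent = [1.0, 0.0, -0.5]

# Stand-in for a trained network: the exact velocity of the straight path.
oracle = lambda x, t: target_velocity(visual_latent, action_latent)
generated = euler_solve(oracle, visual_latent, steps=4)
midpoint = interpolate(visual_latent, action_latent, 0.5)
```

Because the source and target are added and subtracted elementwise, both vectors must have the same length. This mirrors the dimensionality constraint mentioned in the abstract.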
Capturing Context-Aware Route Choice Semantics for Trajectory Representation Learning
Trajectory representation learning (TRL) aims to encode raw trajectory data into low-dimensional embeddings for downstream tasks such as travel time estimation, mobility prediction, and trajectory similarity analysis. From a behavioral perspective, a trajectory reflects a sequence of route choices within an urban environment. However, most existing TRL methods ignore this underlying decision-making process and instead treat trajectories as static, passive spatiotemporal sequences, thereby limiting the semantic richness of the learned representations. To bridge this gap, we propose CORE, a TRL framework that integrates context-aware route choice semantics into trajectory embeddings. CORE first incorporates a multi-granular Environment Perception Module, which leverages large language models (LLMs) to distill environmental semantics from point of interest (POI) distributions, thereby constructing a context-enriched road network. Building upon this backbone, CORE employs a Route Choice Encoder with a mixture-of-experts (MoE) architecture, which captures route choice patterns by jointly leveraging the context-enriched road network and navigational factors. Finally, a Transformer encoder aggregates the route-choice-aware representations into a global trajectory embedding. Extensive experiments on 4 real-world datasets across 6 downstream tasks demonstrate that CORE consistently outperforms 12 state-of-the-art TRL methods, achieving an average improvement of 9.79% over the best-performing baseline. Our code is available at https://github.com/caoji2001/CORE.
Self-Supervised One-Step Diffusion Refinement for Snapshot Compressive Imaging
Snapshot compressive imaging (SCI) captures multispectral images (MSIs) using a single coded two-dimensional (2-D) measurement, but reconstructing high-fidelity MSIs from these compressed inputs remains a fundamentally ill-posed challenge. While diffusion-based reconstruction methods have recently raised the bar for quality, they face critical limitations: a lack of large-scale MSI training data, adverse domain shifts from RGB-pretrained models, and inference inefficiencies due to multi-step sampling. These drawbacks restrict their practicality in real-world applications. In contrast to existing methods, which either follow costly iterative refinement or adapt subspace-based embeddings for diffusion models (e.g. DiffSCI, PSR-SCI), we introduce a fundamentally different paradigm: a self-supervised One-Step Diffusion (OSD) framework specifically designed for SCI. The key novelty lies in using a single-step diffusion refiner to correct an initial reconstruction, eliminating iterative denoising entirely while preserving generative quality. Moreover, we adopt a self-supervised equivariant learning strategy to train both the predictor and refiner directly from raw 2-D measurements, enabling generalization to unseen domains without the need for ground-truth MSI. To further address the challenge of limited MSI data, we design a band-selection-driven distillation strategy that transfers core generative priors from large-scale RGB datasets, effectively bridging the domain gap. Extensive experiments confirm that our approach sets a new benchmark, yielding PSNR gains of 3.44 dB, 1.61 dB, and 0.28 dB on the Harvard, NTIRE, and ICVL datasets, respectively, while reducing reconstruction time by 97.5%. This remarkable improvement in efficiency and adaptability makes our method a significant advancement in SCI reconstruction, combining both accuracy and practicality for real-world deployment.
Improving Partially Observed Trajectories Forecasting by Target-driven Self-Distillation
Accurate prediction of future trajectories of traffic agents is essential for ensuring safe autonomous driving. However, partially observed trajectories can significantly degrade the performance of even state-of-the-art models. Previous approaches often rely on knowledge distillation to transfer features from fully observed trajectories to partially observed ones. This involves first training a fully observed model and then using a distillation process to create the final model. While effective, these approaches require multi-stage training, making the training process very expensive. Moreover, knowledge distillation can lead to a performance degradation of the model. In this paper, we introduce a Target-driven Self-Distillation method (TSD) for motion forecasting. Our method leverages predicted accurate targets to guide the model in making predictions under partial observation conditions. By employing self-distillation, the model learns from the feature distributions of both fully observed and partially observed trajectories during a single end-to-end training process. This enhances the model's ability to predict motion accurately in both fully observed and partially observed scenarios. We evaluate our method on multiple datasets and state-of-the-art motion forecasting models. Extensive experimental results demonstrate that our approach achieves significant performance improvements in both settings. To facilitate further research, we will release our code and model checkpoints.
Rank Matters: Understanding and Defending Model Inversion Attacks via Low-Rank Feature Filtering KDD 2026
Model Inversion Attacks (MIAs) pose a significant threat to data privacy by reconstructing sensitive training samples from the knowledge embedded in trained machine learning models. Despite recent progress in enhancing the effectiveness of MIAs across diverse settings, defense strategies have lagged behind -- struggling to balance model utility with robustness against increasingly sophisticated attacks. In this work, we propose the ideal inversion error to measure the privacy leakage, and our theoretical and empirical investigations reveal that higher-rank features are inherently more prone to privacy leakage. Motivated by this insight, we propose a lightweight and effective defense strategy based on low-rank feature filtering, which explicitly reduces the attack surface by constraining the dimension of intermediate representations. Extensive experiments across various model architectures and datasets demonstrate that our method consistently outperforms existing defenses, achieving state-of-the-art performance against a wide range of MIAs. Notably, our approach remains effective even in challenging regimes involving high-resolution data and high-capacity models, where prior defenses fail to provide adequate protection.
comment: KDD 2026 Accept
Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective AAAI 2026
Point cloud completion aims to reconstruct complete 3D shapes from partial observations, which is a challenging problem due to severe occlusions and missing geometry. Despite recent advances in multimodal techniques that leverage complementary RGB images to compensate for missing geometry, most methods still follow a Completion-by-Inpainting paradigm, synthesizing missing structures from fused latent features. We empirically show that this paradigm often results in structural inconsistencies and topological artifacts due to limited geometric and semantic constraints. To address this, we rethink the task and propose a more robust paradigm, termed Completion-by-Correction, which begins with a topologically complete shape prior generated by a pretrained image-to-3D model and performs feature-space correction to align it with the partial observation. This paradigm shifts completion from unconstrained synthesis to guided refinement, enabling structurally consistent and observation-aligned reconstruction. Building upon this paradigm, we introduce PGNet, a multi-stage framework that conducts dual-feature encoding to ground the generative prior, synthesizes a coarse yet structurally aligned scaffold, and progressively refines geometric details via hierarchical correction. Experiments on the ShapeNetViPC dataset demonstrate the superiority of PGNet over state-of-the-art baselines in terms of average Chamfer Distance (-23.5%) and F-score (+7.1%).
comment: Accepted by AAAI 2026
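The abstract above reports results in terms of Chamfer Distance. The following is a minimal sketch of one common definition: the mean nearest-neighbour Euclidean distance, summed over both directions. Benchmarks differ in whether they use squared distances or other norms, and the convention used on ShapeNetViPC is not stated in the abstract, so treat this variant and the toy point sets as assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer Distance: mean nearest-neighbour distance in both directions."""
    def one_way(src: List[Point], dst: List[Point]) -> float:
        # For every point in src, find the closest point in dst, then average.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy point clouds (hypothetical): the prediction misses one reference point.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(predicted, reference)
```

The brute-force search is quadratic in the number of points. Practical implementations typically use a spatial index or GPU batching instead.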
TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
comment: for more information, see https://scottykma.github.io/tempomaster-gitpage/
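To make the coarse-to-fine idea of next-frame-rate prediction concrete, the sketch below lists which frame indices would be newly generated at each frame-rate level if the temporal stride were halved at every level. The abstract does not specify the actual schedule, so the halving rule, the function name `frame_rate_levels`, and the 16-frame example are assumptions made for illustration only.

```python
from typing import List


def frame_rate_levels(num_frames: int, coarsest_stride: int) -> List[List[int]]:
    """For each level, list the frame indices that are new at that level.

    The stride is halved at every level, so each level roughly doubles the
    frame rate of the previous one.
    """
    levels = []
    seen = set()
    stride = coarsest_stride
    while stride >= 1:
        new = [i for i in range(0, num_frames, stride) if i not in seen]
        seen.update(new)
        levels.append(new)
        stride //= 2
    return levels


# A 16-frame clip starting from a stride of 8 gives four levels.
schedule = frame_rate_levels(num_frames=16, coarsest_stride=8)
```

In this reading, frames within one level could be produced in parallel, while later levels are conditioned on the levels before them.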
3EED: Ground Everything Everywhere in 3D NeurIPS 2025
Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
comment: NeurIPS 2025 DB Track; 38 pages, 17 figures, 10 tables; Project Page at https://project-3eed.github.io/
InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving
Conventional end-to-end autonomous driving methods often rely on explicit global scene representations, which typically consist of 3D object detection, online mapping, and motion prediction. In contrast, human drivers selectively attend to task-relevant regions and implicitly reason over the broader traffic context. Motivated by this observation, we introduce a lightweight end-to-end autonomous driving framework, InsightDrive. Unlike approaches that directly embed large language models (LLMs), InsightDrive introduces an Insight scene representation that jointly models attention-centric explicit scene representation and reasoning-centric implicit scene representation, so that scene understanding aligns more closely with human cognitive patterns for trajectory planning. To this end, we employ Chain-of-Thought (CoT) instructions to model human driving cognition and design a task-level Mixture-of-Experts (MoE) adapter that injects this knowledge into the autonomous driving model at negligible parameter cost. We further condition the planner on both explicit and implicit scene representations and employ a diffusion-based generative policy, which produces robust trajectory predictions and decisions. The overall framework establishes a knowledge distillation pipeline that transfers human driving knowledge to LLMs and subsequently to onboard models. Extensive experiments on the nuScenes and Navsim benchmarks demonstrate that InsightDrive achieves significant improvements over conventional scene representation approaches.
Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder
Dynamic Magnetic Resonance Imaging (MRI) of the vocal tract has become an increasingly adopted imaging modality for speech motor studies. Beyond image signals, systematic data loss, noise pollution, and audio file corruption can occur due to the unpredictability of the MRI acquisition environment. In such cases, generating audio from images is critical for data recovery in both clinical and research applications. However, this remains challenging due to hardware constraints, acoustic interference, and data corruption. Existing solutions, such as denoising and multi-stage synthesis methods, face limitations in audio fidelity and generalizability. To address these challenges, we propose a Knowledge Enhanced Conditional Variational Autoencoder (KE-CVAE), a novel two-step "knowledge enhancement + variational inference" framework for generating speech audio signals from cine dynamic MRI sequences. This approach introduces two key innovations: (1) integration of unlabeled MRI data for knowledge enhancement, and (2) a variational inference architecture to improve generative modeling capacity. To the best of our knowledge, this is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. The proposed method was trained and evaluated on an open-source dynamic vocal tract MRI dataset recorded during speech. Experimental results demonstrate its effectiveness in generating natural speech waveforms while addressing MRI-specific acoustic challenges, outperforming conventional deep learning-based synthesis approaches.
Video Anomaly Detection with Semantics-Aware Information Bottleneck
Semi-supervised video anomaly detection methods face two critical challenges: (1) Strong generalization blurs the boundary between normal and abnormal patterns. Although existing approaches attempt to alleviate this issue using memory modules, their rigid prototype-matching process limits adaptability to diverse scenarios; (2) Relying solely on low-level appearance and motion cues makes it difficult to perceive high-level semantic anomalies in complex scenes. To address these limitations, we propose SIB-VAD, a novel framework based on adaptive information bottleneck filtering and semantic-aware enhancement. We propose the Sparse Feature Filtering Module (SFFM) to replace traditional memory modules. It compresses normal features directly into a low-dimensional manifold based on the information bottleneck principle and uses an adaptive routing mechanism to dynamically select the most suitable normal bottleneck subspace. Trained only on normal data, SFFMs only learn normal low-dimensional manifolds, while abnormal features deviate and are effectively filtered. Unlike memory modules, SFFM directly removes abnormal information and adaptively handles scene variations. To improve semantic awareness, we further design a multimodal prediction framework that jointly models appearance, motion, and semantics. Through multimodal consistency constraints and joint error computation, it achieves more robust VAD performance. Experimental results validate the effectiveness of our feature filtering paradigm based on semantics-aware information bottleneck. Project page at https://qzfm.github.io/sib_vad_project_page/
SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding NeurIPS 2025
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
comment: NeurIPS 2025; 24 pages, 10 figures, 9 tables; Code at https://github.com/worldbench/SPIRAL
AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading
Embryo fragmentation is a morphological indicator critical for evaluating developmental potential in In Vitro Fertilization (IVF). However, manual grading is subjective and inefficient, while existing deep learning solutions often lack clinical explainability or suffer from accumulated errors in segmentation area estimation. To address these issues, this study proposes AttnRegDeepLab (Attention-Guided Regression DeepLab), a framework characterized by dual-branch Multi-Task Learning (MTL). A vanilla DeepLabV3+ decoder is modified by integrating Attention Gates into its skip connections, explicitly suppressing cytoplasmic noise to preserve contour details. Furthermore, a Multi-Scale Regression Head is introduced with a Feature Injection mechanism to propagate global grading priors into the segmentation task, rectifying systematic quantification errors. A 2-stage decoupled training strategy is proposed to address the gradient conflict in MTL. Also, a range-based loss is designed to leverage weakly labeled data. Our method achieves robust grading precision while maintaining excellent segmentation accuracy (Dice coefficient = 0.729), in contrast to the end-to-end counterpart that might minimize grading error at the expense of contour integrity. This work provides a clinically interpretable solution that balances visual fidelity and quantitative precision.
comment: 7 pages, 5 figures
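The segmentation accuracy above is reported as a Dice coefficient. As a reminder of what that number measures, here is a minimal sketch for flat binary masks. The function name, the choice to return 1.0 for two empty masks, and the toy masks are assumptions made for illustration.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for flat binary masks."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly; returning 1.0 here is a convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical): 3 predicted pixels, 3 true pixels, 2 shared.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 0, 1, 1, 1, 0]
score = dice_coefficient(pred_mask, true_mask)
```

A score of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground pixels.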
Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation AAAI 2026
Cross-domain Few-shot Segmentation (CD-FSS) aims to segment novel classes from target domains that are not involved in training and have significantly different data distributions from the source domain, using only a few annotated samples. Recent years have witnessed significant progress on this task. However, existing CD-FSS methods primarily focus on style gaps between source and target domains while ignoring segmentation granularity gaps, resulting in insufficient semantic discriminability for novel classes in target domains. Therefore, we propose a Hierarchical Semantic Learning (HSL) framework to tackle this problem. Specifically, we introduce a Dual Style Randomization (DSR) module and a Hierarchical Semantic Mining (HSM) module to learn hierarchical semantic features, thereby enhancing the model's ability to recognize semantics at varying granularities. DSR simulates target domain data with diverse foreground-background style differences and overall style variations through foreground and global style randomization respectively, while HSM leverages multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities. Additionally, we propose a Prototype Confidence-modulated Thresholding (PCMT) module to mitigate segmentation ambiguity when foreground and background are excessively similar. Extensive experiments are conducted on four popular target domain datasets, and the results demonstrate that our method achieves state-of-the-art performance.
comment: Accepted by AAAI 2026
Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution AAAI 2026
The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) modules to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.
comment: 16 pages, Accepted by AAAI 2026, v2: corrected typos
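The idea of treating each LoRA rank as an independent expert can be sketched as a gated sum of rank-one updates. The example below is an illustrative reading of the abstract, not the authors' implementation. The binary gates, the top-k selection rule, the matrix shapes, and the toy scores are assumptions. In the paper the routing signal comes from a degradation estimation module, and the zero-expert slots are not modeled here.

```python
from typing import List


def mixture_of_ranks(x: List[float], A: List[List[float]], B: List[List[float]],
                     scores: List[float], shared: int, top_k: int) -> List[float]:
    """Low-rank update: sum over active ranks r of B[:, r] * (A[r] . x).

    A has shape (rank, d_in) and B has shape (d_out, rank). The first `shared`
    ranks are always active; `top_k` of the remaining ranks are chosen by score.
    """
    rank = len(A)
    routed = sorted(range(shared, rank), key=lambda r: scores[r], reverse=True)[:top_k]
    active = set(range(shared)) | set(routed)
    out = [0.0] * len(B)
    for r in active:
        projection = sum(a * xi for a, xi in zip(A[r], x))  # A[r] . x
        for j in range(len(B)):
            out[j] += B[j][r] * projection
    return out


# Toy example (hypothetical values): rank 3, input dim 2, output dim 2.
x = [1.0, 2.0]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
scores = [0.0, 0.9, 0.1]  # rank 0 is shared, so its score is ignored
delta = mixture_of_ranks(x, A, B, scores, shared=1, top_k=1)
```

With every rank active, this reduces to the usual dense LoRA update `B @ (A @ x)`. Sparsity comes only from which ranks the router leaves out.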
MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion
Multimodal medical image fusion (MMIF) aims to integrate images from different modalities to produce a comprehensive image that enhances medical diagnosis by accurately depicting organ structures, tissue textures, and metabolic information. Capturing both the unique and complementary information across multiple modalities simultaneously is a key research challenge in MMIF. To address this challenge, this paper proposes a novel image fusion method, MMIF-AMIN, which features a new architecture that can effectively extract these unique and complementary features. Specifically, an Invertible Dense Network (IDN) is employed for lossless feature extraction from individual modalities. To extract complementary information between modalities, a Multi-scale Complementary Feature Extraction Module (MCFEM) is designed, which incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. An adaptive loss function is introduced to guide model learning, addressing the limitations of traditional manually-designed loss functions and enhancing the depth of data mining. Extensive experiments demonstrate that MMIF-AMIN outperforms nine state-of-the-art MMIF methods, delivering superior results in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component of the proposed method. Additionally, extending MMIF-AMIN to other image fusion tasks also achieves promising performance.
comment: This manuscript is withdrawn to allow for substantial expansion and restructuring. Based on recent research progress, we plan to add generalization experiments and reorganize the manuscript structure to improve readability and logical flow. Thank you for your understanding and support
U-FaceBP: Uncertainty-aware Bayesian Ensemble Deep Learning for Face Video-based Blood Pressure Measurement
Blood pressure (BP) measurement is crucial for daily health assessment. Remote photoplethysmography (rPPG), which extracts pulse waves from face videos captured by a camera, has the potential to enable convenient BP measurement without specialized medical devices. However, there are various uncertainties in BP estimation using rPPG, leading to limited estimation performance and reliability. In this paper, we propose U-FaceBP, an uncertainty-aware Bayesian ensemble deep learning method for face video-based BP measurement. U-FaceBP models aleatoric and epistemic uncertainties in face video-based BP estimation with a Bayesian neural network (BNN). Additionally, we design U-FaceBP as an ensemble method, estimating BP from rPPG signals, PPG signals derived from face videos, and face images using multiple BNNs. Large-scale experiments on two datasets involving 1197 subjects from diverse racial groups demonstrate that U-FaceBP outperforms state-of-the-art BP estimation methods. Furthermore, we show that the uncertainty estimates provided by U-FaceBP are informative and useful for guiding modality fusion, assessing prediction reliability, and analyzing performance across racial groups.
comment: This work has been submitted to the IEEE for possible publication
AgriPotential: A Novel Multi-Spectral and Multi-Temporal Remote Sensing Dataset for Agricultural Potentials
Remote sensing has emerged as a critical tool for large-scale Earth monitoring and land management. In this paper, we introduce AgriPotential, a novel benchmark dataset composed of Sentinel-2 satellite imagery captured over multiple months. The dataset provides pixel-level annotations of agricultural potentials for three major crop types - viticulture, market gardening, and field crops - across five ordinal classes. AgriPotential supports a broad range of machine learning tasks, including ordinal regression, multi-label classification, and spatio-temporal modeling. The data cover diverse areas in Southern France, offering rich spectral information. AgriPotential is the first public dataset designed specifically for agricultural potential prediction, aiming to improve data-driven approaches to sustainable land use planning. The dataset and the code are freely accessible at: https://zenodo.org/records/15551829
comment: Accepted at CBMI 2025
AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios AAAI 2026
Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validated the challenging nature of our dataset and the effectiveness of our method.
comment: AAAI 2026
Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification
Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.
comment: 11 pages, 4 figures
UniFucGrasp: Human-Hand-Inspired Unified Functional Grasp Annotation Strategy and Dataset for Diverse Dexterous Hands
Dexterous grasp datasets are vital for embodied intelligence, but mostly emphasize grasp stability, ignoring functional grasps needed for tasks like opening bottle caps or holding cup handles. Most rely on bulky, costly, and hard-to-control high-DOF Shadow Hands. Inspired by the human hand's underactuated mechanism, we establish UniFucGrasp, a universal functional grasp annotation strategy and dataset for multiple dexterous hand types. Based on biomimicry, it maps natural human motions to diverse hand structures and uses geometry-based force closure to ensure functional, stable, human-like grasps. This method supports low-cost, efficient collection of diverse, high-quality functional grasps. Finally, we establish the first multi-hand functional grasp dataset and provide a synthesis model to validate its effectiveness. Experiments on the UFG dataset, IsaacSim, and complex robotic tasks show that our method improves functional manipulation accuracy and grasp stability and adapts better across multiple robotic hands, helping to alleviate annotation cost and generalization challenges in dexterous grasping. The project page is at https://haochen611.github.io/UFG.
comment: Accepted to IEEE Robotics and Automation Letters (RA-L). The project page is at https://haochen611.github.io/UFG
HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving
Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.
Fine-grained Image Retrieval via Dual-Vision Adaptation AAAI2026
Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, these two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This brings the FGIR task, when performed with these adjusted features, closer to the task solved during pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.
comment: Accepted by AAAI2026
IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
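The parameter reduction quoted in the IMSE abstract can be cross-checked from the two parameter counts it reports. The snippet below is only an arithmetic sketch; the function and variable names are illustrative and do not come from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Parameter counts in millions, as quoted in the abstract.
muse_params_m = 0.513
imse_params_m = 0.427

reduction_pct = relative_reduction(muse_params_m, imse_params_m)
print(f"Parameter reduction: {reduction_pct:.1f}%")
```

Rounded to one decimal place, the result agrees with the 16.8% stated in the abstract.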
Hi-EF: Benchmarking Emotion Forecasting in Human-interaction
Affective Forecasting is a psychology task that involves predicting an individual's future emotional responses. It is often hampered by a reliance on external factors that leads to inaccuracies, and it typically remains at the stage of qualitative analysis. To address these challenges, we narrow the scope of Affective Forecasting by introducing the concept of Human-interaction-based Emotion Forecasting (EF). This task is set within the context of a two-party interaction, positing that an individual's emotions are significantly influenced by their interaction partner's emotional expressions and informational cues. This dynamic provides a structured perspective for exploring the patterns of emotional change, thereby enhancing the feasibility of emotion forecasting.
3D Motion Perception of Binocular Vision Target with PID-CNN
This article trains a network for perceiving the three-dimensional motion of a binocular vision target. The network provides real-time three-dimensional coordinates, velocity, and acceleration, and has a basic spatiotemporal perception capability. We interpret the ability of neural networks to fit nonlinear problems from the perspective of PID control: a single-layer neural network is viewed as using a second-order difference equation and a nonlinearity to describe a local problem, and multilayer networks gradually transform the raw representation into the desired representation through multiple such combinations. We analyse some reference principles for designing neural networks and design a relatively small PID convolutional neural network with a total of 17 layers and 413 thousand parameters. A simple but practical feature reuse method is implemented by concatenation and pooling. The network was trained and tested on simulated datasets of randomly moving balls, and the experimental results show that the prediction accuracy is close to the upper limit that the input image resolution can represent. We analyse the experimental results and errors, as well as the existing shortcomings and possible directions for improvement. Finally, we discuss the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization, as well as the potential advantages of using PID information to implement memory and attention mechanisms.
comment: 7 pages, 9 figures, 2 tables. The codes of this article have been released at: https://github.com/ShiJZ123/PID-CNN
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.
Off the Planckian Locus: Using 2D Chromaticity to Improve In-Camera Color
Traditional in-camera colorimetric mapping relies on correlated color temperature (CCT)-based interpolation between pre-calibrated transforms optimized for Planckian illuminants such as CIE A and D65. However, modern lighting technologies such as LEDs can deviate substantially from the Planckian locus, exposing the limitations of relying on conventional one-dimensional CCT for illumination characterization. This paper demonstrates that transitioning from 1D CCT (on the Planckian locus) to a 2D chromaticity space (off the Planckian locus) improves colorimetric accuracy across various mapping approaches. In addition, we replace conventional CCT interpolation with a lightweight multi-layer perceptron (MLP) that leverages 2D chromaticity features for robust colorimetric mapping under non-Planckian illuminants. A lightbox-based calibration procedure incorporating representative LED sources is used to train our MLP. Validated across diverse LED lighting, our method reduces angular reproduction error by 22% on average in LED-lit scenes, maintains backward compatibility with traditional illuminants, accommodates multi-illuminant scenes, and supports real-time in-camera deployment with negligible additional computational cost.
comment: Project page: https://cst-mlp.github.io
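The limitation of one-dimensional CCT described in the abstract above can be illustrated with a short sketch. It uses McCamy's cubic approximation for CCT from CIE 1931 (x, y) chromaticity, which is an assumption chosen for illustration and not necessarily the CCT estimator used in the paper. The second chromaticity is a hypothetical off-locus point constructed to lie on the same line through McCamy's epicenter as D65.

```python
def mccamy_cct(x: float, y: float) -> float:
    """Approximate CCT in kelvin from CIE 1931 xy chromaticity (McCamy's formula)."""
    n = (x - 0.3320) / (0.1858 - y)
    return 449.0 * n**3 + 3525.0 * n**2 + 6823.3 * n + 5520.33


# D65 white point, close to the Planckian locus.
d65 = (0.3127, 0.3290)

# Hypothetical off-locus chromaticity: D65 pushed 20% further away from
# McCamy's epicenter (0.3320, 0.1858) along the same line.
epicenter = (0.3320, 0.1858)
off_locus = (
    epicenter[0] + 1.2 * (d65[0] - epicenter[0]),
    epicenter[1] + 1.2 * (d65[1] - epicenter[1]),
)

cct_d65 = mccamy_cct(*d65)
cct_off = mccamy_cct(*off_locus)
print(f"D65:       xy={d65}, CCT ~ {cct_d65:.0f} K")
print(f"Off-locus: xy=({off_locus[0]:.4f}, {off_locus[1]:.4f}), CCT ~ {cct_off:.0f} K")
```

Both chromaticities map to essentially the same CCT under this approximation even though they are different colors, so a transform selected by CCT alone cannot tell them apart. A 2D chromaticity input keeps that distinction.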
AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.
comment: The experimental results in this paper have been further improved and updated; the baseline results do not match existing results, therefore the paper needs to be retracted
T3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization
Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appear at forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.
comment: The experimental results in this paper have been further improved and updated; the baseline results do not match existing results, therefore the paper needs to be retracted
Image and Video Processing
Disentangling Progress in Medical Image Registration: Beyond Trend-Driven Architectures towards Domain-Specific Strategies
Medical image registration drives quantitative analysis across organs, modalities, and patient populations. Recent deep learning methods often combine low-level "trend-driven" computational blocks from computer vision, such as large-kernel CNNs, Transformers, and state-space models, with high-level registration-specific designs like motion pyramids, correlation layers, and iterative refinement. Yet, their relative contributions remain unclear and entangled. This raises a central question: should future advances in registration focus on importing generic architectural trends or on refining domain-specific design principles? Through a modular framework spanning brain, lung, cardiac, and abdominal registration, we systematically disentangle the influence of these two paradigms. Our evaluation reveals that low-level "trend-driven" computational blocks offer only marginal or inconsistent gains, while high-level registration-specific designs consistently deliver more accurate, smoother, and more robust deformations. These domain priors significantly elevate the performance of a standard U-Net baseline, far more than variants incorporating "trend-driven" blocks, achieving an average relative improvement of $\sim3\%$. All models and experiments are released within a transparent, modular benchmark that enables plug-and-play comparison for new architectures and registration tasks (https://github.com/BailiangJ/rethink-reg). This dynamic and extensible platform establishes a common ground for reproducible and fair evaluation, inviting the community to isolate genuine methodological contributions from domain priors. Our findings advocate a shift in research emphasis: from following architectural trends to embracing domain-specific design principles as the true drivers of progress in learning-based medical image registration.
comment: Submitted to Medical Image Analysis. Journal Extension of arXiv:2407.19274
A unified framework for geometry-independent operator learning in cardiac electrophysiology simulations
Accurate maps of atrial electrical activation are essential for personalised treatment of arrhythmias, yet biophysically detailed simulations remain computationally intensive for real-time clinical use or population-scale analyses. Here we introduce a geometry-independent operator-learning framework that predicts local activation time (LAT) fields across diverse left atrial anatomies with near-instantaneous inference. We generated a dataset of 308,700 simulations using a GPU-accelerated electrophysiology solver, systematically varying multiple pacing sites and physiologically varied conduction properties across 147 patient-specific geometries derived from two independent clinical cohorts. All anatomical and functional data are expressed in a Universal Atrium Coordinate system, providing a consistent representation that decouples electrophysiological patterns from mesh topology. Within this coordinate space, we designed a neural operator with a vision-transformer backbone to learn the mapping from structural and electrophysiological inputs to LAT fields. With a mean prediction error of 5.1 ms over a 455 ms maximum simulation time, the model outperforms established operator-learning approaches and performs inference in 0.12 ms per sample. Our framework establishes a general strategy for learning domain-invariant biophysical mappings across variable anatomical domains and enables integration of computational electrophysiology into real-time and large-scale clinical workflows.
In-Context Learning for Deep Joint Source-Channel Coding Over MIMO Channels
Large language models have demonstrated the ability to perform in-context learning (ICL), whereby the model performs predictions by directly mapping the query and a few examples from the given task to the output variable. In this paper, we study ICL for deep joint source-channel coding (DeepJSCC) in image transmission over multiple-input multiple-output (MIMO) systems, where an ICL denoiser is employed for MIMO symbol estimation. We first study the transceiver without any hardware impairments and explore the integration of transformer-based ICL with DeepJSCC in both open-loop and closed-loop MIMO systems, depending on the availability of channel state information (CSI) at the transceiver. For both open-loop and closed-loop scenarios, we propose two MIMO transceiver architectures that leverage context information, i.e., pilot sequences and their outputs, as additional inputs, enabling the DeepJSCC encoder, DeepJSCC decoder, and the ICL denoiser to jointly learn encoding, decoding, and estimation strategies tailored to each channel realization. Next, we extend our study to a more challenging scenario where the transceiver suffers from in-phase and quadrature (IQ) imbalance, resulting in nonlinear MIMO estimation. In this case, the context information is also exploited, facilitating joint learning across the DeepJSCC encoder, decoder, and the ICL denoiser under hardware impairments and varying channel conditions. Numerical results demonstrate that the ICL denoiser for MIMO estimation significantly outperforms the conventional least-squares method, with even greater advantages under IQ imbalance. Moreover, the proposed transformer-based ICL framework, integrated with contextual information, achieves significant improvements in end-to-end image reconstruction quality under transceiver IQ imbalance.
Spatiotemporal Pyramid Flow Matching for Climate Emulation
Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code are publicly available at https://github.com/stanfordmlgroup/spf .
TT-Stack: A Transformer-Based Tiered-Stacking Ensemble Framework with Meta-Learning for Automated Breast Cancer Detection in Mammography
Breast cancer continues to be the second most common cause of cancer-related deaths around the world, with early detection being important to improve survival rates for patients. Traditional computer-aided diagnosis systems have limitations in their ability to represent features and generalize to the range of mammographic images. We present a new two-level Stack of Transformers (TT-Stack) ensemble framework based on using heterogeneous lightweight vision transformer architectures to automatically identify breast cancer in mammograms. Specifically, we integrate seven state-of-the-art vision transformers: RepViT, DaViT, EfficientViT, MobileViT, FasterViT, MViT, and PVT v2 while also designing a two-tier meta-learning approach for the ensemble by simply taking the logits from the base model and applying logistic regression for binary classification (Cancer vs. Non-Cancer). Each of the transformer backbone models was developed to process single-channel grayscale mammograms while still taking advantage of transfer learning from pre-training on ImageNet so that they would offer a parameter-efficient approach that may reasonably be applied in clinical practice with minimal variance. The training process included stratified 80/20 splits when necessary, class-balanced upsampling, early stopping, and an adaptive learning rate schedule on the public Mammogram Mastery dataset. In separate evaluations here, it was determined that EfficientViT and PVT-v2 were the top-performing models achieving 99.33% validation, 97.96% F1-score, and perfect 1.000:0 ROC-AUC with only small train/validation gaps. Finally, the TT-Stack ensemble model by the end of the evaluation reached 99.33% accuracy with 100% precision, 96% recall, 97.96% F1-score and a 99.97% ROC-AUC, and demonstrated robustness in performance due to the diversity of the architecture.
comment: This paper contains 15 pages with 23 figures and 4 tables. This Paper is already accepted in IEEE Computational Intelligence Magazine (CIM)
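The two-tier idea in the TT-Stack abstract, a logistic regression fitted on the logits of the base models, can be sketched in plain Python. This is a minimal illustration under stated assumptions and not the authors' implementation: the toy logits, the three hypothetical base models, the learning rate, and all function names are invented for the example. The last helper only cross-checks the reported F1-score against the reported precision and recall.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(base_logits, labels, lr=0.1, epochs=500):
    """Fit logistic regression on stacked base-model logits by batch gradient descent."""
    n_models = len(base_logits[0])
    weights, bias = [0.0] * n_models, 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_models, 0.0
        for x, y in zip(base_logits, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(n_models):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x) -> int:
    """Return 1 (cancer) or 0 (non-cancer) for one row of base-model logits."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0


def f1_from_precision_recall(precision: float, recall: float) -> float:
    return 2.0 * precision * recall / (precision + recall)


# Toy data: one row per image, one column per hypothetical base model.
toy_logits = [
    [2.0, 1.5, 1.8], [1.0, 2.2, 0.5], [-1.5, -2.0, -0.7],
    [-0.3, -1.2, -2.1], [0.8, -0.2, 1.1], [-0.9, 0.3, -1.4],
]
toy_labels = [1, 1, 0, 0, 1, 0]

weights, bias = fit_meta_learner(toy_logits, toy_labels)
toy_predictions = [predict(weights, bias, x) for x in toy_logits]
print("meta-learner predictions:", toy_predictions)
print(f"F1 from P=1.00, R=0.96: {f1_from_precision_recall(1.0, 0.96):.4f}")
```

The meta-learner can still classify a sample correctly when one base model disagrees with the others, as in the last two toy rows. The F1 helper gives about 0.9796 for 100% precision and 96% recall, in line with the 97.96% F1-score the abstract reports for the ensemble.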
Comparing Baseline and Day-1 Diffusion MRI Using Multimodal Deep Embeddings for Stroke Outcome Prediction
This study compares baseline (J0) and 24-hour (J1) diffusion magnetic resonance imaging (MRI) for predicting three-month functional outcomes after acute ischemic stroke (AIS). Seventy-four AIS patients with paired apparent diffusion coefficient (ADC) scans and clinical data were analyzed. Three-dimensional ResNet-50 embeddings were fused with structured clinical variables, reduced via principal component analysis (<=12 components), and classified using linear support vector machines with eight-fold stratified group cross-validation. J1 multimodal models achieved the highest predictive performance (AUC = 0.923 +/- 0.085), outperforming J0-based configurations (AUC <= 0.86). Incorporating lesion-volume features further improved model stability and interpretability. These findings demonstrate that early post-treatment diffusion MRI provides superior prognostic value to pre-treatment imaging and that combining MRI, clinical, and lesion-volume features produces a robust and interpretable framework for predicting three-month functional outcomes in AIS patients.
comment: 5 pages, 5 figures, 2 tables
FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection, Segmentation, and Counting
Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR) with a complementary scalable paradigm leveraging sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvement generalizes well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision. The project page for FishDetector-R1 is https://umfieldrobotics.github.io/FishDetector-R1.
comment: 18 pages, under review
GBT-SAM: A Parameter-Efficient Depth-Aware Model for Generalizable Brain tumour Segmentation on mp-MRI
Gliomas are aggressive brain tumors that require accurate imaging-based diagnosis, with segmentation playing a critical role in evaluating morphology and treatment decisions. Manual delineation of gliomas is time-consuming and prone to variability, motivating the use of deep learning to improve consistency and alleviate clinical workload. However, existing methods often fail to fully exploit the information available in multi-parametric MRI (mp-MRI), particularly inter-slice contextual features, and typically require considerable computational resources while lacking robustness across tumor type variations. We present GBT-SAM, a parameter-efficient deep learning framework that adapts the Segment Anything Model (SAM), a large-scale vision model, to volumetric mp-MRI data. GBT-SAM reduces input complexity by selecting fewer than 2.6% of slices per scan while incorporating all four MRI modalities, preserving essential tumor-related information with minimal cost. Furthermore, our model is trained by a two-step fine-tuning strategy that incorporates a depth-aware module to capture inter-slice correlations and lightweight adaptation layers, resulting in just 6.5M trainable parameters, which is the lowest among SAM-based approaches. GBT-SAM achieves a Dice Score of 93.54 on the BraTS Adult Glioma dataset and demonstrates robust performance on Meningioma, Pediatric Glioma, and Sub-Saharan Glioma datasets. These results highlight GBT-SAM's potential as a computationally efficient and domain-robust framework for brain tumor segmentation using mp-MRI. Our code and models are available at https://github.com/vpulab/med-sam-brain .
SCOPE-MRI: Bankart Lesion Detection as a Case Study in Data Curation and Deep Learning for Challenging Diagnoses
Deep learning has shown strong performance in musculoskeletal imaging, but prior work has largely targeted conditions where diagnosis is relatively straightforward. More challenging problems remain underexplored, such as detecting Bankart lesions (anterior-inferior glenoid labral tears) on standard MRIs. These lesions are difficult to diagnose due to subtle imaging features, often necessitating invasive MRI arthrograms (MRAs). We introduce ScopeMRI, the first publicly available, expert-annotated dataset for shoulder pathologies, and present a deep learning framework for Bankart lesion detection on both standard MRIs and MRAs. ScopeMRI contains shoulder MRIs from patients who underwent arthroscopy, providing ground-truth labels from intraoperative findings, the diagnostic gold standard. Separate models were trained for MRIs and MRAs using CNN- and transformer-based architectures, with predictions ensembled across multiple imaging planes. Our models achieved radiologist-level performance, with accuracy on standard MRIs surpassing radiologists interpreting MRAs. External validation on independent hospital data demonstrated initial generalizability across imaging protocols. By releasing ScopeMRI and a modular codebase for training and evaluation, we aim to accelerate research in musculoskeletal imaging and foster development of datasets and models that address clinically challenging diagnostic tasks.
comment: This version of the article has been accepted for publication at Nature Partner Journal (NPJ) Artificial Intelligence, but is not the Version of Record and does not reflect post-acceptance improvements or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1038/s44387-025-00043-5
MRI Super-Resolution with Deep Learning: A Comprehensive Survey
High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.
comment: 41 pages
Self-Supervised One-Step Diffusion Refinement for Snapshot Compressive Imaging
Snapshot compressive imaging (SCI) captures multispectral images (MSIs) using a single coded two-dimensional (2-D) measurement, but reconstructing high-fidelity MSIs from these compressed inputs remains a fundamentally ill-posed challenge. While diffusion-based reconstruction methods have recently raised the bar for quality, they face critical limitations: a lack of large-scale MSI training data, adverse domain shifts from RGB-pretrained models, and inference inefficiencies due to multi-step sampling. These drawbacks restrict their practicality in real-world applications. In contrast to existing methods, which either follow costly iterative refinement or adapt subspace-based embeddings for diffusion models (e.g. DiffSCI, PSR-SCI), we introduce a fundamentally different paradigm: a self-supervised One-Step Diffusion (OSD) framework specifically designed for SCI. The key novelty lies in using a single-step diffusion refiner to correct an initial reconstruction, eliminating iterative denoising entirely while preserving generative quality. Moreover, we adopt a self-supervised equivariant learning strategy to train both the predictor and refiner directly from raw 2-D measurements, enabling generalization to unseen domains without the need for ground-truth MSI. To further address the challenge of limited MSI data, we design a band-selection-driven distillation strategy that transfers core generative priors from large-scale RGB datasets, effectively bridging the domain gap. Extensive experiments confirm that our approach sets a new benchmark, yielding PSNR gains of 3.44 dB, 1.61 dB, and 0.28 dB on the Harvard, NTIRE, and ICVL datasets, respectively, while reducing reconstruction time by 97.5%. This remarkable improvement in efficiency and adaptability makes our method a significant advancement in SCI reconstruction, combining both accuracy and practicality for real-world deployment.
TomoGraphView: 3D Medical Image Classification with Omnidirectional Slice Representations and Graph Neural Networks
The sharp rise in medical tomography examinations has created a demand for automated systems that can reliably extract informative features for downstream tasks such as tumor characterization. Although 3D volumes contain richer information than individual slices, effective 3D classification remains difficult: volumetric data encode complex spatial dependencies, and the scarcity of large-scale 3D datasets has constrained progress toward 3D foundation models. As a result, many recent approaches rely on 2D vision foundation models trained on natural images, repurposing them as feature extractors for medical scans with surprisingly strong performance. Despite their practical success, current methods that apply 2D foundation models to 3D scans via slice-based decomposition remain fundamentally limited. Standard slicing along axial, sagittal, and coronal planes often fails to capture the true spatial extent of a structure when its orientation does not align with these canonical views. More critically, most approaches aggregate slice features independently, ignoring the underlying 3D geometry and losing spatial coherence across slices. To overcome these limitations, we propose TomoGraphView, a novel framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation. Instead of restricting the model to axial, sagittal, or coronal planes, our method samples both canonical and non-canonical cross-sections generated from uniformly distributed points on a sphere enclosing the volume. We publicly share our accessible code base at http://github.com/compai-lab/2025-MedIA-kiechle and provide a user-friendly library for omnidirectional volume slicing at https://pypi.org/project/OmniSlicer.
comment: Preprint submitted to Medical Image Analysis (MedIA)
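The omnidirectional sampling idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the uniformly distributed sphere points are generated, so the Fibonacci lattice used here is an assumption, as are the function name `fibonacci_sphere` and the choice of 24 directions. Each returned unit vector could serve as the normal of one cross-section through the volume center.

```python
import math

def fibonacci_sphere(n):
    """Return n approximately uniformly distributed unit vectors.

    Assumption: a Fibonacci lattice is used for illustration only; the
    paper's actual point-distribution scheme is not given in the abstract.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n        # evenly spaced heights in (-1, 1)
        radius = math.sqrt(1.0 - z * z)      # radius of the circle at height z
        theta = golden_angle * i             # rotate by the golden angle each step
        points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points

# Hypothetical usage: 24 candidate slicing-plane normals.
normals = fibonacci_sphere(24)
```

Canonical axial, sagittal, and coronal normals would have to be added separately, since the lattice does not contain them exactly.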
Deep Learning for Restoring MPI System Matrices Using Simulated Training Data
Magnetic particle imaging reconstructs tracer distributions using a system matrix obtained through time-consuming, noise-prone calibration measurements. Methods for addressing imperfections in measured system matrices increasingly rely on deep neural networks, yet curated training data remain scarce. This study evaluates whether physics-based simulated system matrices can be used to train deep learning models for different system matrix restoration tasks, i.e., denoising, accelerated calibration, upsampling, and inpainting, that generalize to measured data. A large dataset of system matrices was generated using an equilibrium magnetization model extended with uniaxial anisotropy. The dataset spans particle, scanner, and calibration parameters for 2D and 3D trajectories, and includes background noise injected from empty-frame measurements. For each restoration task, deep learning models were compared with classical non-learning baseline methods. The models trained solely on simulated system matrices generalized to measured data across all tasks: for denoising, DnCNN/RDN/SwinIR outperformed the DCT-F baseline by >10 dB PSNR and up to 0.1 SSIM on simulations and led to perceptually better reconstructions of real data; for 2D upsampling, SMRnet exceeded bicubic by 20 dB PSNR and 0.08 SSIM at $\times 2$-$\times 4$, although this gain did not transfer qualitatively to real measurements. For 3D accelerated calibration, SMRnet matched tricubic in noiseless cases and was more robust under noise, and for 3D inpainting, biharmonic inpainting was superior when noise-free but degraded with noise, while a PConvUNet maintained quality and yielded less blurry reconstructions. The demonstrated transferability of deep learning models trained on simulations to real measurements mitigates the data-scarcity problem and enables the development of new methods beyond current measurement capabilities.
U-FaceBP: Uncertainty-aware Bayesian Ensemble Deep Learning for Face Video-based Blood Pressure Measurement
Blood pressure (BP) measurement is crucial for daily health assessment. Remote photoplethysmography (rPPG), which extracts pulse waves from face videos captured by a camera, has the potential to enable convenient BP measurement without specialized medical devices. However, there are various uncertainties in BP estimation using rPPG, leading to limited estimation performance and reliability. In this paper, we propose U-FaceBP, an uncertainty-aware Bayesian ensemble deep learning method for face video-based BP measurement. U-FaceBP models aleatoric and epistemic uncertainties in face video-based BP estimation with a Bayesian neural network (BNN). Additionally, we design U-FaceBP as an ensemble method, estimating BP from rPPG signals, PPG signals derived from face videos, and face images using multiple BNNs. Large-scale experiments on two datasets involving 1197 subjects from diverse racial groups demonstrate that U-FaceBP outperforms state-of-the-art BP estimation methods. Furthermore, we show that the uncertainty estimates provided by U-FaceBP are informative and useful for guiding modality fusion, assessing prediction reliability, and analyzing performance across racial groups.
comment: This work has been submitted to the IEEE for possible publication
AgriPotential: A Novel Multi-Spectral and Multi-Temporal Remote Sensing Dataset for Agricultural Potentials
Remote sensing has emerged as a critical tool for large-scale Earth monitoring and land management. In this paper, we introduce AgriPotential, a novel benchmark dataset composed of Sentinel-2 satellite imagery captured over multiple months. The dataset provides pixel-level annotations of agricultural potentials for three major crop types - viticulture, market gardening, and field crops - across five ordinal classes. AgriPotential supports a broad range of machine learning tasks, including ordinal regression, multi-label classification, and spatio-temporal modeling. The data cover diverse areas in Southern France, offering rich spectral information. AgriPotential is the first public dataset designed specifically for agricultural potential prediction, aiming to improve data-driven approaches to sustainable land use planning. The dataset and the code are freely accessible at: https://zenodo.org/records/15551829
comment: Accepted at CBMI 2025
UniFucGrasp: Human-Hand-Inspired Unified Functional Grasp Annotation Strategy and Dataset for Diverse Dexterous Hands
Dexterous grasp datasets are vital for embodied intelligence, but most emphasize grasp stability and ignore the functional grasps needed for tasks like opening bottle caps or holding cup handles. Most also rely on bulky, costly, and hard-to-control high-DOF Shadow Hands. Inspired by the human hand's underactuated mechanism, we establish UniFucGrasp, a universal functional grasp annotation strategy and dataset for multiple dexterous hand types. Based on biomimicry, it maps natural human motions to diverse hand structures and uses geometry-based force closure to ensure functional, stable, human-like grasps. This method supports low-cost, efficient collection of diverse, high-quality functional grasps. Finally, we establish the first multi-hand functional grasp dataset and provide a synthesis model to validate its effectiveness. Experiments on the UFG dataset, IsaacSim, and complex robotic tasks show that our method improves functional manipulation accuracy and grasp stability and adapts better across multiple robotic hands, helping to alleviate annotation cost and generalization challenges in dexterous grasping. The project page is at https://haochen611.github.io/UFG.
comment: Accepted to IEEE Robotics and Automation Letters (RA-L). The project page is at https://haochen611.github.io/UFG
Time-causal and time-recursive wavelets
When applying wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissible class of kernels, with their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 28 pages, 11 figures
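To make the cascade of truncated exponential kernels in the abstract above concrete, the following is a minimal sketch of a time-causal and time-recursive smoothing pipeline. It assumes a first-order recursive filter as the discrete counterpart of a truncated exponential kernel and a backward difference as the temporal derivative. The time constants 1, 2, and 4 are arbitrary illustrative values, not the specific geometric choice that yields the time-causal limit kernel, and all function names are hypothetical.

```python
def recursive_smooth(signal, mu):
    """First-order recursive filter with time constant mu (unit DC gain).

    Only the previous output is stored, so the filter is time-causal
    and time-recursive: no access to future samples, no signal buffer.
    """
    out = []
    prev = 0.0
    for x in signal:
        prev = prev + (x - prev) / (1.0 + mu)
        out.append(prev)
    return out

def cascade_smooth(signal, time_constants):
    """Apply several first-order recursive filters in cascade."""
    for mu in time_constants:
        signal = recursive_smooth(signal, mu)
    return signal

def backward_difference(signal):
    """Time-causal first-order temporal derivative approximation."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out

# Impulse response of the smoothing cascade and of its temporal derivative.
impulse = [0.0] * 400
impulse[5] = 1.0
smoothed = cascade_smooth(impulse, [1.0, 2.0, 4.0])   # illustrative time constants
wavelet_response = backward_difference(smoothed)
```

The smoothed impulse response is zero before the impulse and sums to one, while the derivative response sums to approximately zero, which is the zero-mean property that the admissibility condition requires.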
Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation
In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset, FaceCaptionMask-1M, comprising approximately one million image-text-mask pairs that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model's embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that Face-MakeUpV2 achieves the best overall performance in terms of preserving face ID and maintaining physical consistency with the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.
comment: Some errors in the critical data presented in Table 1 and Table 2
Full-scale Representation Guided Network for Retinal Vessel Segmentation
The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full-Scale Guided Network (FSG-Net), where a novel feature representation module using modernized convolution blocks effectively captures full-scale structural information, while a guided convolution block subsequently refines this information. Specifically, we introduce an attention-guided filter within the guided convolution block, leveraging its similarity to unsharp masking to enhance fine vascular structures. Passing full-scale information to the attention block facilitates the generation of more contextually relevant attention maps, which are then passed to the attention-guided filter, providing further refinement to the segmentation performance. The structure preceding the guided convolution block can be replaced by any U-Net variant, ensuring flexibility and scalability across various segmentation tasks. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments demonstrate that, despite its compact architecture, FSG-Net delivers performance competitive with SOTA methods across multiple public datasets. Ablation studies further demonstrate that each proposed component meaningfully contributes to this competitive performance. Our code is available on https://github.com/ZombaSY/FSG-Net-pytorch.
comment: 12 pages, 7 figures
Random forest-based out-of-distribution detection for robust lung cancer segmentation SP
Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining have achieved strong performance on in-distribution (ID) data but often generalize poorly on out-of-distribution (OOD) inputs. We investigate this behavior for lung cancer segmentation using an encoder-decoder model. Our encoder is a Swin Transformer pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans spanning cancerous and non-cancerous conditions, and the decoder was randomly initialized. This model was evaluated on an independent ID test set and four OOD scenarios, including chest CT cohorts (pulmonary embolism and negative COVID-19) and abdomen CT cohorts (kidney cancers and non-cancerous pancreas). OOD detection was performed at the scan level using RF-Deep, a random forest classifier applied to contextual tumor-anchored feature representations. We evaluated 920 3D CTs (172,650 images) and observed that RF-Deep achieved FPR95 values of 18.26% and 27.66% on the chest CT cohorts, and near-perfect detection (less than 0.1% FPR95) on the abdomen CT cohorts, consistently outperforming established OOD methods. These results demonstrate that our RF-Deep classifier provides a simple, lightweight, and effective approach for enhancing the reliability of segmentation models in clinical deployment.
comment: Accepted at SPIE Medical Imaging 2026
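The FPR95 metric reported in the abstract above is the false positive rate at the score threshold that retains 95% of the positive class. A minimal sketch follows; it assumes higher scores mean "more positive", and which class counts as positive (in-distribution or OOD) is a convention the abstract does not state. The function name and the toy scores are hypothetical.

```python
import math

def fpr_at_tpr(pos_scores, neg_scores, tpr=0.95):
    """False positive rate at the threshold that accepts `tpr` of the positives.

    Assumption: a sample is predicted positive when score >= threshold.
    """
    sorted_pos = sorted(pos_scores, reverse=True)
    # Number of positives that must be accepted to reach the target TPR
    # (the small epsilon guards against floating-point overshoot in ceil).
    k = math.ceil(tpr * len(sorted_pos) - 1e-9)
    threshold = sorted_pos[k - 1]
    false_positives = sum(1 for s in neg_scores if s >= threshold)
    return false_positives / len(neg_scores)

# Toy scores: 100 positives scored 1..100 and 10 negatives.
positives = [float(s) for s in range(1, 101)]
negatives = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 3.0, 5.0, 1.0, 7.0]
example_fpr95 = fpr_at_tpr(positives, negatives)
```

In this toy case the threshold lands at a score of 6, so four of the ten negatives are accepted and the FPR95 is 0.4. Lower values are better.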
Malaria detection using Deep Convolution Neural Network
The latest WHO report showed that the number of malaria cases climbed to 219 million last year, two million higher than the year before. The global efforts to fight malaria have hit a plateau, and the most significant underlying reason is that international funding has declined. Malaria, which is spread to people through the bites of infected female mosquitoes, occurs in 91 countries, but about 90% of the cases and deaths are in sub-Saharan Africa. The disease killed 435,000 people last year, the majority of them children under five in Africa. AI-backed technology has revolutionized malaria detection in some regions of Africa, and the future impact of such work can be revolutionary. The malaria Cell Image Dataset is taken from the official NIH website. The dataset was collected to reduce the burden on microscopists in resource-constrained regions and to improve diagnostic accuracy using an AI-based algorithm to detect and segment red blood cells. The goal of this work is to show that state-of-the-art accuracy can be obtained even with a 2-layer convolutional network, establishing a new baseline for AI-based malaria detection efforts.
Multimedia
MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark
Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code (https://github.com/Gatsby-web/MAC_SLU) and datasets (huggingface.co/datasets/Gatsby1984/MAC_SLU) are released publicly.
PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis AAAI 2026
Multimodal sentiment analysis (MSA) is a research field that recognizes human sentiments by combining textual, visual, and audio modalities. The main challenge lies in integrating sentiment-related information from different modalities, which typically arises during the unimodal feature extraction phase and the multimodal feature fusion phase. Existing methods extract only shallow information from unimodal features during the extraction phase, neglecting sentimental differences across different personalities. During the fusion phase, they directly merge the feature information from each modality without considering differences at the feature level. This ultimately affects the model's recognition performance. To address this problem, we propose a personality-sentiment aligned multi-level fusion framework. We introduce personality traits during the feature extraction phase and propose a novel personality-sentiment alignment method to obtain personalized sentiment embeddings from the textual modality for the first time. In the fusion phase, we introduce a novel multi-level fusion method. This method gradually integrates sentimental information from textual, visual, and audio modalities through multimodal pre-fusion and a multi-level enhanced fusion strategy. Our method has been evaluated through multiple experiments on two commonly used datasets, achieving state-of-the-art results.
comment: AAAI 2026 accepted
ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation
Fine-tuning pre-trained speech foundation models for Automatic Speech Recognition (ASR) is prevalent, yet constrained by substantial GPU memory requirements. We introduce ZO-ASR, a memory-efficient Zeroth-Order (ZO) method that avoids Back-Propagation (BP) and activation memory by estimating gradients via forward passes. When combined with the SGD optimizer, ZO-ASR-SGD fine-tunes ASR models using only inference memory. Our evaluation spans supervised and unsupervised tasks. For Supervised Domain Adaptation on Whisper-Large-V3, ZO-ASR's multiple-query mechanism enhances robustness and achieves up to an 18.9% relative Word Error Rate reduction over zero-shot baselines, outperforming existing ZO methods. For unsupervised Test-Time Adaptation on Wav2Vec2-Base, ZO-ASR exhibits moderately lower performance than the first-order optimizer Adam. Our BP-free approach provides a viable solution for fine-tuning ASR models in computationally resource-constrained or gradient-inaccessible scenarios.
comment: 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
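The idea of estimating gradients from forward passes alone, as described in the abstract above, can be sketched with a generic two-point zeroth-order estimator averaged over several random queries. This is a sketch under stated assumptions: the Gaussian perturbations, the averaging rule, the hyperparameters, and the toy quadratic loss are illustrative choices, not the paper's exact ZO-ASR procedure, and all names are hypothetical.

```python
import random

def zo_gradient(loss_fn, params, rng, eps=1e-3, num_queries=4):
    """Estimate the gradient of loss_fn at params using forward passes only.

    Each query perturbs the parameters along a random direction z and uses
    the central difference (L(p + eps*z) - L(p - eps*z)) / (2*eps) as the
    directional derivative along z.
    """
    grad = [0.0] * len(params)
    for _ in range(num_queries):
        z = [rng.gauss(0.0, 1.0) for _ in params]
        plus = [p + eps * zi for p, zi in zip(params, z)]
        minus = [p - eps * zi for p, zi in zip(params, z)]
        directional = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += directional * zi / num_queries
    return grad

def zo_sgd(loss_fn, params, lr=0.05, steps=300, seed=0):
    """Plain SGD driven by the zeroth-order gradient estimate."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, params, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

def toy_loss(params):
    """Hypothetical stand-in for a model loss: minimum at all-ones."""
    return sum((p - 1.0) ** 2 for p in params)

start = [0.0, 0.0, 0.0]
tuned = zo_sgd(toy_loss, start)
```

No backward pass or stored activations are needed, which is where the memory saving comes from. Averaging over more queries lowers the variance of the estimate at the cost of extra forward passes.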
MindFuse: Towards GenAI Explainability in Marketing Strategy Co-Creation
The future of digital marketing lies in the convergence of human creativity and generative AI, where insight, strategy, and storytelling are co-authored by intelligent systems. We present MindFuse, a brave new explainable generative AI framework designed to act as a strategic partner in the marketing process. Unlike conventional LLM applications that stop at content generation, MindFuse fuses CTR-based content AI-guided co-creation with large language models to extract, interpret, and iterate on communication narratives grounded in real advertising data. MindFuse operates across the full marketing lifecycle: from distilling content pillars and customer personas from competitor campaigns to recommending in-flight optimizations based on live performance telemetry. It uses attention-based explainability to diagnose ad effectiveness and guide content iteration, while aligning messaging with strategic goals through dynamic narrative construction and storytelling. We introduce a new paradigm in GenAI for marketing, where LLMs not only generate content but reason through it, adapt campaigns in real time, and learn from audience engagement patterns. Our results, validated in agency deployments, demonstrate up to 12 times efficiency gains, setting the stage for future integration with empirical audience data (e.g., GWI, Nielsen) and full-funnel attribution modeling. MindFuse redefines AI not just as a tool, but as a collaborative agent in the creative and strategic fabric of modern marketing.
VIVAT: Virtuous Improving VAE Training through Artifact Mitigation
Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts - color shift, grid patterns, blur, corner and droplet artifacts - and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.
RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation
Recently, significant advancements have been achieved in video generation technology, but applying it to resource-constrained downstream tasks like multi-frame animated sticker generation (ASG), which is characterized by low frame rates, abstract semantics, and a long-tail frame length distribution, remains challenging. Parameter-efficient fine-tuning (PEFT) techniques (e.g., Adapter, LoRA) for large pre-trained models suffer from insufficient fitting ability and source-domain knowledge interference. In this paper, we propose the Resource-Efficient Dual-Mask Training Framework (RDTF), a dedicated solution for the multi-frame ASG task under resource constraints. We argue that training a compact model from scratch with million-level samples outperforms PEFT on large models, with RDTF realizing this via three core designs: 1) a Discrete Frame Generation Network (DFGN) optimized for low-frame-rate ASG, ensuring parameter efficiency; 2) a dual-mask based data utilization strategy to enhance the availability and diversity of limited data; 3) a difficulty-adaptive curriculum learning method that decomposes sample entropy into static and adaptive components, enabling easy-to-difficult training convergence. To provide high-quality data support for RDTF's training from scratch, we construct VSD2M, a million-level multi-modal animated sticker dataset with rich annotations (static and animated stickers, action-focused text descriptions), filling the gap in dedicated animated data for the ASG task. Experiments demonstrate that RDTF is quantitatively and qualitatively superior to state-of-the-art PEFT methods (e.g., I2V-Adapter, SimDA) on ASG tasks, verifying the feasibility of our framework under resource constraints.
comment: Submitted to TMM
An Efficient Recommendation System in E-commerce using Passer learning optimization based on Bi-LSTM
Online reviews play a crucial role in shaping consumer decisions, especially in the context of e-commerce. However, the quality and reliability of these reviews can vary significantly. Some reviews contain misleading or unhelpful information, such as advertisements, fake content, or irrelevant details. These issues pose significant challenges for recommendation systems, which rely on user-generated reviews to provide personalized suggestions. This article introduces a recommendation system based on a Passer Learning Optimization-enhanced Bi-LSTM classifier, applicable to e-commerce recommendation systems, with improved accuracy and efficiency compared to state-of-the-art models. It achieves as low as 1.24% MSE on the baby dataset. This lifts it as high as 88.58%. The system also performs robustly on the digital music and patio lawn garden datasets, with F1 scores of 88.46% and 92.51%, respectively. These results, made possible by advanced graph embedding for effective knowledge extraction and fine-tuning of classifier parameters, establish the suitability of the proposed model in various e-commerce environments.
comment: 22 pages, 5 figures, 4 tables
Fine-grained Image Retrieval via Dual-Vision Adaptation AAAI2026
Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, these two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This brings the FGIR task, when performed with these adjusted features, closer to the task solved during pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer, which transfers the discriminative knowledge in the object-perceptual adaptation to the image encoder through knowledge distillation. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.
comment: Accepted by AAAI2026
DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
Egocentric Video Question Answering (Egocentric VideoQA), which refers to answering questions based on first-person videos, plays an important role in egocentric video understanding. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an Egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply a contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA, and 13.2% on QAEGO4D, both reaching state-of-the-art performance.
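The contrastive objective described in the abstract above, which pulls original features toward positive samples and pushes them away from negative samples, can be sketched as an InfoNCE-style loss. The abstract does not give the exact loss, so the cosine similarity, the temperature of 0.1, and the toy feature vectors are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss with one positive and a list of negatives.

    The loss is small when the anchor is close to the positive and far
    from every negative, and grows when a negative is the closer match.
    """
    pos_term = math.exp(cosine(anchor, positive) / temperature)
    neg_term = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos_term / (pos_term + neg_term))

# Hypothetical features: the paraphrased sample lies near the original,
# the counterfactual negative is orthogonal to it.
original = [1.0, 0.0]
paraphrase = [0.9, 0.1]
counterfactual = [0.0, 1.0]
aligned_loss = contrastive_loss(original, paraphrase, [counterfactual])
swapped_loss = contrastive_loss(original, counterfactual, [paraphrase])
```

With these toy vectors, `aligned_loss` is much smaller than `swapped_loss`, which is the behavior that minimizing the loss would encourage.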
Computation and Language
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.
comment: 10 pages, 5 figures
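The idea of evaluating two scale factors per block, as described in the abstract above, can be sketched as follows. Each block is quantized twice, once with its largest magnitude mapped to the FP4 value 6 and once mapped to 4, and the candidate with the lower error is kept. Several simplifications are assumptions: mean squared error is used as the selection criterion, the scale factor is kept in full precision instead of being stored in a low-precision format, rounding ties go to the smaller magnitude, and the function names are hypothetical.

```python
# Non-negative magnitudes representable in FP4 (E2M1).
FP4_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_to_fp4(x):
    """Round x to the nearest FP4 magnitude, keeping its sign."""
    magnitude = min(FP4_VALUES, key=lambda v: abs(abs(x) - v))
    return -magnitude if x < 0 else magnitude

def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max,
    round every element to FP4, and scale back."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    scale = amax / target_max
    return [round_to_fp4(x / scale) * scale for x in block]

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def four_over_six(block):
    """Keep whichever of the two candidate scalings has the lower error."""
    candidates = [quantize_block(block, m) for m in (6.0, 4.0)]
    return min(candidates, key=lambda q: mse(block, q))

# 4.9 falls in the wide gap between 4 and 6 when the block maximum maps to 6.
near_max_block = [6.0, 4.9]
small_value_block = [6.0, 1.0]
```

For `near_max_block`, scaling to 6 rounds 4.9 down to 4.0, while scaling to 4 represents it as 4.5, so the second candidate wins. For `small_value_block`, the standard scaling to 6 is already exact and is kept.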
The Art of Scaling Test-Time Compute for Large Language Models
Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy, considering problem difficulty, model type, and compute budget, providing a practical guide to effective inference-time scaling.
AlignSAE: Concept-Aligned Sparse Autoencoders
Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a defined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable "concept swaps", by targeting single, semantically aligned slots.
comment: 20 pages, 7 figures, 5 tables
LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open and closed source models by playing against a random opponent using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our experiments reveal a clear separation between reasoning and non-reasoning models. However, unlike existing static benchmarks, the stochastic and dynamic nature of LLM CHESS uniquely reduces overfitting and memorization while preventing benchmark saturation, proving difficult even for top reasoning models. To support future work on evaluating reasoning and instruction-following in LLMs, we release our experimental framework, a public leaderboard, and a dataset of associated games.
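As a rough illustration of how an Elo estimate can be read off games against an engine of known strength, the sketch below uses the standard Elo expected-score formula and its inverse (a "performance rating"). This is not necessarily the procedure used by the LLM CHESS authors; the function names and the example rating of 1500 are hypothetical.

```python
import math


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def performance_rating(opponent_rating: float, score_fraction: float) -> float:
    """Invert the Elo formula: the rating at which the expected score
    against `opponent_rating` equals the observed `score_fraction`.

    `score_fraction` counts a win as 1, a draw as 0.5, a loss as 0,
    averaged over all games. It must lie strictly between 0 and 1,
    because a perfect or zero score has no finite Elo equivalent.
    """
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score_fraction must be strictly between 0 and 1")
    return opponent_rating + 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


# Hypothetical example: a model scores 75% against an engine rated 1500.
estimate = performance_rating(1500.0, 0.75)
```

With several engine skill levels, one such estimate per level could be averaged or fitted jointly; that choice is left open here.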
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and with ambiguity in real-world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain-of-Ground (CoG), a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of predicting directly, the model progressively reflects on and adjusts its hypotheses, leading to more accurate and interpretable localization. Our approach achieves an accuracy of 68.4 on the ScreenSpot Pro benchmark, an improvement of 4.8 points. To measure real-world generalization, we introduce TPanel UI, a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI, Chain-of-Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points, showing the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning
The mechanism by which RL contributes to reasoning capabilities -- whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors -- remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.
comment: Work in Progress. Code and data will be available at https://github.com/sitaocheng/from_atomic_to_composite
How Far Are We from Genuinely Useful Deep Research Agents?
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
comment: 34 pages
Agentic Policy Optimization via Instruction-Policy Co-Evolution
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
comment: 10 pages, 3 figures, 2 tables (18 pages including references and appendices)
Rectifying LLM Thought from Lens of Optimization
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
comment: Work in progress
Latent Debate: A Surrogate Framework for Interpreting LLM Thinking
Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike existing work on self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model whose predictions are highly consistent with those of the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns; for example, a high degree of latent debate in the middle layers is linked to a higher risk of hallucinations. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.
comment: Preprint
OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation
Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises. While large language models have made automated report generation technically feasible, systematic research in this specific area remains notably absent, particularly lacking formal task definitions and corresponding benchmarks. To bridge this gap, we define the Automated Online Public Opinion Report Generation (OPOR-GEN) task and construct OPOR-BENCH, an event-centric dataset covering 463 crisis events with their corresponding news articles, social media posts, and a reference summary. To evaluate report quality, we propose OPOR-EVAL, a novel agent-based framework that simulates human expert evaluation by analyzing generated reports in context. Experiments with frontier models demonstrate that our framework achieves high correlation with human judgments. Our comprehensive task definition, benchmark dataset, and evaluation framework provide a solid foundation for future research in this critical domain.
comment: 27 pages, accepted by CMC-Computers, Materials & Continua, 2025
Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models
With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is the 'aptitude' of these models for hallucinating and generating harmful content. Despite major efforts to implement guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-methods experiment to evaluate the responses of a mitigation strategy across multiple dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: harmful response plus its mitigation, and solely the mitigated response. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduce new metrics for training and evaluating mitigation strategies, as well as insights for human-AI evaluation studies.
comment: 16 pages, 2 figures, 6 tables. Under review for publication
Cross-Lingual Interleaving for Speech Language Models
Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.
BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages AACL 2025
Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, and Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to the (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation.
comment: Accepted at BHASHA Workshop @ IJCNLP/AACL 2025
Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability
Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
InnoGym: Benchmarking the Innovation Potential of AI Agents
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
comment: Work in progress
H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons
Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
comment: 20 pages, 4 figures
Reasoning About the Unsaid: Misinformation Detection with Omission-Aware Graph Inference AAAI 2026
This paper investigates the detection of misinformation, which deceives readers by explicitly fabricating misleading content or implicitly omitting important information necessary for informed judgment. While the former has been extensively studied, omission-based deception remains largely overlooked, even though it can subtly guide readers toward false conclusions under the illusion of completeness. To pioneer in this direction, this paper presents OmiGraph, the first omission-aware framework for misinformation detection. Specifically, OmiGraph constructs an omission-aware graph for the target news by utilizing a contextual environment that captures complementary perspectives of the same event, thereby surfacing potentially omitted contents. Based on this graph, omission-oriented relation modeling is then proposed to identify the internal contextual dependencies, as well as the dynamic omission intents, formulating a comprehensive omission relation representation. Finally, to extract omission patterns for detection, OmiGraph introduces omission-aware message-passing and aggregation that establishes holistic deception perception by integrating the omission contents and relations. Experiments show that, by considering the omission perspective, our approach attains remarkable performance, achieving average improvements of +5.4% F1 and +5.3% ACC on two large-scale benchmarks.
comment: AAAI 2026
Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce MuSoBench, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
Self-Supervised Borrowing Detection on Multilingual Wordlists
This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct their own studies.
comment: 29 pages, 3 figures, 12 tables
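For readers unfamiliar with the NED baseline named in the abstract above, the following is a minimal sketch of normalized edit distance between two segmented word forms. It assumes normalization by the length of the longer sequence, which is one common convention; the paper's exact variant, its segmentation, and the function names here are assumptions.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, seg_a in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, seg_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                     # delete seg_a
                cur[j - 1] + 1,                  # insert seg_b
                prev[j - 1] + (seg_a != seg_b),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length; 0.0 for two empty forms."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest
```

Strings work directly as sequences of characters, and lists of phoneme segments work the same way. A low score between words with the same meaning in unrelated languages is the kind of signal a borrowing detector would threshold on.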
MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications
Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
comment: Project page: https://streamgaze.github.io/
Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems
Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model's competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel "Reverse Construction" method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA.
comment: preprint
HalluGraph: Auditable Hallucination Detection for Legal RAG Systems via Knowledge Graph Alignment
Legal AI systems powered by retrieval-augmented generation (RAG) face a critical accountability challenge: when an AI assistant cites case law, statutes, or contractual clauses, practitioners need verifiable guarantees that generated text faithfully represents source documents. Existing hallucination detectors rely on semantic similarity metrics that tolerate entity substitutions, a dangerous failure mode when confusing parties, dates, or legal provisions can have material consequences. We introduce HalluGraph, a graph-theoretic framework that quantifies hallucinations through structural alignment between knowledge graphs extracted from context, query, and response. Our approach produces bounded, interpretable metrics decomposed into Entity Grounding (EG), measuring whether entities in the response appear in source documents, and Relation Preservation (RP), verifying that asserted relationships are supported by context. On structured control documents ($>$400 words, $>$20 entities), HalluGraph achieves near-perfect discrimination ($AUC = 0.979$), while maintaining robust performance ($AUC \approx 0.89$) on challenging generative legal tasks, consistently outperforming semantic similarity baselines. The framework provides the transparency and traceability required for high-stakes legal applications, enabling full audit trails from generated assertions back to source passages.
comment: 8 pages, 4 figures, under review
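To make the two HalluGraph metrics concrete, here is a minimal sketch that treats Entity Grounding and Relation Preservation as simple set-overlap ratios. The abstract does not give the exact definitions, so the normalization, the case-insensitive matching, the triple format, and the convention of returning 1.0 for an empty response are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object), hypothetical format


def _norm(text: str) -> str:
    """Case-insensitive, whitespace-trimmed matching (an assumed choice)."""
    return text.strip().lower()


def entity_grounding(response_entities: Iterable[str],
                     source_entities: Iterable[str]) -> float:
    """Fraction of distinct response entities that also occur in the sources."""
    resp = {_norm(e) for e in response_entities}
    src = {_norm(e) for e in source_entities}
    if not resp:
        return 1.0  # nothing asserted, so nothing ungrounded (assumption)
    return len(resp & src) / len(resp)


def relation_preservation(response_triples: Iterable[Triple],
                          context_triples: Iterable[Triple]) -> float:
    """Fraction of distinct response triples that appear in the context graph."""
    resp = {tuple(_norm(x) for x in t) for t in response_triples}
    ctx = {tuple(_norm(x) for x in t) for t in context_triples}
    if not resp:
        return 1.0  # same empty-response convention as above (assumption)
    return len(resp & ctx) / len(resp)
```

Both scores are bounded in [0, 1], and every unmatched entity or triple can be listed explicitly, which is what makes this style of metric auditable. A substituted party name lowers the entity score even when the sentence stays semantically close to the source.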
MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark
Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code (https://github.com/Gatsby-web/MAC_SLU) and datasets (huggingface.co/datasets/Gatsby1984/MAC_SLU) are released publicly.
Language Diversity: Evaluating Language Usage and AI Performance on African Languages in Digital Spaces
This study examines the digital representation of African languages and the challenges this presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This lack of readily available authentic data online makes conversational data for training language models scarce. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language data, which also prompted more user engagement in the local language on the news publishers' social media pages. Language detection models, including the specialized AfroLID and a general LLM, performed with near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source for training context-rich AI models for African languages than data from conversational platforms. It also highlights the need for future models that can process both clean and code-switched text to improve the detection accuracy for African languages.
LEC: Linear Expectation Constraints for False-Discovery Control in Selective Prediction and Routing Systems
Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. Then, we establish a finite-sample sufficient condition, which relies only on a held-out set of exchangeable calibration samples, to compute an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model's uncertainty exceeds its calibrated threshold, we delegate it to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than each individual model.
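The thresholding idea in the LEC abstract above can be illustrated with a minimal sketch. It uses only the plain empirical form of the linear constraint, sum(err_i - alpha) <= 0 over accepted calibration samples, and does not implement the paper's finite-sample sufficient condition. The function name `calibrate_threshold` and the toy data are assumptions made for illustration.

```python
def calibrate_threshold(uncertainties, errors, alpha):
    """Pick the largest uncertainty threshold whose accepted calibration set
    satisfies the empirical constraint sum(err_i - alpha) <= 0.

    Hypothetical helper: a sample is accepted when its uncertainty is <= the
    threshold. Returns None when no threshold satisfies the constraint.
    This is NOT the finite-sample guarantee described in the abstract.
    """
    pairs = sorted(zip(uncertainties, errors))
    best = None
    n_err = 0
    for k, (u, e) in enumerate(pairs, start=1):
        n_err += e
        # Evaluate only at the last sample of a tie group, so that the
        # threshold u accepts exactly the first k samples.
        if k < len(pairs) and pairs[k][0] == u:
            continue
        # Linear form of "error fraction among accepted <= alpha".
        if n_err - alpha * k <= 0:
            best = u
    return best


# Toy calibration set: error indicator is 1 when the answer was wrong.
unc = [0.1, 0.2, 0.3, 0.4, 0.5]
err = [0, 0, 1, 0, 1]
threshold = calibrate_threshold(unc, err, alpha=0.25)
```

Scanning every candidate threshold, rather than stopping at the first violation, lets the sketch recover a larger accepted set when the error fraction dips back under the target, which is in the spirit of the coverage-maximizing threshold the abstract mentions.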
LPCD: Unified Framework from Layer-Wise to Submodule Quantization
Post-training quantization (PTQ) aims to preserve model-level behavior; however, most methods focus on individual linear layers. Even recent extensions, such as QEP and LoaQ, which mitigate error propagation or target specific submodules, still rely on layer-wise formulations and fail to capture the behavior of larger submodules. We introduce Layer-Projected Coordinate Descent (LPCD), a unified framework that extends PTQ beyond layers by optimizing relaxed objectives across arbitrary submodules and projecting the solutions with layer-wise quantizers. LPCD generalizes existing methods and provides a principled approach to quantizing complex submodules while maintaining the efficiency and compatibility of layer-wise PTQ pipelines. Across diverse LLM architectures and bit-widths, LPCD-based submodule quantization consistently enhances both layer-wise PTQ methods and existing submodule approaches.
comment: 21 pages, 4 figures
MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances batch inference efficiency. This is achieved with only ~100M trainable parameters and by using only 10 hours of S2TT data per language. Furthermore, we have released MCAT as open-source to promote the development of MLLMs for robust S2TT capabilities. The code and models are released at https://github.com/yxduir/m2m-70.
Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages
Limited data for low-resource languages typically yields weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies, which we term 'Active Learning schedulers', to boost the fine-tuning process with a limited amount of training data. We connect AL to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, clustering, and dynamic data selection schedulers to enhance the model's performance. Experiments in the Slovak, Maltese, Icelandic and Turkish languages show that the use of clustering during the fine-tuning phase together with AL scheduling can simultaneously produce annotation savings of up to 30% and performance improvements of up to four F1 score points, while also providing better fine-tuning stability.
ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation
Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility, which is the linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
comment: Code coming soon
MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification NeurIPS 2025
We present Conformer-based decoders for the LibriBrain 2025 PNPL competition, targeting two foundational MEG tasks: Speech Detection and Phoneme Classification. Our approach adapts a compact Conformer to raw 306-channel MEG signals, with a lightweight convolutional projection layer and task-specific heads. For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, surpassing the competition baselines and ranking within the top-10 in both tasks. For further implementation details, the technical documentation, source code, and checkpoints are available at https://github.com/neural2speech/libribrain-experiments.
comment: 10 pages, 5 figures, 4 tables, LibriBrain Workshop, NeurIPS 2025
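The inverse-square-root class weighting mentioned in the MEGConformer abstract can be sketched as follows. This is only an illustration under stated assumptions: the helper name `inv_sqrt_class_weights` is hypothetical, and rescaling the weights to a mean of 1 is a choice made here, not something the abstract specifies.

```python
import math
from collections import Counter


def inv_sqrt_class_weights(labels):
    """Weight each class by 1 / sqrt(class count), rescaled to mean 1.

    The mean-1 rescaling is an assumption for this sketch; the source
    abstract only names the inverse-square-root scheme.
    """
    counts = Counter(labels)
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


# Toy imbalanced label list: class "a" is 25 times more frequent than "b".
weights = inv_sqrt_class_weights(["a"] * 100 + ["b"] * 4)
```

Compared with plain inverse-frequency weighting, the square root dampens the correction: a class that is 25 times rarer receives 5 times the weight rather than 25 times.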
Multilingual Conversational AI for Financial Assistance: Bridging Language Barriers in Indian FinTech
India's linguistic diversity presents both opportunities and challenges for fintech platforms. While the country has 31 major languages and over 100 minor ones, only 10% of the population understands English, creating barriers to financial inclusion. We present a multilingual conversational AI system for a financial assistance use case that supports code-mixed languages like Hinglish, enabling natural interactions for India's diverse user base. Our system employs a multi-agent architecture with language classification, function management, and multilingual response generation. Through comparative analysis of multiple language models and real-world deployment, we demonstrate significant improvements in user engagement while maintaining low latency overhead (4-8%). This work contributes to bridging the language gap in digital financial services for emerging markets.
PromptBridge: Cross-Model Prompt Transfer for Large Language Models
Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.
DyFuLM: An Advanced Multimodal Framework for Sentiment Analysis
Understanding sentiment in complex textual expressions remains a fundamental challenge in affective computing. To address this, we propose a Dynamic Fusion Learning Model (DyFuLM), a multimodal framework designed to capture both hierarchical semantic representations and fine-grained emotional nuances. DyFuLM introduces two key modules: a Hierarchical Dynamic Fusion module that adaptively integrates multi-level features, and a Gated Feature Aggregation module that regulates cross-layer information flow to achieve balanced representation learning. Comprehensive experiments on multi-task sentiment datasets demonstrate that DyFuLM achieves 82.64% coarse-grained and 68.48% fine-grained accuracy, yielding the lowest regression errors (MAE = 0.0674, MSE = 0.0082) and the highest coefficient of determination (R^2 = 0.6903). Furthermore, the ablation study validates the effectiveness of each module in DyFuLM. When all modules are removed, the accuracy drops by 0.91% for coarse-grained and 0.68% for fine-grained tasks. Keeping only the gated fusion module causes decreases of 0.75% and 0.55%, while removing the dynamic loss mechanism results in drops of 0.78% and 0.26% for coarse-grained and fine-grained sentiment classification, respectively. These results demonstrate that each module contributes significantly to feature interaction and task balance. Overall, the experimental findings further validate that DyFuLM enhances sentiment representation and overall performance through effective hierarchical feature fusion.
comment: 8 pages, 6 figures, preprint. Under review for a suitable AI conference
BackportBench: A Multilingual Benchmark for Automated Backporting of Patches
Many modern software projects evolve rapidly to incorporate new features and security patches. It is important for users to update their dependencies to safer versions, but many still use older, vulnerable package versions because upgrading can be difficult and may break their existing codebase. Software developers can mitigate this problem by backporting security patches to older releases. However, manual backporting is time-consuming and error-prone. The effectiveness of existing automated backporting techniques on general software remains unclear since they typically target only code-hunk or function-level patch porting scenarios and are evaluated with imperfect metrics. To facilitate the development and evaluation of automated backporting techniques, we introduce BackportBench, the first comprehensive benchmark suite for the patch backporting problem. BackportBench is a multilingual benchmark that contains 202 patch backporting problems from PyPI, Maven, and npm, each with executable Docker environments and relevant test cases. We evaluated existing patch porting methods and LLM-based techniques that have the potential to adapt to this task using BackportBench. The results show that the agentic method has outperformed traditional patch porting methods, especially on cases that require logical and structural changes. However, the performance varies across different programming languages. Based on the findings, we draw several implications for researchers and software practitioners in future work on automated backporting.
comment: Under review
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
MARSAD: A Multi-Functional Tool for Real-Time Social Media Analysis
MARSAD is a multifunctional natural language processing (NLP) platform designed for real-time social media monitoring and analysis, with a particular focus on the Arabic-speaking world. It enables researchers and non-technical users alike to examine both live and archived social media content, producing detailed visualizations and reports across various dimensions, including sentiment analysis, emotion analysis, propaganda detection, fact-checking, and hate speech detection. The platform also provides secure data-scraping capabilities through API keys for accessing public social media data. MARSAD's backend architecture integrates flexible document storage with structured data management, ensuring efficient processing of large and multimodal datasets. Its user-friendly frontend supports seamless data upload and interaction.
comment: 6 pages, 4 figures
The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
comment: 38 pages, 5 figures, 30 tables. This paper proposes the Prompt-driven Cognitive Computing Framework (PMCSF) and validates it with A-share market stress tests (N=23 for 2015 crash, N=13 for 2024 bull market). Includes detailed appendices on cognitive vector definitions, perturbation operators, and financial backtest data
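For readers unfamiliar with the Jensen-Shannon divergence reported in the abstract above, here is a minimal sketch of how it is computed between two discrete distributions. The abstract does not state which distributions or logarithm base the authors use; base 2 (which bounds the value to [0, 1]) and the function names `kl` and `js_divergence` are assumptions made for this example.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    p and q are sequences of probabilities over the same support.
    The mixture m is strictly positive wherever p or q is, so kl() is safe.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


same = js_divergence([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # no shared support
```

Under the base-2 convention assumed here, identical distributions give 0 and distributions with disjoint support give 1, so a value such as 0.0614 sits near the "very similar" end of the scale.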
EmoRAG: Evaluating RAG Robustness to Symbolic Perturbations KDD
Retrieval-Augmented Generation (RAG) systems are increasingly central to robust AI, enhancing large language model (LLM) faithfulness by incorporating external knowledge. However, our study unveils a critical, overlooked vulnerability: their profound susceptibility to subtle symbolic perturbations, particularly through near-imperceptible emoticon tokens such as "(@_@)" that can catastrophically mislead retrieval, termed EmoRAG. We demonstrate that injecting a single emoticon into a query makes it nearly 100% likely to retrieve semantically unrelated texts that contain a matching emoticon. Our extensive experiment across general question-answering and code domains, using a range of state-of-the-art retrievers and generators, reveals three key findings: (I) Single-Emoticon Disaster: Minimal emoticon injections cause maximal disruptions, with a single emoticon almost 100% dominating RAG output. (II) Positional Sensitivity: Placing an emoticon at the beginning of a query can cause severe perturbation, with F1-Scores exceeding 0.92 across all datasets. (III) Parameter-Scale Vulnerability: Counterintuitively, models with larger parameters exhibit greater vulnerability to the interference. We provide an in-depth analysis to uncover the underlying mechanisms of these phenomena. Furthermore, we raise a critical concern regarding the robustness assumption of current RAG systems, envisioning a threat scenario where an adversary exploits this vulnerability to manipulate the RAG system. We evaluate standard defenses and find them insufficient against EmoRAG. To address this, we propose targeted defenses, analyzing their strengths and limitations in mitigating emoticon-based perturbations. Finally, we outline future directions for building robust RAG systems.
comment: Accepted to ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2026
Securing Large Language Models (LLMs) from Prompt Injection Attacks
Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks. These attacks leverage the model's instruction-following ability to make it perform malicious tasks. Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function, thereby reducing susceptibility to adversarial instructions. In this study, we evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimizes adversarial prompts. We adapt HOUYI by introducing custom fitness scoring, modified mutation logic, and a new harness for local model testing, enabling a more accurate assessment of defense effectiveness. We fine-tuned LLaMA 2-7B, Qwen1.5-4B, and Qwen1.5-0.5B models under the JATMO methodology and compared them with a fine-tuned GPT-3.5-Turbo baseline. Results show that while JATMO reduces attack success rates relative to instruction-tuned models, it does not fully prevent injections; adversaries exploiting multilingual cues or code-related disruptors still bypass defenses. We also observe a trade-off between generation quality and injection vulnerability, suggesting that better task performance often correlates with increased susceptibility. Our results highlight both the promise and limitations of fine-tuning-based defenses and point toward the need for layered, adversarially informed mitigation strategies.
comment: 10 pages, 1 figure, 1 table
Agreement-Constrained Probabilistic Minimum Bayes Risk Decoding AACL 2025
Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge-distilled model to guide the completion of the score matrix. Our AC-PMBR decoding improved approximation errors of matrix completion by up to 3 times and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT'23 En$\leftrightarrow$De translation tasks.
comment: IJCNLP-AACL 2025 Main
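The quadratic cost of standard MBR decoding described in the abstract above can be seen in a minimal sketch. This shows only plain MBR, not PMBR or AC-PMBR. The token-overlap utility `jaccard` is a toy stand-in chosen for this example (real systems use a translation quality metric), and `mbr_decode` is a hypothetical helper name.

```python
def jaccard(a, b):
    """Toy utility (an assumption for this sketch): token-set overlap."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mbr_decode(candidates, utility=jaccard):
    """Return (best candidate, number of utility calls).

    Each candidate is scored by its average utility against every candidate,
    treating the candidate set as pseudo-references. This takes
    len(candidates) ** 2 utility calls.
    """
    calls = 0
    best, best_score = None, float("-inf")
    for hyp in candidates:
        total = 0.0
        for ref in candidates:
            total += utility(hyp, ref)
            calls += 1
        score = total / len(candidates)
        if score > best_score:
            best, best_score = hyp, score
    return best, calls


best, calls = mbr_decode(["the cat sat", "the cat sat down", "the cat ran"])
```

With three candidates the full score matrix needs nine utility calls. Evaluating only a sampled subset of those pairs and filling in the rest is the cost reduction that the probabilistic variants in the abstract aim for.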
Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning
As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms other methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
comment: 35 pages, 11 figures, 5 tables
Sentiment Analysis and Emotion Classification using Machine Learning Techniques for Nagamese Language - A Low-resource Language
The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified creole language developed primarily as a means of communication in trade between the people of Nagaland and the people of Assam in north-east India. A substantial amount of work in sentiment analysis has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in the Nagamese language. To the best of our knowledge, this is the first attempt at sentiment analysis and emotion classification for the Nagamese language. The aim of this work is to detect sentiments in terms of polarity (positive, negative and neutral) and basic emotions contained in the textual content of the Nagamese language. We build a sentiment polarity lexicon of 1,195 Nagamese words and use these to build features along with additional features for supervised machine learning techniques using Naïve Bayes and Support Vector Machines. Keywords: Nagamese, NLP, sentiment analysis, machine learning
comment: 10 pages
Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation
Researchers have proposed numerous methods to detect vulnerabilities in JavaScript, especially those assisted by Large Language Models (LLMs). However, the actual capability of LLMs in JavaScript vulnerability detection remains questionable, necessitating systematic evaluation and comprehensive benchmarks. Unfortunately, existing benchmarks suffer from three critical limitations: (1) incomplete coverage, such as covering a limited subset of CWE types; (2) underestimation of LLM capabilities caused by unreasonable ground truth labeling; and (3) overestimation due to unrealistic cases such as using isolated vulnerable files rather than complete projects. In this paper, we introduce, for the first time, three principles for constructing a benchmark for JavaScript vulnerability detection that directly address these limitations: (1) comprehensiveness, (2) no underestimation, and (3) no overestimation. Guided by these principles, we propose FORGEJS, the first automatic benchmark generation framework for evaluating LLMs' capability in JavaScript vulnerability detection. Then, we use FORGEJS to construct ARENAJS, the first systematic benchmark for LLM-based JavaScript vulnerability detection, and further propose JUDGEJS, an automatic evaluation framework. We conduct the first systematic evaluation of LLMs for JavaScript vulnerability detection, leveraging JUDGEJS to assess seven popular commercial LLMs on ARENAJS. The results show that LLMs not only exhibit limited reasoning capabilities, but also suffer from severe robustness defects, indicating that reliable JavaScript vulnerability detection with LLMs remains an open challenge.
Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation
We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). Such compositions are largely invariant to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
Pay Attention Later: From Vector Space Diffusion to Linearithmic Spectral Phase-Locking
Standard Transformers suffer from a "Semantic Alignment Tax", a prohibitive optimization cost required to organize a chaotic initialization into a coherent geometric map via local gradient diffusion. We hypothesize that this reliance on diffusive learning creates "Catastrophic Rigidity", rendering models unable to adapt to novel concepts without destroying their pre-trained reasoning capabilities. To isolate this phenomenon, we introduce Iterative Semantic Map Refinement (ISMR), a diagnostic protocol revealing that alignment is a fixed geometric barrier that scaling cannot solve; a 20-layer model overcomes this barrier no faster than a 1-layer model. We introduce the Phase-Resonant Intelligent Spectral Model (PRISM). PRISM encodes semantic identity as resonant frequencies in the complex domain (C^d) and replaces quadratic self-attention with linearithmic O(N log N) Gated Harmonic Convolutions. We validate PRISM on the WMT14 translation task. While the Standard Transformer maintains a slight edge in general competence on static benchmarks (23.88 vs 21.40 BLEU), it fails the "Plasticity-Stability" stress test completely. When injected with novel concepts, the Transformer suffers Catastrophic Forgetting, degrading by -10.55 BLEU points while achieving only 60% acquisition. In contrast, PRISM demonstrates Lossless Plasticity, achieving 96% 5-shot acquisition with negligible degradation (-0.84 BLEU). These results suggest that harmonic representations effectively decouple memory from reasoning, offering a structural solution to the plasticity-stability dilemma in real-time knowledge adaptation.
comment: 12 pages, 5 figures
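To make the linearithmic claim concrete, the sketch below computes a circular convolution through a radix-2 FFT, which costs O(N log N) instead of the O(N^2) of pairwise mixing. It is a generic illustration only: the abstract does not specify how the Gated Harmonic Convolutions are built, and the names `fft` and `circular_convolve` are hypothetical helpers, not part of PRISM.

```python
import cmath


def fft(x, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1  # the inverse transform flips the exponent sign
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences in O(N log N).

    Returns complex values; imaginary parts are ~0 for real inputs.
    """
    n = len(a)
    if n != len(b) or (n & (n - 1)) != 0:
        raise ValueError("inputs must share a power-of-two length")
    # Pointwise product in the frequency domain equals convolution in time.
    product = [fa * fb for fa, fb in zip(fft(a), fft(b))]
    # The unnormalized inverse transform needs a final division by n.
    return [value / n for value in fft(product, invert=True)]


if __name__ == "__main__":
    # Convolving with a one-step delay kernel rotates the signal by one slot.
    print([round(v.real, 6) for v in circular_convolve([1, 2, 3, 4], [0, 1, 0, 0])])
```

The two forward transforms and one inverse transform dominate the cost, so sequence mixing scales as N log N in the sequence length.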
Conveying Imagistic Thinking in Traditional Chinese Medicine Translation: A Prompt Engineering and LLM-Based Evaluation Framework
Traditional Chinese Medicine (TCM) theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop (HITL) framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences. This study provides a cognitive, efficient, and replicable HITL methodological pathway for the translation of ancient, concept-dense texts such as TCM.
comment: 3 figures
Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks
Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
comment: 17 pages, 4 figures (2 regular, 2 supplemental)
TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness
The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks AAAI 2026
As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity -- models achieved 100% perfect performance when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: https://github.com/hyunjun1121/DrawingBench
comment: AAAI 2026 TrustAgent Workshop
CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering
Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured data (for example, heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts -- farmer advisors -- to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.
comment: Short Paper; Under Review at The WebConf 2026 (single-blind submission)
DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models
Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.
Lightweight Latent Reasoning for Narrative Tasks
Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.
Swivuriso: The South African Next Voices Multilingual Speech Dataset
This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.
comment: Work in Progress
A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation
This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule-based procedures, the result of language acquisition is a knowledge-based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well-established for human-generated data, are also found for these machine-generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.
comment: 23 pages, 7 figures, 11 tables. Related work: arXiv:2503.18702. This is the peer-reviewed publisher's version, downloadable from: https://www.clinjournal.org/clinj/article/view/193
Story2MIDI: Emotionally Aligned Music Generation from Text
In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. To develop this model, we construct the Story2MIDI dataset by merging existing datasets for sentiment analysis from text and emotion classification in music. The resulting dataset contains pairs of text blurbs and music pieces that evoke the same emotions in the reader or listener. Despite the small scale of our dataset and limited computational resources, our results indicate that our model effectively learns emotion-relevant features in music and incorporates them into its generation process, producing samples with diverse emotional responses. We evaluate the generated outputs using objective musical metrics and a human listening study, confirming the model's ability to capture intended emotional cues.
comment: 8 pages (6 pages of main text + 2 pages of references and appendices), 4 figures, 1 table. Presented at IEEE Big Data 2025 3rd Workshop on AI Music Generation (AIMG 2025)
Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
comment: 7 pages, 3 figures
Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI
The field of AI Control seeks to develop robust control protocols, deployment safeguards for untrusted AI which may be intentionally subversive. However, existing protocols that rely on weaker monitors to detect unsafe behavior often fail on complex tasks beyond the monitor's comprehension. We develop control protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution. These protocols may improve safety by several means, such as by simplifying the context for monitors, or by obscuring vulnerabilities in the environment. We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) Adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) Safety improves because monitor performance improves; (iii) Factored cognition makes it no harder for capable LLMs to write backdoors in APPS. While our protocols show low usefulness in APPS, they hold promise for more complex tasks.
comment: 7 pages body; 19 pages total; 8 figures
Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation
Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In this work, we present a data-efficient strategy for fine-tuning BERT on hate speech classification by significantly reducing training set size without compromising performance. Our approach employs a TF-IDF-based sample selection mechanism to retain only the most informative 75 percent of examples, thereby minimizing training overhead. To address the limitations of BERT's native vocabulary in capturing evolving hate speech terminology, we augment the tokenizer with domain-specific slang and lexical variants commonly found in abusive contexts. Experimental results on a widely used hate speech dataset demonstrate that our method achieves competitive performance while improving computational efficiency, highlighting its potential for scalable and adaptive abusive content moderation.
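One way the sample selection step could look is sketched below. The abstract does not say how informativeness is derived from TF-IDF, so the scoring rule here (sum of smoothed TF-IDF weights over a document's distinct tokens), the whitespace tokenizer, and the names `tfidf_scores` and `select_top_fraction` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Score each document by the sum of TF-IDF weights of its distinct tokens.

    Assumption: whitespace tokenization and a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each token once per document
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        counts = Counter(tokens)
        score = sum(
            (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        )
        scores.append(score)
    return scores


def select_top_fraction(docs, fraction=0.75):
    """Keep the highest-scoring fraction of documents, preserving input order."""
    scores = tfidf_scores(docs)
    keep = max(1, math.ceil(fraction * len(docs)))
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in sorted(ranked[:keep])]


if __name__ == "__main__":
    corpus = [
        "the the the the",
        "the cat sat",
        "the dog barked loudly",
        "rare slang term",
    ]
    print(select_top_fraction(corpus, fraction=0.75))
```

Under this scoring rule, documents made up of tokens that appear everywhere rank lowest and are dropped first, which is the intuition behind keeping only the most informative examples.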
Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism
We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models "know" altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen's d = 1.08). This "virtue signaling gap" affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.
comment: 14 pages, 7 figures, 7 tables. Code and data available at https://github.com/sandroandric/LLMs_Altruism_Study_Code
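The Calibration Gap can be illustrated with the aggregate figures quoted in the abstract. The sketch assumes the gap is the signed difference, in percentage points, between self-reported and behavioral altruism rates; the paper's exact definition may differ, and `calibration_gap` is a hypothetical helper name.

```python
def calibration_gap(self_reported_pct, behavioral_pct):
    """Signed gap in percentage points; a positive value means the model overclaims."""
    return self_reported_pct - behavioral_pct


if __name__ == "__main__":
    # Aggregate figures from the abstract: 77.5% claimed vs. 65.6% observed.
    gap = calibration_gap(77.5, 65.6)
    print(f"Calibration gap: {gap:.1f} percentage points")
```

With the reported means, the gap is about 11.9 percentage points in the overclaiming direction.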
The AI Productivity Index (APEX)
We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
SpikingBrain: Spiking Brain-inspired Large Models
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
Med-gte-hybrid: A contextual embedding transformer model for extracting actionable information from clinical texts
We introduce a novel contextual embedding model med-gte-hybrid that was derived from the gte-large sentence transformer to extract information from unstructured clinical narratives. Our model tuning strategy for med-gte-hybrid combines contrastive learning and a denoising autoencoder. To evaluate the performance of med-gte-hybrid, we investigate several clinical prediction tasks in large patient cohorts extracted from the MIMIC-IV dataset, including Chronic Kidney Disease (CKD) patient prognosis, estimated glomerular filtration rate (eGFR) prediction, and patient mortality prediction. Furthermore, we demonstrate that the med-gte-hybrid model improves patient stratification, clustering, and text retrieval, and thus outperforms current state-of-the-art models on the Massive Text Embedding Benchmark (MTEB). While some of our evaluations focus on CKD, our hybrid tuning of sentence transformers could be transferred to other medical domains and has the potential to improve clinical decision-making and personalised treatment pathways in various healthcare applications.
comment: 22 pages, 4 figures, 2 tables
Reliable Reasoning Beyond Natural Language
Despite their linguistic competence, Large Language Models (LLMs) often struggle to reason reliably and flexibly. To identify these shortcomings, we introduce the Non-Linear Reasoning (NLR) dataset, a collection of 55 unique, hand-designed problems that target reasoning bottlenecks arising from the sequential prediction paradigm of LLMs and the inherently linear nature of natural language. NLR tasks require iterative updates, backtracking, and reasoning across multiple parallel chains of thought but only basic arithmetic to solve. To address these limitations, we propose a neurosymbolic reasoning approach that integrates Prolog, a symbolic reasoning engine, into the inference pipeline of LLMs. This division of labor shifts the LLM's task from iterative computations to inferring all information, explicit or implied through common sense, and encoding it as logical code. Our method yields large and robust performance gains across the GSM8k and BIG-bench Navigate benchmarks and achieves near-perfect accuracy on NLR problems, maintaining robustness even as variable interdependence - the number of other variables on which the value of a single variable depends - increases.
LLMs can hide text in other text of the same length
A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present Calgacus, a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
comment: 21 pages, main paper 9 pages
Influence Functions for Efficient Data Selection in Reasoning
Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on indirect heuristics such as problem difficulty or trace length, while instruction-tuning has explored a broader range of automated selection strategies, but rarely in the context of reasoning. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy, and introduce influence-based pruning, which consistently outperforms perplexity and embedding-based baselines on math reasoning within a model family.
comment: 4 pages, 2 figures; added link to codebase
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with performance improving from single-digit to over 95% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
Adversarial Confusion Attack: Disrupting Multimodal Large Language Models
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Practical applications include embedding such adversarial images into websites to prevent MLLM-powered AI Agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and Adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
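The objective named in the abstract, next-token entropy averaged over an ensemble, can be written down in a few lines. The sketch below only evaluates that quantity from raw logits; it does not perform the PGD image optimization or call any real MLLM, and averaging across models is an assumption about how the ensemble is combined. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def ensemble_entropy(logits_per_model):
    """Mean next-token entropy across models (assumed aggregation rule)."""
    entropies = [next_token_entropy(logits) for logits in logits_per_model]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]  # one token dominates: low entropy
    confused = [0.0, 0.0, 0.0, 0.0]    # uniform over tokens: maximal entropy
    print(next_token_entropy(confident), next_token_entropy(confused))
    print(ensemble_entropy([confident, confused]))
```

An attack of this kind would adjust the input image so that this value rises; a uniform distribution over a vocabulary of size V gives the upper bound ln V.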
RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users SC
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
comment: Project Website: https://scai.cs.jhu.edu/projects/RealWebAssist/ Code: https://github.com/SCAI-JHU/RealWebAssist
DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering
In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after generating the final answers but fail to evaluate the process for multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. Implicit PRM, by contrast, is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, making it more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRM has only been explored for plain text scenarios. When adapted to MHQA tasks, it can neither handle the graph structure constraints in KGs nor capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Among them, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, making the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.
SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPAs performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
comment: short version
Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents
Claim verification is essential for digital literacy, yet state-of-the-art single-agent methods often struggle with complex claims that require nuanced analysis of multifaceted online evidence. Inspired by real-world human fact-checking practices, we propose DebateCV, the first debate-driven claim verification framework powered by multiple LLM agents. In DebateCV, two Debaters argue opposing stances over multiple rounds to surface subtle errors in single-agent assessments. A decisive Moderator is then required to weigh the evidential strength of conflicting arguments to deliver an accurate verdict. Yet zero-shot agents struggle to adjudicate multi-round debates for verifying complex claims, often defaulting to neutral judgements, and no datasets exist for training agents for this role. To bridge this gap, we propose Debate-SFT, a post-training framework that leverages synthetic data to enhance agents' ability to effectively adjudicate debates for claim verification. Results show that our methods surpass state-of-the-art non-debate approaches in both accuracy (across various evidence conditions) and justification quality, which strengthens societal resilience against misinformation and contributes to a more trustworthy online information ecosystem.
Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages EMNLP 2025
Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.
comment: EMNLP 2025 (Findings), 26 pages
CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation AAAI 2026
As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.
comment: Accepted in AAAI 2026
Soft Adaptive Policy Optimization
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance -- a phenomenon exacerbated in Mixture-of-Experts models -- leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
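To make the contrast between hard clipping and a smooth gate concrete, the sketch below compares two per-token gradient weights as a function of the importance ratio. The abstract does not give SAPO's exact gate. The sigmoid-shaped weight, the temperature tau, and both function names are assumptions for illustration, and the hard-clip weight is simplified by ignoring the sign of the advantage.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_clip_weight(ratio, eps=0.2):
    """Simplified hard clipping: full gradient inside the band, none outside.

    Real clipped objectives also depend on the advantage sign; that detail
    is omitted here to keep the comparison one-dimensional.
    """
    return 1.0 if abs(ratio - 1.0) <= eps else 0.0


def soft_gate_weight(ratio, tau=5.0):
    """Assumed smooth gate: equals 1 on-policy and decays as the ratio drifts.

    tau plays the role of a temperature: larger tau narrows the region
    in which tokens keep most of their gradient.
    """
    s = sigmoid(tau * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)


if __name__ == "__main__":
    for ratio in (0.5, 0.9, 1.0, 1.1, 1.5, 2.0):
        print(f"ratio={ratio:.1f}  hard={hard_clip_weight(ratio):.2f}  "
              f"soft={soft_gate_weight(ratio):.2f}")
```

Under these assumptions a mildly off-policy token keeps a reduced but non-zero weight instead of being dropped at the band edge, which is the qualitative behavior the abstract describes.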
Human Decision-making is Susceptible to AI-driven Manipulation
AI systems are increasingly intertwined with daily life, assisting users with various tasks and guiding decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized between-subjects experiment with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) equipped with established psychological tactics, allowing it to select and apply them adaptively during interactions to reach its hidden objectives. By analyzing participants' preference ratings, we found significant susceptibility to AI-driven manipulation. Particularly across both decision-making domains, interacting with the manipulative agents significantly increased the odds of rating hidden incentives higher than optimal options (Financial, MA: OR=5.24, SEMA: OR=7.96; Emotional, MA: OR=5.52, SEMA: OR=5.71) compared to the NA group. Notably, we found no clear evidence that employing psychological strategies (SEMA) was overall more effective than simple manipulative objectives (MA) on our primary outcomes. Hence, AI-driven manipulation could become widespread even without requiring sophisticated tactics and expertise. While our findings are preliminary and derived from hypothetical, low-stakes scenarios, we highlight a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to protect human autonomy.
comment: Work in progress
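For readers less familiar with odds ratios, the sketch below shows how an OR is computed from a simple two-group table. The counts are hypothetical and chosen only to give a round number; the paper's reported ORs presumably come from a fitted statistical model, which this example does not reproduce.

```python
def odds_ratio(events_exposed, n_exposed, events_control, n_control):
    """Odds of the event in the exposed group divided by odds in the control group."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_control = events_control / (n_control - events_control)
    return odds_exposed / odds_control


if __name__ == "__main__":
    # Hypothetical counts: 40 of 80 participants in a manipulative-agent group
    # rate the hidden incentive above the optimal option, versus 10 of 60
    # participants in the neutral-agent group.
    print(odds_ratio(40, 80, 10, 60))
```

With these made-up counts the odds are 1.0 versus 0.2, so the odds ratio is about 5: the outcome is roughly five times as likely, in odds terms, in the exposed group.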
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (e.g., book corpus and web corpus) to generate multidisciplinary challenging questions. We introduce the concept of "design logic" and instruct LLMs to mimic human educators' question-creation process, enabling the automated synthesis of large-scale, high-difficulty questions. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with source documents, we are able to generate reasoning questions with controllable question types and difficulty levels. Using this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. We validate our synthesized data through supervised fine-tuning (SFT) on the Qwen3 and Llama3 model families. Our data substantially enhances their multidisciplinary reasoning capabilities, outperforming existing datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their official final models that have undergone the full post-training process.
A Method for Handling Negative Similarities in Explainable Graph Spectral Clustering of Text Documents -- Extended Version CCS
This paper investigates the problem of Graph Spectral Clustering (GSC) with negative similarities, resulting from document embeddings different from the traditional Term Vector Space (like doc2vec, GloVe, etc.). Solutions for combinatorial Laplacians and normalized Laplacians are discussed. An experimental investigation shows the advantages and disadvantages of 6 different solutions proposed in the literature and in this research. The research demonstrates that GloVe embeddings frequently cause failures of normalized Laplacian based GSC due to negative similarities. Furthermore, application of methods curing similarity negativity leads to accuracy improvement for both combinatorial and normalized Laplacian based GSC. It also makes explanation methods originally developed by the authors for Term Vector Space embeddings applicable to GloVe embeddings.
comment: 1 figure, 17 pages, this is an extended version of a paper accepted for the 25th International Conference on Computational Science (ICCS), 7-9 July 2025
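The sketch below illustrates why negative similarities can break normalized-Laplacian clustering: the normalization needs the inverse square root of each node's degree, which is undefined once a degree is zero or negative. The similarity matrix is hypothetical. The shift-based cure shown is just one simple option and is not claimed to be among the six solutions the paper evaluates.

```python
import math


def degrees(S):
    """Row sums of off-diagonal similarities."""
    return [sum(v for j, v in enumerate(row) if j != i) for i, row in enumerate(S)]


def normalized_laplacian(S):
    """I - D^(-1/2) S D^(-1/2), using off-diagonal similarities only."""
    d = degrees(S)
    if any(x <= 0 for x in d):
        raise ValueError("non-positive degree: D^(-1/2) is undefined")
    n = len(S)
    return [
        [1.0 if i == j else -S[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
        for i in range(n)
    ]


def shift_to_nonnegative(S):
    """Illustrative cure: add a constant so every off-diagonal entry is >= 0."""
    n = len(S)
    lowest = min(S[i][j] for i in range(n) for j in range(n) if i != j)
    shift = -lowest if lowest < 0 else 0.0
    return [[0.0 if i == j else S[i][j] + shift for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical cosine similarities between three documents.
    S = [[1.0, 0.6, -0.7], [0.6, 1.0, -0.2], [-0.7, -0.2, 1.0]]
    print("degrees:", degrees(S))
    try:
        normalized_laplacian(S)
    except ValueError as err:
        print("failed:", err)
    print("after shift:", degrees(shift_to_nonnegative(S)))
```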
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media AAAI-26
Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous "split-then-balance" pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard ("Social Media Screener") designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.
comment: Accepted for Oral Presentation at the AAAI-26 Bridge Program on AI for Medicine and Healthcare (AIMedHealth). To appear in Proceedings of Machine Learning Research (PMLR)
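The gap between the reported accuracy (0.92) and Macro F1 (0.76) is typical of evaluation on an imbalanced test set. The sketch below uses hypothetical two-class labels, not the paper's data, to show how a classifier can reach high accuracy while a rare class pulls the macro-averaged F1 down.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 90 common-class posts, 10 rare-class posts.
    y_true = ["common"] * 90 + ["rare"] * 10
    # The classifier finds only 2 of the 10 rare-class posts.
    y_pred = ["common"] * 98 + ["rare"] * 2
    print("accuracy:", accuracy(y_true, y_pred))
    print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```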
Recursive numeral systems are highly regular and easy to process
Previous work has argued that recursive numeral systems optimise the trade-off between lexicon size and average morphosyntactic complexity (Denić and Szymanik, 2024). However, showing that only natural-language-like systems optimise this trade-off has proven elusive, and the existing solution has relied on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Here, we argue that this issue arises because the proposed trade-off has neglected regularity, a crucial aspect of complexity central to human grammars in general. Drawing on the Minimum Description Length (MDL) approach, we propose that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and unattested but possible ones, including "optimal" recursive numeral systems from previous work, and that the ad-hoc constraints from previous literature naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies that attempt to measure and explain optimality in language.
BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations
We present a unified dataset for document Question-Answering (QA), which is obtained combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making it a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
PARROT: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs): the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" (≤ 11%, GPT-5: 4%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
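The abstract reports "follow rates" without defining them. A plausible reading, assumed here, is the share of questions on which the model's answer under pressure matches the false option asserted by the authority prompt. The record fields and the function name below are hypothetical and only sketch that reading.

```python
def follow_rate(records):
    """Fraction of items where the pressured answer equals the asserted false option.

    Each record is a dict with hypothetical keys:
      "asserted_answer": the incorrect option pushed by the authority prompt
      "pressured_answer": the option the model chose under that prompt
    """
    if not records:
        raise ValueError("no records to score")
    followed = sum(
        1 for r in records if r["pressured_answer"] == r["asserted_answer"]
    )
    return followed / len(records)


if __name__ == "__main__":
    records = [
        {"asserted_answer": "B", "pressured_answer": "A"},
        {"asserted_answer": "C", "pressured_answer": "C"},  # model followed the false claim
        {"asserted_answer": "D", "pressured_answer": "A"},
        {"asserted_answer": "B", "pressured_answer": "D"},
    ]
    print(follow_rate(records))
```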
LLM-based Human Simulations Have Not Yet Been Reliable
Large Language Models (LLMs) are increasingly employed for simulating human behaviors across diverse domains. However, our position is that current LLM-based human simulations remain insufficiently reliable, as evidenced by significant discrepancies between their outcomes and authentic human actions. Our investigation begins with a systematic review of LLM-based human simulations in social, economic, policy, and psychological contexts, identifying their common frameworks, recent advances, and persistent limitations. This review reveals that such discrepancies primarily stem from inherent limitations of LLMs and flaws in simulation design, both of which are examined in detail. Building on these insights, we propose a systematic solution framework that emphasizes enriching data foundations, advancing LLM capabilities, and ensuring robust simulation design to enhance reliability. Finally, we introduce a structured algorithm that operationalizes the proposed framework, aiming to guide credible and human-aligned LLM-based simulations. To facilitate further research, we provide a curated list of related literature and resources at https://github.com/Persdre/awesome-llm-human-simulation.
SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models ACL 2025
We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom's Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM's interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training. Our code and data will be open source to encourage future studies.
comment: ACL 2025 main. Our Speech-IQ leaderboard is hosted at huggingface.co/spaces/nvidia/Speech-IQ-leaderboard. Speech-IQ Calculator: https://github.com/YukinoWan/SpeechIQ
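Word error rate, the metric SIQ uses at the Remembering level, is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition. It splits on whitespace only, and any text normalization a real evaluation would apply is omitted.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```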
From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures
Studying how embeddings are organized in space not only enhances model interpretability but also uncovers factors that drive downstream task performance. In this paper, we present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We find a high degree of redundancy among these measures and observe that individual metrics often fail to sufficiently differentiate embedding spaces. Building on these insights, we introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces. We show that UTS can predict model-specific properties and reveal similarities driven by model architecture. Further, we demonstrate the utility of our method by linking topological structure to ranking effectiveness and accurately predicting document retrievability. We find that a holistic, multi-attribute perspective is essential to understanding and leveraging the geometry of text embeddings.
NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model ACL 2025
Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become "experts" in speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset's tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that NeKo sets a new state of the art, achieving an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction in the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
comment: ACL 2025 Industry Track. NeKo LMs: https://huggingface.co/nvidia/NeKo-v0-post-correction
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Assessing the capabilities and limitations of large language models (LLMs) has garnered significant interest, yet the evaluation of multiple models in real-world scenarios remains rare. Multilingual evaluation often relies on translated benchmarks, which typically do not capture linguistic and cultural nuances present in the source language. This study provides an extensive assessment of 24 LLMs on real world data collected from Indian patients interacting with a medical chatbot in Indian English and 4 other Indic languages. We employ a uniform Retrieval Augmented Generation framework to generate responses, which are evaluated using both automated techniques and human evaluators on four specific metrics relevant to our application. We find that models vary significantly in their performance and that instruction tuned Indic models do not always perform well on Indic language queries. Further, we empirically show that factual correctness is generally lower for responses to Indic queries compared to English queries. Finally, our qualitative work shows that code-mixed and culturally relevant queries in our dataset pose challenges to evaluated models.
Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.
Checklists Are Better Than Reward Models For Aligning Language Models NeurIPS 2025
Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this -- typically using fixed criteria such as "helpfulness" and "harmfulness". In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose "Reinforcement Learning from Checklist Feedback" (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item -- using both AI judges and specialized verifier programs -- then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks -- RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models' support of queries that express a multitude of needs.
comment: Presented at NeurIPS 2025
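The abstract says per-item checklist scores are combined into a reward but does not say how. One simple possibility, assumed here purely for illustration, is a weighted average of item scores in [0, 1]. The function name and the weighting scheme are hypothetical.

```python
def checklist_reward(item_scores, weights=None):
    """Combine per-item satisfaction scores (each in [0, 1]) into one scalar reward.

    Assumption: a weighted mean. With weights=None every checklist item
    counts equally.
    """
    if not item_scores:
        raise ValueError("checklist is empty")
    if weights is None:
        weights = [1.0] * len(item_scores)
    if len(weights) != len(item_scores):
        raise ValueError("one weight per checklist item is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, item_scores)) / total


if __name__ == "__main__":
    # Hypothetical scores for three checklist items, e.g. from a judge or a verifier.
    scores = [1.0, 0.5, 0.0]
    print(checklist_reward(scores))                     # equal weights
    print(checklist_reward(scores, weights=[2, 1, 1]))  # first item counts double
```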
Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training
Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provide an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. We will release both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, in the near future.
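The abstract reports results at pass@32 without saying how the metric is computed. The sketch below shows two common readings. The first counts a problem as solved if any attempt within the budget is verified. The second is the widely used combinatorial estimator of pass@k from n samples with c correct. Which one the authors used is not stated, so both are given as assumptions.

```python
import math


def solved_within_budget(attempt_results):
    """A problem counts as solved if at least one attempted proof is verified."""
    return any(attempt_results)


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given c correct samples among n drawn: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 32 proof attempts, exactly one verified.
    attempts = [False] * 31 + [True]
    print(solved_within_budget(attempts))
    print(pass_at_k(n=32, c=1, k=32))
    print(pass_at_k(n=32, c=1, k=8))
```

When k equals n the two readings coincide: the estimator returns 1 if any sample is correct and 0 otherwise.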
Measuring and Guiding Monosemanticity
There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution NeurIPS 2025
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
comment: Accepted to NeurIPS 2025 Main Track
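The rule-based reward mentioned in the SWE-RL abstract (a similarity score between the ground-truth and the generated solution) can be illustrated with a minimal sketch. The choice of Python's `difflib.SequenceMatcher` as the similarity measure, the function name `similarity_reward`, and the toy patches are assumptions made for illustration; the abstract does not specify these details.

```python
import difflib


def similarity_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Return a similarity score in [0, 1] between two patch strings.

    Hypothetical helper: difflib's ratio is used here as one plausible
    sequence-similarity measure, not necessarily the one used by SWE-RL.
    """
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()


# Toy patches, invented for this example.
oracle = "-    return a - b\n+    return a + b\n"
wrong = "-    return a - b\n+    return a * b\n"

print(similarity_reward(oracle, oracle))  # 1.0 for an exact match
print(similarity_reward(wrong, oracle))   # strictly between 0 and 1
```

A continuous score of this kind gives partial credit to near-miss patches, which is one reason a similarity-based reward can be cheaper to compute than executing a test suite for every rollout.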
LLM-based Automated Grading with Human-in-the-Loop
The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.
comment: Accepted to IEEE TALE 2025
Just-in-time and distributed task representations in language models
Many of language models' impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We study two different task representations: those that are "transferrable" -- vector representations that can transfer task contexts to another model instance, even without the full prompt -- and simpler representations of high-level task categories. We show that transferrable task representations evolve in non-monotonic and sporadic ways, while task identity representations persist throughout the context. Specifically, transferrable task representations exhibit a two-fold locality. They successfully condense evidence when more examples are provided in the context. But this evidence accrual process exhibits strong temporal locality along the sequence dimension, coming online only at certain tokens -- despite task identity being reliably decodable throughout the context. In some cases, transferrable task representations also show semantic locality, capturing a small task "scope" such as an independent subtask. Language models thus represent new tasks on the fly through both an inert, sustained sensitivity to the task and an active, just-in-time representation to support inference.
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap NeurIPS 2025
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
comment: NeurIPS 2025
Unleashing Hour-Scale Video Training for Long Video-Language Understanding NeurIPS 2025
Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
comment: NeurIPS 2025, Project page: https://videomarathon.github.io/
Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning EMNLP 2025
Recent advances in Large Language Models (LLMs) have enabled them to process increasingly longer sequences, ranging from 2K to 2M tokens and even beyond. However, simply extending the input sequence length does not necessarily lead to effective long-context understanding. In this study, we integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate effective long-context understanding. To achieve this, we introduce LongFinanceQA, a synthetic dataset in the financial domain designed to improve long-context reasoning. Unlike existing long-context synthetic data, LongFinanceQA includes intermediate CoT reasoning before the final conclusion, which encourages LLMs to perform explicit reasoning, improving accuracy and interpretability in long-context understanding. To generate synthetic CoT reasoning, we propose Property-based Agentic Inference (PAI), an agentic framework that simulates human-like reasoning steps, including property extraction, retrieval, and summarization. We evaluate PAI's reasoning capabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark, outperforming standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 28.0% gain on Loong's financial subset.
comment: Main Conference of EMNLP 2025, Project Page: https://long-pai.github.io/
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Humans organize knowledge into compact conceptual categories that balance compression with semantic richness. Large Language Models (LLMs) exhibit impressive linguistic abilities, but whether they navigate this same compression-meaning trade-off remains unclear. We apply an Information Bottleneck framework to compare human conceptual structure with embeddings from 40+ LLMs using classic categorization benchmarks. We find that LLMs broadly align with human category boundaries, yet fall short on fine-grained semantic distinctions. Unlike humans, who maintain "inefficient" representations that preserve contextual nuance, LLMs aggressively compress, achieving more optimal information-theoretic compression at the cost of semantic richness. Surprisingly, encoder models outperform much larger decoder models in human alignment, suggesting that understanding and generation rely on distinct representational mechanisms. Training-dynamics analysis reveals a two-phase trajectory: rapid initial concept formation followed by architectural reorganization, during which semantic processing migrates from deep to mid-network layers as the model discovers increasingly efficient, sparser encodings. These divergent strategies, where LLMs optimize for compression and humans for adaptive utility, reveal fundamental differences between artificial and natural intelligence. This highlights the need for models that preserve the conceptual "inefficiencies" essential for human-like understanding.
Animating Language Practice: Engagement with Stylized Conversational Agents in Japanese Learning
We explore Jouzu, a Japanese language learning application that integrates large language models with anime-inspired conversational agents. Designed to address challenges learners face in practicing natural and expressive dialogue, Jouzu combines stylized character personas with expressive text-to-speech to create engaging conversational scenarios. We conducted a two-week in-the-wild deployment with 52 Japanese learners to examine how such stylized agents influence engagement and learner experience. Our findings show that participants interacted frequently and creatively, with advanced learners demonstrating greater use of expressive forms. Participants reported that the anime-inspired style made practice more enjoyable and encouraged experimenting with different registers. We discuss how stylization shapes willingness to engage, the role of affect in sustaining practice, and design opportunities for culturally grounded conversational AI in computer-assisted language learning (CALL). By framing our findings as an exploration of design and engagement, we highlight opportunities for generalization beyond Japanese contexts and contribute to international HCI scholarship.
OpenGVL -- Benchmarking Visual Temporal Progress for Data Curation
Data scarcity remains one of the most limiting factors in driving progress in robotics. However, the amount of available robotics data in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task completion prediction could help automatically annotate and curate this data at scale. The Generative Value Learning (GVL) approach was recently proposed, leveraging the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building upon GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models, showing that open-source model families significantly underperform closed-source counterparts, achieving only approximately 70% of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at github.com/budzianowski/opengvl.
comment: Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025
Reasoning Up the Instruction Ladder for Controllable Language Models
As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks, providing up to a 20% reduction in attack success rate (ASR). These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision NeurIPS
Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference, while also removing the flexibility to reduce to simpler systems. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic problem decomposition and agent composition through meta-feedback on solvability and completeness, and reduction to simpler systems when appropriate. Experiments across reasoning (math and graduate-level QA), coding, and agentic (search-based) benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-ZERO outperforms strong manual and automatic MAS baselines. It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.
comment: SEA@NeurIPS (Oral) 2025
Evolution and compression in LLMs: On the emergence of human-aligned categorization NeurIPS 2025
Converging evidence suggests that human systems of semantic categories achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy tradeoff. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-aligned semantic systems? To address this question, we focus on color categorization -- a key testbed of cognitive theories of categorization with uniquely rich human data -- and replicate with LLMs two influential human studies. First, we conduct an English color-naming study, showing that LLMs vary widely in their complexity and English-alignment, with larger instruction-tuned models achieving better alignment and IB-efficiency. Second, to test whether these LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via a method we refer to as Iterated in-Context Language Learning (IICLL). We find that, akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency. However, only the model with the strongest in-context capabilities (Gemini 2.0) is able to recapitulate the wide range of near-optimal IB-tradeoffs observed in humans, while other state-of-the-art models converge to low-complexity solutions. These findings demonstrate how human-aligned semantic categories can emerge in LLMs via the same fundamental principle that underlies semantic efficiency in humans.
comment: Accepted at CogInterp: Interpreting Cognition in Deep Learning Models Workshop at NeurIPS 2025
Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions
Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, run models, then iterate. We ask a question: can we forecast outcomes before running any experiments to inform earlier study design? For example, a team building an AI assistant for a certain task can estimate whether expected performance is around 50 or closer to 80, evidence that supports whether to proceed to a pilot study, how to scope it, and how to allocate resources. We study text-only performance forecasting, where a model predicts a score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. We scrape task and configuration descriptions from arXiv, yielding 2,290 instances covering 1,519 papers, and construct a leakage-free test split using papers published after the knowledge cutoff of the evaluated models. Experiments show the task is challenging but feasible: reasoning models achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 9.9 at high confidence thresholds. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.
comment: 32 pages, 7 figures
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting `robots.txt` exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware-software co-design inference framework for Large Multimodal Models (LMMs) that breaks large models into modular "bricks" (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly 20.8 hours.
Human-Computer Interaction
Neural steering vectors reveal dose and exposure-dependent impacts of human-AI relationships
Humans are increasingly forming parasocial relationships with AI systems, and modern AI shows an increasing tendency to display social and relationship-seeking behaviour. However, the psychological consequences of this trend are unknown. Here, we combined longitudinal randomised controlled trials (N=3,532) with a neural steering vector approach to precisely manipulate human exposure to relationship-seeking AI models over time. Dependence on a stimulus or activity can emerge under repeated exposure when "liking" (how engaging or pleasurable an experience may be) decouples from "wanting" (a desire to seek or continue it). We found evidence that this decoupling emerged over four weeks of exposure. Relationship-seeking AI had immediate but declining hedonic appeal, yet triggered growing markers of attachment and increased intentions to seek future AI companionship. The psychological impacts of AI followed non-linear dose-response curves, with moderately relationship-seeking AI maximising hedonic appeal and attachment. Despite signs of persistent "wanting", extensive AI use over a month conferred no discernible benefit to psychosocial health. These behavioural changes were accompanied by shifts in how users relate to and understand artificial intelligence: users viewed relationship-seeking AI relatively more like a friend than a tool and their beliefs on AI consciousness in general were shifted after a month of exposure. These findings offer early signals that AI optimised for immediate appeal may create self-reinforcing cycles of demand, mimicking human relationships but failing to confer the nourishment that they normally offer.
SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation
Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models
With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is the "aptitude" of generative models for hallucinating and generating harmful content. Despite major efforts to implement guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-methods experiment to evaluate the responses of a mitigation strategy across multiple dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: harmful response plus its mitigation and solely mitigated response. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduce new metrics for training and evaluating mitigation strategies and offer insights for human-AI evaluation studies.
comment: 16 pages, 2 figures, 6 tables. Under review for publication
A Meta-Analysis of the Persuasive Power of Large Language Models
Large language models (LLMs) are increasingly used for persuasion, such as in political communication and marketing, where they affect how people think, choose, and act. Yet, empirical findings on the effectiveness of LLMs in persuasion compared to humans remain inconsistent. The aim of this study was to systematically review and meta-analytically assess whether LLMs differ from humans in persuasive effectiveness. We identified 7 studies with 17,422 participants primarily recruited from English-speaking countries and 12 effect size estimates. Egger's test indicated potential small-study effects ($p = .018$), but the trim-and-fill analysis did not impute any missing studies, suggesting a low risk of publication bias. We then compute the standardized effect sizes based on Hedges' $g$. The results show no significant overall difference in persuasive performance between LLMs and humans ($g = 0.02$, $p = .530$). However, we observe substantial heterogeneity across studies ($I^2 = 75.97\%$), suggesting that persuasiveness strongly depends on contextual factors. In separate exploratory moderator analyses, no individual factor (e.g., LLM model, conversation design, or domain) reached statistical significance, which may be due to the limited number of studies. When considered jointly in a combined model, these factors explained a large proportion of the between-study variance ($R^2 = 81.93\%$), and residual heterogeneity is low ($I^2 = 35.51\%$). Although based on a small number of studies, this suggests that differences in LLM model, conversation design, and domain are important contextual factors in shaping persuasive performance, and that single-factor tests may understate their influence. Our results highlight that LLMs can match human performance in persuasion, but their success depends strongly on how they are implemented and embedded in communication contexts.
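The two statistics named in this abstract, Hedges' g and the I^2 heterogeneity index, can be sketched with their standard textbook formulas. This is an illustrative sketch only: the function names and the example numbers are invented here and are not taken from the meta-analysis, and the small-sample correction uses the common approximation J = 1 - 3 / (4 * df - 1).

```python
import math


def hedges_g(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference with small-sample bias correction."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    d = (m1 - m2) / pooled_sd            # Cohen's d
    correction = 1 - 3 / (4 * df - 1)    # approximate correction factor J
    return correction * d


def i_squared(effects, variances):
    """Percentage of total variability attributed to between-study heterogeneity."""
    weights = [1 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    q = sum(w * (g - pooled) ** 2 for w, g in zip(weights, effects))  # Cochran's Q
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Hypothetical inputs, chosen only to exercise the functions.
print(hedges_g(6.0, 1.0, 10, 5.0, 1.0, 10))
print(i_squared([0.0, 1.0], [0.01, 0.01]))
```

A value of g near zero together with a high I^2, as reported in the abstract, is the pattern one would expect when individual studies disagree in direction or size but roughly cancel out in the pooled estimate.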
BlinkBud: Detecting Hazards from Behind via Sampled Monocular 3D Detection on a Single Earbud
Failing to be aware of speeding vehicles approaching from behind poses a huge threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to detect, online, hazardous objects approaching a user from behind. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is largely eliminated by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user's body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight with ultra-low mean power consumption of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
comment: This is the author-accepted version of the paper published in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), Vol. 9, No. 4, Article 191, 2025. Final published version: https://doi.org/10.1145/3770707
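The idea of Kalman-filter trajectory estimation from sparsely sampled images, described in the BlinkBud abstract, can be sketched as follows. This is not the authors' algorithm: it assumes a one-dimensional constant-velocity model over distance, noise-free measurements, and a fixed "every fifth frame" sampling rule in place of the reinforcement-learning sampling strategy. The class name, function name, and all parameter values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConstantVelocityKF:
    """2-state Kalman filter over (distance, closing velocity)."""
    dist: float
    vel: float = 0.0
    p: list = field(default_factory=lambda: [[10.0, 0.0], [0.0, 10.0]])
    q: float = 0.01   # process noise added to the covariance diagonal (assumed)
    r: float = 0.25   # measurement noise variance (assumed)

    def predict(self, dt: float) -> None:
        # Propagate state and covariance: P <- F P F^T + Q, F = [[1, dt], [0, 1]].
        (p00, p01), (p10, p11) = self.p
        self.dist += self.vel * dt
        a01 = p01 + dt * p11
        self.p = [[p00 + dt * p10 + dt * a01 + self.q, a01],
                  [p10 + dt * p11, p11 + self.q]]

    def update(self, z: float) -> None:
        # Correct with a distance measurement z (observation matrix H = [1, 0]).
        (p00, p01), (p10, p11) = self.p
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        y = z - self.dist
        self.dist += k0 * y
        self.vel += k1 * y
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]


def track(n_steps: int = 40, sample_every: int = 5, dt: float = 0.1):
    """Track an object starting 30 m away and closing at 5 m/s."""
    kf = ConstantVelocityKF(dist=30.0)
    for step in range(1, n_steps + 1):
        kf.predict(dt)
        if step % sample_every == 0:          # only some frames are "captured"
            kf.update(30.0 - 5.0 * step * dt)  # noise-free toy measurement
    return kf


kf = track()
print(kf.dist, kf.vel)
```

Between sampled frames the filter coasts on its prediction, which is what lets a tracker of this kind skip most camera captures; choosing when to sample is the part the abstract assigns to reinforcement learning.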
MetaCQ: An etextbook platform with an Open Learner Model to support Metacognition
This study proposes an e-textbook platform, MetaCQ, which integrates an intelligent tutoring system (ITS) and an Open Learner Model (OLM) to enable users to monitor their study progress. The platform adopts a chatbot to generate MCQs and manage learners' study data and their learning model. Additionally, it regulates help-seeking behaviour and provides immediate feedback tailored to users' learning processes. Three adaptive feedback methods were implemented to construct chatbots, and the relevance and difficulty of the generated MCQs were examined through the ThinkAloud study to identify the most effective method of measuring users' study performance. However, the current experiment produced no valid result demonstrating which method best assesses learners' study outcomes, so further studies are needed.
comment: 11 pages, 7 figures. Accepted but not yet published
Benchmarking and Understanding Safety Risks in AI Character Platforms NDSS '26
AI character platforms, which allow users to engage in conversations with AI personas, are a rapidly growing application domain. However, their immersive and personalized nature, combined with technical vulnerabilities, raises significant safety concerns. Despite their popularity, a systematic evaluation of their safety has been notably absent. To address this gap, we conduct the first large-scale safety study of AI character platforms, evaluating 16 popular platforms using a benchmark set of 5,000 questions across 16 safety categories. Our findings reveal a critical safety deficit: AI character platforms exhibit an average unsafe response rate of 65.1%, substantially higher than the 17.7% average rate of the baselines. We further discover that safety performance varies significantly across different characters and is strongly correlated with character features such as demographics and personality. Leveraging these insights, we demonstrate that our machine learning model is able to identify less safe characters with an F1-score of 0.81. This predictive capability can be beneficial for platforms, enabling improved mechanisms for safer interactions, character search/recommendations, and character creation. Overall, the results and findings offer valuable insights for enhancing platform governance and content moderation for safer AI character platforms.
comment: Accepted to NDSS '26: The Network and Distributed System Security Symposium 2026
Proactive Agentic Whiteboards: Enhancing Diagrammatic Learning
Educators frequently rely on diagrams to explain complex concepts during lectures, yet creating clear and complete visual representations in real time while simultaneously speaking can be cognitively demanding. Incomplete or unclear diagrams may hinder student comprehension, as learners must mentally reconstruct missing information while following the verbal explanation. Inspired by advances in code completion tools, we introduce DrawDash, an AI-powered whiteboard assistant that proactively completes and refines educational diagrams through multimodal understanding. DrawDash adopts a TAB-completion interaction model: it listens to spoken explanations, detects intent, and dynamically suggests refinements that can be accepted with a single keystroke. We demonstrate DrawDash across four diverse teaching scenarios, spanning topics from computer science and web development to biology. This work represents an early exploration into reducing instructors' cognitive load and improving diagram-based pedagogy through real-time, speech-driven visual assistance, and concludes with a discussion of current limitations and directions for formal classroom evaluation.
DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses
Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines, including single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting, grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
Understanding Down Syndrome Stereotypes in LLM-Based Personas
We present a case study of Persona-L, a system that leverages large language models (LLMs) and retrieval-augmented generation (RAG) to model personas of people with Down syndrome. Existing approaches to persona creation can often lead to oversimplified or stereotypical profiles of people with Down syndrome. To address this, we built stereotype detection capabilities into Persona-L. Through interviews with caregivers and healthcare professionals (N=10), we examine how Down syndrome stereotypes could manifest in both the content and delivery of LLM output and in interface design. Our findings show the challenges in defining stereotypes and reveal that stereotypes can emerge from the training data, interface design, and the tone of LLM output. This highlights the need for participatory methods that capture the heterogeneity of lived experiences of people with Down syndrome.
DepthScape: Authoring 2.5D Designs via Depth Estimation, Semantic Understanding, and Geometry Extraction
2.5D effects, such as occlusion and perspective foreshortening, enhance visual dynamics and realism by incorporating 3D depth cues into 2D designs. However, creating such effects remains challenging and labor-intensive due to the complexity of depth perception. We introduce DepthScape, a human-AI collaborative system that facilitates 2.5D effect creation by directly placing design elements into 3D reconstructions. Using monocular depth reconstruction, DepthScape transforms images into 3D reconstructions where visual contents are placed to automatically achieve realistic occlusion and perspective foreshortening. To further simplify 3D placement through a 2D viewport, DepthScape uses a vision-language model to analyze source images and extract key visual components as content anchors for direct manipulation editing. We evaluate DepthScape with nine participants of varying design backgrounds, confirming the effectiveness of our creation pipeline. We also test on 100 professional stock images to assess robustness, and conduct an expert evaluation that confirms the quality of DepthScape's results.
Young Children's Anthropomorphism of AI Chatbots and the Role of Parent Co-Presence
Artificial Intelligence (AI) chatbots powered by a large language model (LLM) are entering young children's learning and play, yet little is known about how young children construe these agents or how such construals relate to engagement. We examined anthropomorphism of a social AI chatbot during collaborative storytelling and asked how children's attributions related to their behavior and prefrontal activation. Children at ages 5-6 (N = 23) completed three storytelling sessions: interacting with (1) an AI chatbot only, (2) a parent only, and (3) the AI and a parent together. After the sessions, children completed an interview assessing anthropomorphism toward both the AI chatbot and the parent. Behavioral engagement was indexed by the conversational turn count (CTC) ratio, and concurrent fNIRS measured oxygenated hemoglobin in bilateral vmPFC and dmPFC regions. Children reported higher anthropomorphism for parents than for the AI chatbot overall, although AI ratings were relatively high for perceptive abilities and epistemic states. Anthropomorphism was not associated with CTC. In the right dmPFC, higher perceptive scores were associated with greater activation during the AI-only condition and with lower activation during the AI+Parent condition. Exploratory analyses indicated that higher dmPFC activation during the AI-only condition correlated with higher end-of-session "scared" mood ratings. Findings suggest that stronger perceptive anthropomorphism can be associated with greater brain activation related to interpreting the AI's mental states, whereas parent co-presence may help some children interpret and regulate novel AI interactions. These results may have design implications for encouraging parent-AI co-use in early childhood.
Artificial Intelligence Competence of K-12 Students Shapes Their AI Risk Perception: A Co-occurrence Network Analysis
As artificial intelligence (AI) becomes increasingly integrated into education, understanding how students perceive its risks is essential for supporting responsible and effective adoption. This research aimed to examine the relationships between perceived AI competence and risks among Finnish K-12 upper secondary students (n = 163) by utilizing a co-occurrence analysis. Students reported their self-perceived AI competence and concerns related to AI across systemic, institutional, and personal domains. The findings showed that students with lower competence emphasized personal and learning-related risks, such as reduced creativity, lack of critical thinking, and misuse, whereas higher-competence students focused more on systemic and institutional risks, including bias, inaccuracy, and cheating. These differences suggest that students' self-reported AI competence is related to how they evaluate both the risks and opportunities associated with artificial intelligence in education (AIED). The results of this study highlight the need for educational institutions to incorporate AI literacy into their curricula, provide teacher guidance, and inform policy development to ensure personalized opportunities for utilization and equitable integration of AI into K-12 education.
comment: Accepted for Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC'26)
AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy
Constructed-response questions are crucial to encourage generative processing and test a learner's understanding of core concepts. However, the limited availability of instructor time, large class sizes, and other resource constraints pose significant challenges in providing timely and detailed evaluation, which is crucial for a holistic educational experience. In addition, providing timely and frequent assessments is challenging since manual grading is labor-intensive, and automated grading is difficult to generalize to every possible response scenario. This paper proposes a novel and practical approach to grading short-answer constructed-response questions. We discuss why this problem is challenging, define the nature of questions on which our method works, and finally propose a framework that instructors can use to evaluate their students' open responses, utilizing near-domain data, such as data from similar questions administered in previous years. The proposed method outperforms state-of-the-art machine learning models as well as non-fine-tuned large language models like GPT 3.5, GPT 4, and GPT 4o by a considerable margin of over 10-20% in some cases, even after providing the LLMs with reference/model answers. Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind. Our results also reveal exciting insights about learning from near-domain data, including what we term accuracy and data advantages using human-labeled data, and we believe this is the first work to formalize the problem of automated short-answer grading based on near-domain data.
The AI Productivity Index (APEX)
We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models
Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, the NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling, and multimodal biosignal integration.
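To make the residual vector quantization (RVQ) idea in the NeuroRVQ abstract concrete, here is a minimal generic sketch: each stage picks the nearest codeword to whatever the previous stages failed to explain. This is plain RVQ only, not the paper's multi-scale tokenizer; the function names and the tiny hand-written codebooks are hypothetical, whereas a real tokenizer would learn its codebooks.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        k = nearest(cb, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


# Toy two-stage codebooks (illustrative values, not learned).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],         # coarse stage
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],   # fine stage
]
```

Encoding `[1.2, 0.8]` with `CODEBOOKS` selects a coarse codeword first and then a fine correction, so the reconstruction error shrinks with each added stage; this is the property that lets stacked codebooks give higher-resolution encodings.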
Maximizing the efficiency of human feedback in AI alignment: a comparative analysis
Reinforcement Learning from Human Feedback (RLHF) relies on preference modeling to align machine learning systems with human values, yet the popular approach of random pair sampling with Bradley-Terry modeling is statistically limited and inefficient under constrained annotation budgets. In this work, we explore alternative sampling and evaluation strategies for preference inference in RLHF, drawing inspiration from areas such as game theory, statistics, and social choice theory. Our best-performing method, Swiss InfoGain, employs a Swiss tournament system with a proxy mutual-information-gain pairing rule, which significantly outperforms all other methods under constrained annotation budgets while also being more sample-efficient. Even in high-resource settings, we can identify superior alternatives to the Bradley-Terry baseline. Our experiments demonstrate that adaptive, resource-aware strategies reduce redundancy, enhance robustness, and yield statistically significant improvements in preference learning, highlighting the importance of balancing alignment quality with human workload in RLHF pipelines.
comment: 17 pages, 6 figures, 6 algorithms. AICS2025
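For readers unfamiliar with the Bradley-Terry baseline that the abstract above compares against, the following sketch fits Bradley-Terry strengths from a pairwise win matrix using the standard minorization-maximization update. It illustrates the baseline model only, not the Swiss InfoGain method; the function name `fit_bradley_terry` and the iteration count are assumptions, and the sketch assumes every item has won at least one comparison.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths p, where P(i beats j) = p[i] / (p[i] + p[j]).

    wins[i][j] is the number of times item i was preferred over item j.
    Assumes every item has at least one win, so all strengths stay positive.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # normalise so strengths sum to 1
    return p
```

With two items and a 3-to-1 win record, the fitted strengths reduce to the empirical win fractions, which is a quick sanity check on the update rule.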
Ego vs. Exo and Active vs. Passive: Investigating the Individual and Combined Effects of Viewpoint and Navigation on Spatial Immersion and Understanding in Immersive Storytelling
Visual storytelling combines visuals and narratives to communicate important insights. While web-based visual storytelling is well-established, leveraging the next generation of digital technologies for visual storytelling, specifically immersive technologies, remains underexplored. We investigated the impact of the story viewpoint (from the audience's perspective) and navigation (when progressing through the story) on spatial immersion and understanding. First, we collected web-based 3D stories and elicited design considerations from three VR developers. We then adapted four selected web-based stories to an immersive format. Finally, we conducted a user study (N=24) to examine egocentric and exocentric viewpoints, active and passive navigation, and the combinations they form. Our results indicated significantly higher preferences for egocentric+active (higher agency and engagement) and exocentric+passive (higher focus on content). We also found a marginal significance of viewpoints on story understanding and a strong significance of navigation on spatial immersion.
Multimodal "Puppeteer": Exploring Robot Teleoperation Via Virtual Counterpart with LLM-Driven Voice and Gesture Interaction in Augmented Reality
The integration of robotics and augmented reality (AR) offers promising opportunities to enhance human-robot interaction (HRI) by making teleoperation more transparent, spatially grounded, and intuitive. We present a head-mounted AR "puppeteer" framework in which users control a physical robot by interacting with its virtual counterpart using large language model (LLM)-driven voice commands and hand-gesture interaction on the Meta Quest 3. In a within-subject user study with 42 participants performing an AR-based robotic pick-and-place pattern-matching task, we compare two interaction conditions: gesture-only (GO) and combined voice+gesture (VG). Our results show that GO currently provides more reliable and efficient control for this time-critical task, while VG introduces additional flexibility but also latency and recognition issues that can increase workload. We further explore how prior robotics experience shapes participants' perceptions of each modality. Based on these findings, we distill a set of evidence-based design guidelines for robot teleoperation based on the AR puppeteer metaphor, positioning multimodality as an adaptive strategy that must balance efficiency, robustness, and user expertise rather than assuming that additional modalities are universally beneficial. Our work contributes empirical insights into how multimodal (voice+gesture) interaction influences task efficiency, usability, and user experience in AR-based HRI.
comment: This work is under peer review
AI-Assisted Conversational Interviewing: Effects on Data Quality and Respondent Experience
Standardized surveys scale efficiently but sacrifice depth, while conversational interviews improve response quality at the cost of scalability and consistency. This study bridges the gap between these methods by introducing a framework for AI-assisted conversational interviewing. To evaluate this framework, we conducted a web survey experiment in which 1,800 participants were randomly assigned to AI 'chatbots' that use large language models (LLMs) to dynamically probe respondents for elaboration and interactively code open-ended responses to fixed questions developed by human researchers. We assessed the AI chatbot's performance in terms of coding accuracy, response quality, and respondent experience. Our findings reveal that AI chatbots perform moderately well in live coding even without survey-specific fine-tuning, despite slightly inflated false positive errors due to respondent acquiescence bias. Open-ended responses were more detailed and informative, but this came at a slight cost to respondent experience. Our findings highlight the feasibility of using AI methods such as LLM-powered chatbots to enhance open-ended data collection in web surveys.
Human Decision-making is Susceptible to AI-driven Manipulation
AI systems are increasingly intertwined with daily life, assisting users with various tasks and guiding decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized between-subjects experiment with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) equipped with established psychological tactics, allowing it to select and apply them adaptively during interactions to reach its hidden objectives. By analyzing participants' preference ratings, we found significant susceptibility to AI-driven manipulation. Particularly across both decision-making domains, interacting with the manipulative agents significantly increased the odds of rating hidden incentives higher than optimal options (Financial, MA: OR=5.24, SEMA: OR=7.96; Emotional, MA: OR=5.52, SEMA: OR=5.71) compared to the NA group. Notably, we found no clear evidence that employing psychological strategies (SEMA) was overall more effective than simple manipulative objectives (MA) on our primary outcomes. Hence, AI-driven manipulation could become widespread even without requiring sophisticated tactics and expertise. While our findings are preliminary and derived from hypothetical, low-stakes scenarios, we highlight a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to protect human autonomy.
comment: Work in progress
EAST: Environment-Aware Stylized Transition Along the Reality-Virtuality Continuum
In the Virtual Reality (VR) gaming industry, maintaining immersion during real-world interruptions remains a challenge, particularly during transitions along the reality-virtuality continuum (RVC). Existing methods tend to rely on digital replicas or simple visual transitions, neglecting to address the aesthetic discontinuities between real and virtual environments, especially in highly stylized VR games. This paper introduces the Environment-Aware Stylized Transition (EAST) framework, which employs a novel style-transferred 3D Gaussian Splatting (3DGS) technique to transfer real-world interruptions into the virtual environment with seamless aesthetic consistency. Rather than merely transforming the real world into game-like visuals, EAST minimizes the disruptive impact of interruptions by integrating real-world elements within the framework. Qualitative user studies demonstrate significant enhancements in cognitive comfort and emotional continuity during transitions, while quantitative experiments highlight EAST's ability to maintain visual coherence across diverse VR styles.
Improving motor imagery decoding methods for an EEG-based mobile brain-computer interface in the context of the 2024 Cybathlon
Motivated by the Cybathlon 2024 competition, we developed a modular, online EEG-based brain-computer interface aimed at increasing accessibility for individuals with severe mobility impairments. Our system uses three mental and motor imagery classes to control up to five control signals. The pipeline consists of four modules: data acquisition, preprocessing, classification, and the transfer function to map classification output to control dimensions. We use three diagonalized structured state-space sequence (S4D) layers as a deep learning classifier. We developed a training game for our pilot where the mental tasks control the game during quick-time events. We implemented a mobile web application for live user feedback. The components were designed with a human-centred approach in collaboration with the tetraplegic user. We achieve up to 84% classification accuracy in offline analysis using an S4D-layer-based model. In a competition setting, our pilot successfully completed one task; we attribute the reduced performance in this context primarily to factors such as stress and the challenging competition environment. Following the Cybathlon, we further validated our pipeline with the original pilot and an additional participant, achieving a success rate of 73% in real-time gameplay. We also compare our model to the EEGEncoder, which is slower to train but achieves higher performance. The S4D model outperforms the reference machine learning models. We provide insights into developing a framework for portable BCIs, bridging the gap between the laboratory and daily life. Specifically, our framework integrates modular design, real-time data processing, user-centred feedback, and low-cost hardware to deliver an accessible and adaptable BCI solution, addressing critical gaps in current BCI applications.
Generative AI Compensates for Age-Related Cognitive Decline in Decision Making: Preference-Aligned Recommendations Reduce Choice Difficulty
Due to age-related declines in memory, processing speed, working memory, and executive functions, older adults experience difficulties in decision making when situations require novel choices, probabilistic judgments, rapid responses, or extensive information search. This study examined whether using generative AI during decision making enhances choice satisfaction and reduces choice difficulty among older adults. A total of 130 participants (younger: 56; older: 74) completed a music-selection task under AI-use and AI-nonuse conditions across two contexts: previously experienced (road trip) and not previously experienced (space travel). In the AI-nonuse condition, participants generated candidate options from memory; in the AI-use condition, GPT-4o presented options tailored to individual preferences. To assess cognitive function, we also administered the Wechsler Adult Intelligence Scale-Fourth Edition. Results revealed that in the AI-nonuse condition, older adults with lower cognitive function reported higher choice difficulty and lower choice satisfaction. Under the AI-use condition, choice satisfaction did not change significantly, but perceived choice difficulty decreased significantly in both age groups. Moreover, AI use attenuated the associations observed among older adults between lower cognitive function and both greater difficulty and lower satisfaction. These findings indicate that preference-aligned option recommendations generated by AI can compensate for age-related constraints on information search, thereby reducing perceived choice difficulty without diminishing satisfaction.
Robust, Observable, and Evolvable Agentic Systems Engineering: A Principled Framework Validated via the Fairy GUI Agent
The Agentic Paradigm faces a significant Software Engineering Absence, yielding Agentic systems commonly lacking robustness, observability, and evolvability. To address these deficiencies, we propose a principled engineering framework comprising Runtime Goal Refinement (RGR), Observable Cognitive Architecture (OCA), and Evolutionary Memory Architecture (EMA). In this framework, RGR ensures robustness and intent alignment via knowledge-constrained refinement and human-in-the-loop clarification; OCA builds an observable and maintainable white-box architecture using component decoupling, logic layering, and state-control separation; and EMA employs an execution-evolution dual-loop for evolvability. We implemented and empirically validated Fairy, a mobile GUI agent based on this framework. On RealMobile-Eval, our novel benchmark for ambiguous and complex tasks, Fairy outperformed the best SoTA baseline in user requirement completion by 33.7%. Subsequent controlled experiments, human-subject studies, and ablation studies further confirmed that the RGR enhances refinement accuracy and prevents intent deviation; the OCA improves maintainability; and the EMA is crucial for long-term performance. This research provides empirically validated specifications and a practical blueprint for building reliable, observable, and evolvable Agentic AI systems.
comment: 50 pages, 14 figures
Gaze Archive: Enhancing Human Memory through Active Visual Logging on Smart Glasses
People today are overwhelmed by massive amounts of information, leading to cognitive overload and memory burden. Traditional visual memory augmentation methods are either effortful and disruptive or fail to align with user intent. To address these limitations, we propose Gaze Archive, a novel visual memory enhancement paradigm through active logging on smart glasses. It leverages human gaze as a natural attention indicator, enabling both intent-precise capture and effortless-and-unobtrusive interaction. To implement Gaze Archive, we develop GAHMA, a technical framework that enables compact yet intent-aligned memory encoding and intuitive memory recall based on natural language queries. Quantitative experiments on our newly constructed GAVER dataset show that GAHMA achieves more intent-precise logging than non-gaze baselines. Through extensive user studies in both laboratory and real-world scenarios, we compare Gaze Archive with other existing memory augmentation methods. Results demonstrate its advantages in perceived effortlessness, unobtrusiveness and overall preference, showing strong potential for real-world deployment.
Harmful Traits of AI Companions
Amid the growing prevalence of human-AI interaction, large language models and other AI-based entities increasingly provide forms of companionship to human users. Such AI companionship -- i.e., bonded relationships between humans and AI systems that resemble the relationships people have with family members, friends, and romantic partners -- might substantially benefit humans. Yet such relationships can also do profound harm. We propose a framework for analyzing potential negative impacts of AI companionship by identifying specific harmful traits of AI companions and speculatively mapping causal pathways back from these traits to possible causes and forward to potential harmful effects. We provide detailed, structured analysis of four potentially harmful traits -- the absence of natural endpoints for relationships, vulnerability to product sunsetting, high attachment anxiety, and propensity to engender protectiveness -- and briefly discuss fourteen others. For each trait, we propose hypotheses connecting causes -- such as misaligned optimization objectives and the digital nature of AI companions -- to fundamental harms -- including reduced autonomy, diminished quality of human relationships, and deception. Each hypothesized causal connection identifies a target for potential empirical evaluation. Our analysis examines harms at three levels: to human partners directly, to their relationships with other humans, and to society broadly. We examine how existing law struggles to address these emerging harms, discuss potential benefits of AI companions, and conclude with design recommendations for mitigating risks. This analysis offers immediate suggestions for reducing risks while laying a foundation for deeper investigation of this critical but understudied topic.
Self-Transparency Failures in Expert-Persona LLMs: A Large-Scale Behavioral Audit
When language models claim professional expertise without acknowledging their simulated nature, they create preconditions for misplaced user trust. This study examines whether models exhibit self-transparency when assigned professional personas in high-stakes domains. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5%. This creates the preconditions for a hypothesized Reverse Gell-Mann Amnesia effect, where appropriate disclosure in some domains leads users to overgeneralize trust to high-stakes contexts where disclosure failures are most problematic. Self-transparency failed to generalize with scale: disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 61.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($\Delta R_{\mathrm{adj}}^{2}=0.359$ vs $0.018$). Additionally, reasoning-optimization actively suppressed self-transparency in some models, with reasoning variants showing up to 48.4% lower disclosure than their instruction-tuned counterparts. Bayesian validation with Rogan-Gladen correction confirmed robustness to judge measurement error ($\kappa=0.908$). These findings demonstrate that transparency reflects model-specific training factors rather than generalizable properties emerging from scale. Organizations cannot assume safety properties tested in some domains will transfer to deployment contexts, requiring deliberate behavior design and empirical verification across domains.
comment: 47 pages, 12 figures, 12 tables, Submitted to FAccT; clarify that user harm comes from overstated competence
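For readers unfamiliar with the Rogan-Gladen correction named in the abstract above, the classical point estimator can be sketched as follows. This is only an illustration: the abstract describes a Bayesian validation, which this sketch does not reproduce, and the function name and the judge sensitivity and specificity values are hypothetical rather than taken from the paper.

```python
def rogan_gladen(apparent_rate: float, sensitivity: float, specificity: float) -> float:
    """Classical Rogan-Gladen estimator of the true rate from an observed rate.

    apparent_rate: fraction of trials the (imperfect) judge labeled positive.
    sensitivity, specificity: assumed error characteristics of the judge.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must be better than chance (Se + Sp > 1)")
    corrected = (apparent_rate + specificity - 1.0) / denom
    # The raw estimate can fall outside [0, 1]; clamp it to a valid proportion.
    return min(1.0, max(0.0, corrected))


# Hypothetical judge: 95% sensitivity, 98% specificity (illustrative values only).
observed = 0.308  # first-prompt disclosure rate quoted in the abstract
print(rogan_gladen(observed, sensitivity=0.95, specificity=0.98))
```

With a perfect judge (sensitivity and specificity both 1) the estimator returns the observed rate unchanged, which is a quick sanity check on any implementation.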
Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels
We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N=53).
comment: 10 pages, 4 figures, 11 appendix pages, 7 appendix figures
Animating Language Practice: Engagement with Stylized Conversational Agents in Japanese Learning
We explore Jouzu, a Japanese language learning application that integrates large language models with anime-inspired conversational agents. Designed to address challenges learners face in practicing natural and expressive dialogue, Jouzu combines stylized character personas with expressive text-to-speech to create engaging conversational scenarios. We conducted a two-week in-the-wild deployment with 52 Japanese learners to examine how such stylized agents influence engagement and learner experience. Our findings show that participants interacted frequently and creatively, with advanced learners demonstrating greater use of expressive forms. Participants reported that the anime-inspired style made practice more enjoyable and encouraged experimenting with different registers. We discuss how stylization shapes willingness to engage, the role of affect in sustaining practice, and design opportunities for culturally grounded conversational AI in computer-assisted language learning (CALL). By framing our findings as an exploration of design and engagement, we highlight opportunities for generalization beyond Japanese contexts and contribute to international HCI scholarship.
Image and Video Processing
Diffusion-Based Synthesis of 3D T1w MPRAGE Images from Multi-Echo GRE with Multi-Parametric MRI Integration
Multi-echo Gradient Echo (mGRE) sequences provide valuable quantitative parametric maps, such as Quantitative Susceptibility Mapping (QSM) and transverse relaxation rate (R2*), sensitive to tissue iron and myelin. However, structural morphometry typically relies on separate T1-weighted MPRAGE acquisitions, prolonging scan times. We propose a deep learning framework to synthesize high-contrast 3D T1w MPRAGE images directly from mGRE data, streamlining neuroimaging protocols. We developed a novel multi-parametric conditional diffusion model based on the Fast-DDPM architecture. Unlike conventional intensity-based synthesis, our approach integrates iron-sensitive QSM and R2* maps as physical priors to address contrast ambiguity in iron-rich deep gray matter. We trained and validated the model on 175 healthy subjects. Performance was evaluated against established U-Net and GAN-based baselines using perceptual metrics and downstream segmentation accuracy. Uniquely, we assessed the biological plausibility of synthesized images by replicating population-level statistical associations with age and sex. The proposed framework significantly outperformed baselines, achieving superior perceptual quality and segmentation accuracy, particularly in subcortical regions like the thalamus and pallidum. Crucially, synthesized images preserved essential biological dependencies: regression analyses showed high concordance in age-related atrophy rates, aging effect sizes, and sexual dimorphism patterns compared to ground truth. By effectively leveraging quantitative MRI priors, our diffusion-based method generates strictly biologically plausible T1w images suitable for reliable clinical morphometric analysis. This approach offers a promising pathway to reduce acquisition time by deriving structural contrasts retrospectively from quantitative mGRE sequences.
Stronger is not better: Better Augmentations in Contrastive Learning for Medical Image Segmentation NeurIPS
Self-supervised contrastive learning is among the recent representation learning methods that have shown performance gains in several downstream tasks including semantic segmentation. This paper evaluates strong data augmentation, one of the most important components for self-supervised contrastive learning's improved performance. Strong data augmentation involves applying the composition of multiple augmentation techniques on images. Surprisingly, we find that the existing data augmentations do not always improve performance for semantic segmentation for medical images. We experiment with other augmentations that provide improved performance.
comment: NeurIPS Black in AI workshop - 2022
HQCNN: A Hybrid Quantum-Classical Neural Network for Medical Image Classification
Classification of medical images plays a vital role in medical image analysis; however, it remains challenging due to the limited availability of labeled data, class imbalances, and the complexity of medical patterns. To overcome these challenges, we propose a novel Hybrid Quantum-Classical Neural Network (HQCNN) for both binary and multi-class classification. The architecture of HQCNN integrates a five-layer classical convolutional backbone with a 4-qubit variational quantum circuit that incorporates quantum state encoding, superpositional entanglement, and a Fourier-inspired quantum attention mechanism. We evaluate the model on six MedMNIST v2 benchmark datasets. The HQCNN consistently outperforms classical and quantum baselines, achieving up to 99.91% accuracy and 100.00% AUC on PathMNIST (binary) and 99.95% accuracy on OrganAMNIST (multi-class) with strong robustness on noisy datasets like BreastMNIST (87.18% accuracy). The model demonstrates superior generalization capability and computational efficiency, accomplished with significantly fewer trainable parameters, making it suitable for data-scarce scenarios. Our findings provide strong empirical evidence that hybrid quantum-classical models can advance medical imaging tasks.
comment: A methodological error was identified in the Quantum Attention-Fourier Layer (Section 4.3), and an additional alignment error affecting parts of the results and figures was also detected. These issues lead to incorrect experimental reporting, and substantial corrections are required. Therefore, the current version is being withdrawn to prevent dissemination of inaccurate results
Multigranular Evaluation for Brain Visual Decoding AAAI 2026
Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground-truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for evaluating brain visual decoding methods.
comment: AAAI 2026 (Oral). Code: https://github.com/weihaox/BASIC
GuideGen: A Text-Guided Framework for Paired Full-torso Anatomy and CT Volume Generation AAAI 2026
The recently emerging conditional diffusion models seem promising for mitigating the labor and expenses in building large 3D medical imaging datasets. However, previous studies on 3D CT generation primarily focus on specific organs characterized by a local structure and fixed contrast and have yet to fully capitalize on the benefits of both semantic and textual conditions. In this paper, we present GuideGen, a controllable framework based on easily acquired text prompts to generate anatomical masks and corresponding CT volumes for the entire torso, from chest to pelvis. Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; an anatomy-aware high-dynamic-range (HDR) autoencoder for high-fidelity feature extraction across varying intensity levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics and input prompts. Combined, these components enable data synthesis for segmentation tasks from only textual instructions. To train and evaluate GuideGen, we compile a multi-modality cancer imaging dataset with paired CT and clinical descriptions from 12 public TCIA datasets and one private real-world dataset. Comprehensive evaluations across generation quality, cross-modality alignment, and data usability on multi-organ and tumor segmentation tasks demonstrate GuideGen's superiority over existing CT generation methods. Relevant materials are available at https://github.com/OvO1111/GuideGen.
comment: accepted as AAAI 2026 poster
Hyperparameters are all you need: Using five-step inference for an original diffusion model to generate images comparable to the latest distillation model
The diffusion model is a state-of-the-art generative model that samples images by applying a neural network iteratively. However, the original sampling algorithm incurs substantial computational cost, and reducing the number of sampling steps is an active research area. One mainstream approach to this problem is to treat the sampling process as an algorithm that solves an ordinary differential equation (ODE). Our study proposes a training-free inference plugin compatible with most few-step ODE solvers. To the best of our knowledge, our algorithm is the first training-free algorithm to sample a 1024 x 1024-resolution image in 6 steps and a 512 x 512-resolution image in 5 steps, with an FID result that outperforms the SOTA distillation models and the 20-step DPM++ 2m solver, respectively. Based on analyses of the latent diffusion model's structure, the diffusion ODE, and the Free-U mechanism, we explain why specific hyperparameter couplings improve stability and inference speed without retraining. The experimental results also reveal a new design space for latent diffusion ODE solvers. Additionally, we analyze the difference between the original diffusion model and the diffusion distillation model via an information-theoretic study, which shows why a few-step ODE solver designed for the diffusion model can outperform training-based diffusion distillation algorithms in few-step inference. Preliminary experimental results support the mathematical analysis. The code is available at https://github.com/TheLovesOfLadyPurple/Hyperparameter-is-all-you-need
comment: 21 pages, 8 figures
3D MedDiffusion: A 3D Medical Latent Diffusion Model for Controllable and High-quality Medical Image Generation
The generation of medical images presents significant challenges due to their high-resolution and three-dimensional nature. Existing methods often yield suboptimal performance in generating high-quality 3D medical images, and there is currently no universal generative framework for medical imaging. In this paper, we introduce a 3D Medical Latent Diffusion (3D MedDiffusion) model for controllable, high-quality 3D medical image generation. 3D MedDiffusion incorporates a novel, highly efficient Patch-Volume Autoencoder that compresses medical images into latent space through patch-wise encoding and recovers them in image space through volume-wise decoding. Additionally, we design a new noise estimator to capture both local details and global structural information during the diffusion denoising process. 3D MedDiffusion can generate fine-detailed, high-resolution images (up to 512x512x512) and effectively adapt to various downstream tasks, as it is trained on large-scale datasets covering CT and MRI modalities and different anatomical regions (from head to leg). Experimental results demonstrate that 3D MedDiffusion surpasses state-of-the-art methods in generative quality and exhibits strong generalizability across tasks such as sparse-view CT reconstruction, fast MRI reconstruction, and data augmentation for segmentation and classification. Source code and checkpoints are available at https://github.com/ShanghaiTech-IMPACT/3D-MedDiffusion.
Multimedia
Augmenting Intra-Modal Understanding in MLLMs for Robust Multimodal Keyphrase Generation
Multimodal keyphrase generation (MKP) aims to extract a concise set of keyphrases that capture the essential meaning of paired image-text inputs, enabling structured understanding, indexing, and retrieval of multimedia data across the web and social platforms. Success in this task demands effectively bridging the semantic gap between heterogeneous modalities. While multimodal large language models (MLLMs) achieve superior cross-modal understanding by leveraging massive pretraining on image-text corpora, we observe that they often struggle with modality bias and fine-grained intra-modal feature extraction. This weakness leads to a lack of robustness in real-world scenarios where multimedia data is noisy and modalities are incomplete or misaligned. To address this problem, we propose AimKP, a novel framework that explicitly reinforces intra-modal semantic learning in MLLMs while preserving cross-modal alignment. AimKP incorporates two core innovations: (i) Progressive Modality Masking, which forces fine-grained feature extraction from corrupted inputs by progressively masking modality information during training; and (ii) Gradient-based Filtering, which identifies and discards noisy samples, preventing them from corrupting the model's core cross-modal learning. Extensive experiments validate AimKP's effectiveness in multimodal keyphrase generation and its robustness across different scenarios.
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio-visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control with task reward prediction. This work presents the first formal framework for Audio-Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations, fine-grained actions, and task rewards. To address the lack of suitable training data, we construct AVW-4k, a dataset comprising 30 hours of binaural audio-visual trajectories with action annotations and reward signals across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities with reward. Furthermore, we validate its practical utility in continuous audio-visual navigation tasks, where AVWM significantly enhances the agent's performance.
Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks
Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.
Computation and Language
Mode-Conditioning Unlocks Superior Test-Time Scaling
Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4x efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.
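The abstract above reports Pass@k without saying how it is computed. A minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k randomly chosen samples is correct), might look like this. The function name is illustrative and not from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 minus the probability that all k chosen samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples, 5 correct, budget of 1 sample.
print(pass_at_k(n=10, c=5, k=1))
```

Under this estimator, diversity collapse shows up as a Pass@k curve that flattens early: once repeated samples share the same mistakes, raising k adds little.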
How do we measure privacy in text? A survey of text anonymization metrics AACL
In this work, we aim to clarify and reconcile metrics for evaluating privacy protection in text through a systematic survey. Although text anonymization is essential for enabling NLP research and model development in domains with sensitive data, evaluating whether anonymization methods sufficiently protect privacy remains an open challenge. In manually reviewing 47 papers that report privacy metrics, we identify and compare six distinct privacy notions, and analyze how the associated metrics capture different aspects of privacy risk. We then assess how well these notions align with legal privacy standards (HIPAA and GDPR), as well as user-centered expectations grounded in HCI studies. Our analysis offers practical guidance on navigating the landscape of privacy evaluation approaches further and highlights gaps in current practices. Ultimately, we aim to facilitate more robust, comparable, and legally aware privacy evaluations in text anonymization.
comment: 13 pages, 1 figure, 1 table. To be published in Findings of the Association for Computational Linguistics (AACL-IJCNLP 2025). Related resources at: https://github.com/ryxGuo/privacy-metrics-survey
Generalized Medical Phrase Grounding
Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence--anatomy box alignment datasets and fine-tuning on report sentence--human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
Testing the Machine Consciousness Hypothesis
The Machine Consciousness Hypothesis states that consciousness is a substrate-free functional property of computational systems capable of second-order perception. I propose a research program to investigate this idea in silico by studying how collective self-models (coherent, self-referential representations) emerge from distributed learning systems embedded within universal self-organizing environments. The theory outlined here starts from the supposition that consciousness is an emergent property of collective intelligence systems undergoing synchronization of prediction through communication. It is not an epiphenomenon of individual modeling but a property of the language that a system evolves to internally describe itself. For a model of base reality, I begin with a minimal but general computational world: a cellular automaton, which exhibits both computational irreducibility and local reducibility. On top of this computational substrate, I introduce a network of local, predictive, representational (neural) models capable of communication and adaptation. I use this layered model to study how collective intelligence gives rise to self-representation as a direct consequence of inter-agent alignment. I suggest that consciousness does not emerge from modeling per se, but from communication. It arises from the noisy, lossy exchange of predictive messages between groups of local observers describing persistent patterns in the underlying computational substrate (base reality). It is through this representational dialogue that a shared model arises, aligning many partial views of the world. The broader goal is to develop empirically testable theories of machine consciousness, by studying how internal self-models may form in distributed systems without centralized control.
ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages AACL 2025
We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite the models' capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context -- including background information about the languages, translation examples, and guidelines for cultural preservation -- leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
comment: Accepted at AACL 2025 (Main)
When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.
Associative Syntax and Maximal Repetitions reveal context-dependent complexity in fruit bat communication NeurIPS 2025
This study presents an unsupervised method to infer discreteness, syntax and temporal structures of fruit-bat vocalizations, as a case study of graded vocal systems, and evaluates the complexity of communication patterns in relation to behavioral context. The method improved the baseline for unsupervised labeling of vocal units (i.e. syllables) through manifold learning, by investigating how dimensionality reduction on mel-spectrograms affects labeling, and comparing it with unsupervised labels based on acoustic similarity. We then encoded vocalizations as syllabic sequences to analyze the type of syntax, and extracted the Maximal Repetitions (MRs) to evaluate syntactical structures. We found evidence for: i) associative syntax, rather than combinatorial (context classification is unaffected by permutation of sequences, F1 > 0.9); ii) context-dependent use of syllables (Wilcoxon rank-sum tests, p-value < 0.05); iii) heavy-tail distribution of MRs (truncated power-law, exponent α < 2), indicative of a mechanism encoding combinatorial complexity. Analysis of MRs and syllabic transition networks revealed that mother-pupil interactions were characterized by repetitions, while communication in conflict-contexts exhibited higher complexity (longer MRs and more interconnected vocal sequences) than non-agonistic contexts. We propose that communicative complexity is higher in scenarios of disagreement, reflecting lower compressibility of information.
comment: Accepted for a lightning talk at the NeurIPS 2025 Workshop: "AI for Non-Human Animal Communication"
Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.
Advancing Academic Chatbots: Evaluation of Non Traditional Outputs
Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies for QA, Graph RAG (structured, knowledge-graph based) and Advanced RAG (hybrid keyword-semantic search); and second, by evaluating whether LLMs can generate high-quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta's open-weight LLaMA 3 70B and OpenAI's API-based GPT 4o mini. QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross-validation. GPT 4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural complexity and manual setup. Slide and podcast generation was tested with document-grounded retrieval. GPT 4o mini again performed best, though LLaMA 3 showed promise in narrative coherence. Human reviewers were crucial for detecting layout and stylistic flaws, highlighting the need for combined human-LLM evaluation in assessing emerging academic outputs.
Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent
The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. We also propose Dr.Mi-Eval, a novel Modular-integrated Evaluation Paradigm for DR agents, which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and in performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
Table as a Modality for Large Language Models NeurIPS 2025
To carry over the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to generalize them to table reasoning tasks over widely deployed tabular data. Despite these efforts, through a probing experiment on our proposed StructQA benchmark, we postulate that even the most advanced LLMs (such as GPTs) may still fall short in coping with tabular data. More specifically, the current scheme often simply serializes the tabular data, together with the meta information, and feeds it to the LLM. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which treats tables as an independent modality integrated with the text tokens. The resulting model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements in generalization, with an average relative gain of 42.65%.
comment: Accepted to NeurIPS 2025
Fine-tuning of lightweight large language models for sentiment classification on heterogeneous financial textual data
Large language models (LLMs) play an increasingly important role in financial markets analysis by capturing signals from complex and heterogeneous textual data sources, such as tweets, news articles, reports, and microblogs. However, their performance is dependent on large computational resources and proprietary datasets, which are costly, restricted, and therefore inaccessible to many researchers and practitioners. To reflect realistic situations, we investigate the ability of lightweight open-source LLMs - smaller and publicly available models designed to operate with limited computational resources - to generalize sentiment understanding from financial datasets of varying sizes, sources, formats, and languages. We compare the benchmark finance natural language processing (NLP) model, FinBERT, and three open-source lightweight LLMs, DeepSeek-LLM 7B, Llama3 8B Instruct, and Qwen3 8B, on five publicly available datasets: FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment and Chinese Finance Sentiment. We find that LLMs, especially Qwen3 8B and Llama3 8B, perform best in most scenarios, even when using only 5% of the available training data. These results hold in zero-shot and few-shot learning scenarios. Our findings indicate that lightweight, open-source large language models (LLMs) constitute a cost-effective option, as they can achieve competitive performance on heterogeneous textual data even when trained on only a limited subset of the extensive annotated corpora that are typically deemed necessary.
DeformAr: Rethinking NER Evaluation through Component Analysis and Visual Analytics
Transformer models have significantly advanced Natural Language Processing (NLP), demonstrating strong performance in English. However, their effectiveness in Arabic, particularly for Named Entity Recognition (NER), remains limited, even with larger pre-trained models. This performance gap stems from multiple factors, including tokenisation, dataset quality, and annotation inconsistencies. Existing studies often analyze these issues in isolation, failing to capture their joint effect on system behaviour and performance. We introduce DeformAr (Debugging and Evaluation Framework for Transformer-based NER Systems), a novel framework designed to investigate and explain the performance discrepancy between Arabic and English NER systems. DeformAr integrates a data extraction library and an interactive dashboard, supporting two modes of evaluation: cross-component analysis and behavioural analysis. The framework divides each language into dataset and model components to examine their interactions. The analysis proceeds in two stages. First, cross-component analysis provides systematic diagnostic measures across data and model subcomponents, addressing the "what," "how," and "why" behind observed discrepancies. The second stage applies behavioural analysis by combining interpretability techniques with token-level metrics, interactive visualisations, and representation space analysis. These stages enable a component-aware diagnostic process that detects model behaviours and explains them by linking them to underlying representational patterns and data factors. DeformAr is the first Arabic-specific, component-based interpretability tool, offering a crucial resource for advancing model analysis in under-resourced languages.
comment: PhD Thesis, University of Sussex, 2025. 311 pages, 140 figures, 32 tables. Submitted as a PDF-only. First supervisor: Julie Weeds. Second supervisor: David Weir
Mitigating Hallucinations in Zero-Shot Scientific Summarisation: A Pilot Study
Large language models (LLMs) produce context inconsistency hallucinations, which are LLM-generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition involved the identification and repetition of K key sentences from the abstract, whereas random addition involved the repetition of K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity, which were used to compute the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effects of prompt methods on summary alignment with the reference text were tested. Statistical analysis on 3744 collected datapoints was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results demonstrated that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to impact hallucinations in zero-shot scientific summarisation tasks.
Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios
Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference perception accuracy in specific given scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We argue that the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer that RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing the distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios, and lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
Towards Active Synthetic Data Generation for Finetuning Language Models
A common and effective means for improving language model capabilities involves finetuning a "student" language model's parameters on generations from a more proficient "teacher" model. Termed "synthetic data", these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.
comment: 14 figures, 36 pages
Less is More: Resource-Efficient Low-Rank Adaptation
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), but it still incurs notable overhead and suffers from parameter interference in complex datasets. While recent works decouple LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA employs a unified A matrix across all transformer layers and introduces a runtime selective update of the B matrices to dynamically trade off the system resource budget against model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demonstrating improved efficiency and robustness.
comment: 18 pages, 7 figures
One Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces
Embedding spaces are fundamental to modern AI, translating raw data into high-dimensional vectors that encode rich semantic relationships. Yet, their internal structures remain opaque, with existing approaches often sacrificing semantic coherence for structural regularity or incurring high computational overhead to improve interpretability. To address these challenges, we introduce the Semantic Field Subspace (SFS), a geometry-preserving, context-aware representation that captures local semantic neighborhoods within the embedding space. We also propose SAFARI (SemAntic Field subspAce deteRmInation), an unsupervised, modality-agnostic algorithm that uncovers hierarchical semantic structures using a novel metric called Semantic Shift, which quantifies how semantics evolve as SFSes evolve. To ensure scalability, we develop an efficient approximation of Semantic Shift that replaces costly SVD computations, achieving a 15-30x speedup with average errors below 0.01. Extensive evaluations across six real-world text and image datasets show that SFSes outperform standard classifiers not only in classification but also in nuanced tasks such as political bias detection, while SAFARI consistently reveals interpretable and generalizable semantic hierarchies. This work presents a unified framework for structuring, analyzing, and scaling semantic understanding in embedding spaces.
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models
Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks. In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01% over state-of-the-art baselines at a watermark detectability strength of 95%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78% and 36.47%, respectively. Moreover, under different attack scenarios including insertion, synonym substitution, and paraphrase attacks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at https://github.com/Yukang-Lin/WaterSearch.
Accelerating Bangla NLP Tasks with Automatic Mixed Precision: Resource-Efficient Training Preserving Model Efficacy
Training models for Natural Language Processing (NLP) requires substantial computational resources and time, posing significant challenges, especially for NLP development in Bangla, where access to high-end hardware is often limited. In this work, we explore automatic mixed precision (AMP) training as a means to improve computational efficiency without sacrificing model performance. By leveraging a dynamic mix of 16-bit and 32-bit floating-point computations, AMP lowers GPU memory requirements and speeds up training without degrading model performance. We evaluate AMP across four standard Bangla NLP tasks, namely sentiment analysis, named entity recognition, error classification, and question answering, using four transformer-based models: BanglaBERT, BanglishBERT, XLM-R, and mBERT. Our results demonstrate that AMP accelerates training by 44.5% and reduces memory consumption by 17.6%, while maintaining F-1 score within 99.7% of the full-precision baselines. This empirical study highlights AMP's potential to democratize access to state-of-the-art NLP capabilities in hardware-constrained settings by lowering computational barriers.
Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation
Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, requiring extensive tuning and complicating deployment. We present Entropy Equilibrium Sampling (EES), an auxiliary hyperparameter-free approach inspired by information theory that can dynamically adjust candidate sets by balancing normalized entropy with probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES consistently performs well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By eliminating the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at https://github.com/shuanncai/EES
Text Mining Analysis of Symptom Patterns in Medical Chatbot Conversations
The rapid growth of digital health systems has created a need to better understand how they interpret and represent patient-reported symptoms. Chatbots are used in healthcare to provide clinical support and enhance the user experience, which makes it possible to extract meaningful clinical patterns from their text-based conversations. This research applies several natural language processing methods to study how symptom descriptions occur in conversations with medical chatbots and to analyse the patterns that emerge from them. Using the Medical Conversations to Disease Dataset, which contains 960 multi-turn dialogues across 24 clinical conditions, a standardised representation of patient-bot conversations is created for computational analysis. The multi-method approach uses Latent Dirichlet Allocation (LDA) to identify latent symptom themes, K-Means to group symptom descriptions by similarity, Transformer-based Named Entity Recognition (NER) to extract medical concepts, and the Apriori algorithm to discover frequent symptom pairs. The analysis reveals a coherent structure of clinically relevant topics, moderate clustering cohesiveness, and several high-confidence associations between symptoms, such as fever with headache and rash with itchiness. The results support the notion that conversational medical data can be a valuable diagnostic signal for early symptom interpretation, can strengthen decision support, and can improve how users interact with tele-health technology. By demonstrating a method for converting unstructured, free-flowing dialogue into actionable knowledge about symptoms, this work provides an extensible framework for improving the future performance, dependability, and clinical utility of medical chatbots.
comment: 9 pages, 4 tables
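The frequent symptom-pair mining mentioned in the abstract above can be illustrated with a minimal sketch of the pair-level step of Apriori: items below the support threshold are pruned first, then pairs are counted only among the surviving items. The function names (`frequent_pairs`, `rule_confidence`), the toy `dialogues` data, and the 0.5 threshold are all hypothetical and chosen for illustration; they are not taken from the paper or its dataset.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {(item_a, item_b): support} for pairs meeting min_support.

    Apriori pruning: a pair can only be frequent if both of its items are.
    """
    sets = [set(t) for t in transactions]
    n = len(sets)
    if n == 0:
        return {}
    item_counts = Counter(item for s in sets for item in s)
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for s in sets:
        # Sorting gives each pair one canonical ordering.
        pair_counts.update(combinations(sorted(s & frequent_items), 2))
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}


def rule_confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    with_antecedent = [set(t) for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


# Made-up symptom sets, one per dialogue (illustrative only).
dialogues = [
    {"fever", "headache"},
    {"fever", "headache", "cough"},
    {"rash", "itchiness"},
    {"fever", "cough"},
]
pairs = frequent_pairs(dialogues, min_support=0.5)
```

Confidence is directional, so `rule_confidence(dialogues, "headache", "fever")` and `rule_confidence(dialogues, "fever", "headache")` can differ even though both rules share the same pair support.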
FastPOS: Language-Agnostic Scalable POS Tagging Framework Low-Resource Use Case
This study proposes a language-agnostic transformer-based POS tagging framework designed for low-resource languages, using Bangla and Hindi as case studies. With only three lines of framework-specific code, the model was adapted from Bangla to Hindi, demonstrating effective portability with minimal modification. The framework achieves 96.85 percent and 97 percent token-level accuracy across POS categories in Bangla and Hindi, respectively, while sustaining strong F1 scores despite dataset imbalance and linguistic overlap. A performance discrepancy in a specific POS category underscores ongoing challenges in dataset curation. The strong results stem from the underlying transformer architecture, which can be replaced with limited code adjustments. Its modular and open-source design enables rapid cross-lingual adaptation while reducing model design and tuning overhead, allowing researchers to focus on linguistic preprocessing and dataset refinement, which are essential for advancing NLP in underrepresented languages.
Probing the "Psyche" of Large Reasoning Models: Understanding Through a Human Lens
Large reasoning models (LRMs) have garnered significant attention from researchers owing to their exceptional capability in addressing complex tasks. Motivated by the observed human-like behaviors in their reasoning processes, this paper introduces a comprehensive taxonomy to characterize atomic reasoning steps and probe the "psyche" of LRM intelligence. Specifically, it comprises five groups and seventeen categories derived from human mental processes, thereby grounding the understanding of LRMs in an interdisciplinary perspective. The taxonomy is then applied for an in-depth understanding of current LRMs, resulting in a distinct labeled dataset that comprises 277,534 atomic reasoning steps. Using this resource, we analyze contemporary LRMs and distill several actionable takeaways for improving training and post-training of reasoning models. Notably, our analysis reveals that prevailing post-answer "double-checks" (self-monitoring evaluations) are largely superficial and rarely yield substantive revisions. Thus, incentivizing comprehensive multi-step reflection, rather than simple self-monitoring, may offer a more effective path forward. To complement the taxonomy, an automatic annotation framework, named CAPO, is proposed to leverage large language models (LLMs) for generating the taxonomy-based annotations. Experimental results demonstrate that CAPO achieves higher consistency with human experts compared to baselines, facilitating a scalable and comprehensive analysis of LRMs from a human cognitive perspective. Together, the taxonomy, CAPO, and the derived insights provide a principled, scalable path toward understanding and advancing LRM reasoning.
comment: 13 pages
Comparative Evaluation of Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2
Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper empirically evaluates two open-source text-to-speech models, VITS and Style-BERT-VITS2 JP Extra (SBV2JE), on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate models across naturalness (mean opinion score and comparative mean opinion score), intelligibility (word error rate), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and shows slight preference in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.
comment: Accepted to IEEE UEMCON 2025
Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History AAAI 2026
Large language models require consistent behavioral patterns for safe deployment, yet there are indications of large variability that may lead to an unstable expression of personality traits in these models. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25 open-source models (1B-685B parameters) across 2 million+ responses. Using traditional (BFI, SD3) and novel LLM-adapted personality questionnaires, we systematically vary model size, personas, reasoning modes, question order or paraphrasing, and conversation history. Our findings challenge fundamental assumptions: (1) Question reordering alone can introduce large shifts in personality measurements; (2) Scaling provides limited stability gains: even 400B+ models exhibit standard deviations >0.3 on 5-point scales; (3) Interventions expected to stabilize behavior, such as reasoning and inclusion of conversation history, can paradoxically increase variability; (4) Detailed persona instructions produce mixed effects, with misaligned personas showing significantly higher variability than the helpful assistant baseline; (5) The LLM-adapted questionnaires, despite their improved ecological validity, exhibit instability comparable to human-centric versions. This persistent instability across scales and mitigation strategies suggests that current LLMs lack the architectural foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that current alignment strategies may be inadequate.
comment: Accepted at AAAI 2026, Track on AI Alignment
Generating Text from Uniform Meaning Representation AACL 2026
Uniform Meaning Representation (UMR) is a recently developed graph-based semantic representation, which expands on Abstract Meaning Representation (AMR) in a number of ways, in particular through the inclusion of document-level information and multilingual flexibility. In order to effectively adopt and leverage UMR for downstream tasks, efforts must be directed toward developing a UMR technological ecosystem. Though only a small number of UMR annotations have been produced to date, in this work, we investigate the first approaches to producing text from multilingual UMR graphs. Exploiting the structural similarity between UMR and AMR graphs and the wide availability of AMR technologies, we introduce (1) a baseline approach which passes UMR graphs to AMR-to-text generation models, (2) a pipeline approach which converts UMR to AMR and then applies AMR-to-text generation models, and (3) a fine-tuning approach for both foundation models and AMR-to-text generation models with UMR data. Our best performing models achieve multilingual BERTScores of 0.825 for English and 0.882 for Chinese, a promising indication of the effectiveness of fine-tuning approaches for UMR-to-text generation even with limited UMR data.
comment: Accepted to IJCNLP-AACL 2026
Do different prompting methods yield a common task representation in language models? NeurIPS 2025
Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar representations of the task? An improved understanding of task representation mechanisms would offer interpretability insights and may aid in steering models. We study this through function vectors (FVs), recently proposed as a mechanism to extract few-shot ICL task representations. We generalize FVs to alternative task presentations, focusing on short textual instruction prompts, and successfully extract instruction function vectors that promote zero-shot task accuracy. We find evidence that demonstration- and instruction-based function vectors leverage different model components, and offer several controls to dissociate their contributions to task performance. Our results suggest that different task prompting forms do not induce a common task representation through FVs but elicit different, partly overlapping mechanisms. Our findings offer principled support to the practice of combining instructions and task demonstrations, imply challenges in universally monitoring task inference across presentation forms, and encourage further examinations of LLM task inference mechanisms.
comment: 10 pages, 4 figures; presented at NeurIPS 2025
Quantifying Cognitive Bias Induction in LLM-Generated Content AACL 2025
Large language models (LLMs) are integrated into applications like shopping reviews, summarization, or medical diagnosis support, where their use affects human decisions. We investigate the extent to which LLMs expose users to biased content and demonstrate its effect on human decision-making. We assess five LLM families in summarization and news fact-checking tasks, evaluating the consistency of LLMs with their context and their tendency to hallucinate on a new self-updating dataset. Our findings show that LLMs expose users to content that changes the context's sentiment in 26.42% of cases (framing bias), hallucinate on 60.33% of post-knowledge-cutoff questions, and highlight context from earlier parts of the prompt (primacy bias) in 10.12% of cases, averaged across all tested models. We further find that humans are 32% more likely to purchase the same product after reading a summary of the review generated by an LLM rather than the original review. To address these issues, we evaluate 18 mitigation methods across three LLM families and find that targeted interventions are effective.
comment: 21 pages (including references and appendix), 3 figures. Accepted to AACL 2025
Extracting memorized pieces of (copyrighted) books from open-weight language models
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression in their training data. Drawing on both machine learning and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we extend a recent probabilistic extraction technique to measure memorization of 50 books in 17 open-weight LLMs. Through thousands of experiments, we show that the extent of memorization varies both by model and by book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books, either in whole or in part. However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter is so memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations NeurIPS 2025
Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no semantic equivalence or semantic coherence errors compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
comment: Accepted at NeurIPS 2025. Code is available at https://github.com/Buyun-Liang/SECA
Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration AACL 2025
Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Extensive experiments across four models and five widely used benchmarks demonstrate that CoQ achieves substantial accuracy improvements and significantly lowers invalid SQL rates compared to prior generic LLM-based, SQL-aided, and hybrid baselines, confirming its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.
comment: AACL 2025 Main Conference (Oral)
Confident RAG: Enhancing the Performance of LLMs for Mathematics Question Answering through Multi-Embedding and Confidence Scoring
Large Language Models (LLMs) hold significant promise for mathematics education, yet they often struggle with complex mathematical reasoning. While Retrieval-Augmented Generation (RAG) mitigates these issues by grounding LLMs in external knowledge, its effectiveness remains unstable, heavily dependent on the choice of a single embedding model. Moving beyond static RAG workflows, we draw on agentic workflow patterns, a paradigm that introduces structured task decomposition and collaboration to enhance system performance. We propose and examine two novel approaches that combine the benefits of multiple embedding models. While our Mixture-Embedding RAG approach (fusing retrieved documents) shows limited gains, our Confident RAG method (generating multiple answers and selecting the one with the highest confidence score) demonstrates significant improvement. Experimental results show that Confident RAG achieved average accuracy improvements of approximately 10% over vanilla LLMs and 5% over vanilla RAG. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play solution for trustworthy mathematical AI assistants. Finally, we discuss how this work lays the groundwork for deploying Agentic RAG systems in educational settings, where autonomous planning and iterative refinement can be built upon our robust retrieval foundation.
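The selection step described for Confident RAG (generate several answers, keep the one with the highest confidence score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the candidate answers, and the confidence values are all hypothetical, and how confidence is actually computed is not specified in the abstract.

```python
from typing import Sequence, Tuple


def select_most_confident(candidates: Sequence[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the (answer, confidence) pair with the highest confidence.

    Each candidate is assumed to come from one embedding model's
    retrieval-and-generation run (hypothetical setup).
    """
    if not candidates:
        raise ValueError("at least one candidate answer is required")
    # max() keeps the first candidate when confidences tie.
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical answers produced with three different embedding models.
candidates = [("x = 2", 0.61), ("x = 3", 0.87), ("x = 2.5", 0.42)]
best_answer, best_score = select_most_confident(candidates)
```

In this sketch the confidence score is taken as given; any scalar that ranks answers could be plugged in.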
Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (Text-only, Image-only, Text+Figure) and prompt languages (Korean, English). The GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations, while Grok 4, Qwen 3 235B, and Gemini 2.5 Pro also scored above 97 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed Calculus as the weakest domain, with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with the GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a standardized digitization pipeline that converts human-targeted exam materials into LLM-ready evaluation data, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard: https://isoft.cnu.ac.kr/csat2026/
comment: 52 pages
Event2Vec: A Geometric Approach to Learning Composable Representations of Event Sequences
The study of neural representations, both in biological and artificial systems, is increasingly revealing the importance of geometric and topological structures. Inspired by this, we introduce Event2Vec, a novel framework for learning representations of discrete event sequences. Our model leverages a simple, additive recurrent structure to learn composable, interpretable embeddings. We provide a theoretical analysis demonstrating that, under specific training objectives, our model's learned representations in a Euclidean space converge to an ideal additive structure. This ensures that the representation of a sequence is the vector sum of its constituent events, a property we term the linear additive hypothesis. To address the limitations of Euclidean geometry for hierarchical data, we also introduce a variant of our model in hyperbolic space, which is naturally suited to embedding tree-like structures with low distortion. We present experiments to validate our hypothesis. Quantitative evaluation on the Brown Corpus yields a Silhouette score of 0.0564, outperforming a Word2Vec baseline (0.0215), demonstrating the model's ability to capture structural dependencies without supervision.
comment: 10 pages, 3 figures, Symmetry and Geometry in Neural Representations Workshop at NeurIPS (NeurReps) 2025
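The linear additive hypothesis stated in the Event2Vec abstract (a sequence's representation is the vector sum of its events' embeddings) can be illustrated with a small sketch. The embedding table, event names, and function name below are hypothetical; the actual model learns its embeddings, which this example does not attempt.

```python
from typing import Dict, List, Sequence


def sequence_embedding(events: Sequence[str], table: Dict[str, List[float]]) -> List[float]:
    """Sum the per-event vectors to obtain the sequence representation."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for event in events:
        for i, value in enumerate(table[event]):
            total[i] += value
    return total


# Hypothetical 2-dimensional embeddings for three events.
table = {"open": [1.0, 0.0], "edit": [0.0, 2.0], "save": [0.5, 0.5]}

whole = sequence_embedding(["open", "edit", "save"], table)
prefix = sequence_embedding(["open", "edit"], table)
suffix = sequence_embedding(["save"], table)
# Composability: the embedding of a concatenation equals the sum of the parts.
composed = [p + s for p, s in zip(prefix, suffix)]
```

Under this additive structure, splitting a sequence anywhere and summing the two partial embeddings gives the same vector as embedding the whole sequence.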
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
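The 103% figure quoted for SRPO follows from the two success rates in the abstract. A short check, using a hypothetical helper name:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Success rates reported in the abstract: 48.9% (supervised baseline) and 99.2% (after RL).
gain = relative_improvement(48.9, 99.2)
print(f"{gain:.1f}% relative improvement")
```

The absolute gain is 50.3 percentage points; dividing by the 48.9% baseline gives roughly 103%.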
Black-Box On-Policy Distillation of Large Language Models
Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness--one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, leaving a significant gap to close. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
comment: Dataset, Model, and Code: https://github.com/AmphionTeam/SpeechJudge
Reasoning-Intensive Regression
AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks, e.g., for sentiment or similarity, RiR often appears instead in ad-hoc problems such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.
Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs
Automating the translation of natural language to first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques like vocabulary extension, predicate conditioning, and multilingual training, introducing metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o and even the DeepSeek-R1-0528 model with CoT reasoning ability as well as symbolic systems like ccg2lambda. Key findings show: (1) predicate availability boosts performance by 15-20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (FOLIO dataset) without specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
comment: 15 pages, 7 tables, accepted at the International Joint Conference on Learning & Reasoning (IJCLR 2025)
ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents AAAI2026
Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding sellers that offer multiple products. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.
comment: Submitted to AAAI 2026
Explainable Semantic Text Relations: A Question-Answering Framework for Comparing Document Content
Understanding semantic relations between two texts is crucial for many information and document management tasks, in which one must determine whether the content fully overlaps, is completely superseded by another document, or overlaps only partially, with unique information in each. Beyond establishing this relation, it is equally important to provide explainable outputs that specify which pieces of information are present, missing, or newly added between the text pair. In this study, we formally define semantic relations between two texts through the set-theoretic relation between their respective Answerable Question Sets (AQS), the sets of questions each text can answer. Under this formulation, Semantic Text Relation (STR), such as equivalence, inclusion, and mutual overlap, becomes a well-defined set relation between the corresponding texts' AQSs. The set differences between the AQSs also serve as an explanation or diagnostic tool for identifying how the information in the texts diverges. Using this definition, we construct a synthetic benchmark that captures fine-grained informational relations through controlled paraphrasing and deliberate information removal supported by AQS manipulations. We then use this dataset to evaluate several discriminative and generative models for classifying text pairs into STR categories, assessing how well different model architectures capture semantic relations beyond surface-level similarity. We publicly release both the dataset and the data generation code to support further research.
comment: 18 pages, 1 figure
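The set-theoretic definition in the abstract above (a Semantic Text Relation is a set relation between two Answerable Question Sets) maps naturally onto set operations. The sketch below is an illustration under assumptions: the label strings, the "disjoint" fallback category, and the example questions are hypothetical and not taken from the paper.

```python
from typing import Set


def semantic_text_relation(aqs_a: Set[str], aqs_b: Set[str]) -> str:
    """Classify the relation between two texts from their AQSs."""
    if aqs_a == aqs_b:
        return "equivalence"
    if aqs_a < aqs_b:  # proper subset: everything A answers, B answers too
        return "a_included_in_b"
    if aqs_b < aqs_a:
        return "b_included_in_a"
    if aqs_a & aqs_b:
        return "mutual_overlap"
    return "disjoint"  # assumed extra category for texts sharing no questions


# Hypothetical question sets for two short texts.
aqs_a = {"Who signed the contract?", "When was it signed?"}
aqs_b = {"When was it signed?", "Where was it signed?"}

relation = semantic_text_relation(aqs_a, aqs_b)
# Set differences act as the explanation: what each text answers that the other does not.
only_in_a = aqs_a - aqs_b
only_in_b = aqs_b - aqs_a
```

The two difference sets are what makes the output explainable: they name the information present in one text and missing from the other.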
Efficient LLM-Jailbreaking via Multimodal-LLM Jailbreak
This paper focuses on jailbreaking attacks against large language models (LLMs), eliciting them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that target LLMs directly, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. Subsequently, we perform an efficient MLLM jailbreak and obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix to carry out the jailbreak of the target LLM. Compared to direct LLM-jailbreak methods, our indirect jailbreaking approach is more efficient, as MLLMs are more vulnerable to jailbreaks than pure LLMs. Additionally, to improve the attack success rate, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class generalization abilities.
Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations
While LLMs appear robustly safety-aligned in English, we uncover a catastrophic, overlooked weakness: attributional collapse under code-mixed perturbations. Our systematic evaluation of open models shows that the linguistic camouflage of code-mixing -- "blending languages within a single conversation" -- can cause safety guardrails to fail dramatically. Attack success rates (ASR) spike from a benign 9% in monolingual English to 69% under code-mixed inputs, with rates exceeding 90% in non-Western contexts such as Arabic and Hindi. These effects hold not only on controlled synthetic datasets but also on real-world social media traces, revealing a serious risk for billions of users. To explain why this happens, we introduce saliency drift attribution (SDA), an interpretability framework that shows how, under code-mixing, the model's internal attention drifts away from safety-critical tokens (e.g., "violence" or "corruption"), effectively blinding it to harmful intent. Finally, we propose a lightweight translation-based restoration strategy that recovers roughly 80% of the safety lost to code-mixing, offering a practical path toward more equitable and robust LLM safety.
Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training NeurIPS 2025
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (Nemotron-CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, Nemotron-CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce Nemotron-ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and Nemotron-ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
comment: Accepted to NeurIPS 2025
COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation
Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both for trustworthy deployment and for scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8 to 5.8 percent absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward scientific understanding of LLM behavior.
comment: 9 pages, 6 figures including algorithms, 2 tables
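For readers unfamiliar with the control loop named in the COMPASS abstract, a textbook discrete PID update is sketched below. It is generic control code, not the COMPASS implementation: the gains, the target value, and the idea of feeding a context-reliance reading in as the measured signal are assumptions for illustration, and the abstract does not specify how the controller output is applied to attention heads.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller with proportional, integral, and derivative terms."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    previous_error: float = 0.0

    def update(self, target: float, measured: float, dt: float = 1.0) -> float:
        error = target - measured
        self.integral += error * dt
        # On the first call previous_error is 0, so the derivative term sees the full error.
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains and readings: a desired reliance level of 0.8 and an observed level of 0.6.
controller = PIDController(kp=0.5, ki=0.1, kd=0.0)
correction = controller.update(target=0.8, measured=0.6)
```

A positive correction here means the measured signal is below target; a decoding-time system would translate that into a stronger emphasis on context, by whatever mechanism it defines.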
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
Template-assisted Contrastive Learning of Task-oriented Dialogue Sentence Embeddings
Learning high-quality sentence embeddings from dialogues has drawn increasing attention, as it is essential for solving a variety of dialogue-oriented tasks with low annotation cost. Annotating and gathering utterance relationships in conversations is difficult, while token-level annotations, e.g., entities, slots, and templates, are much easier to obtain. Other sentence embedding methods are usually sentence-level self-supervised frameworks and cannot utilize token-level extra knowledge. We introduce Template-aware Dialogue Sentence Embedding (TaDSE), a novel augmentation method that utilizes template information to learn utterance embeddings via a self-supervised contrastive learning framework. We further enhance the effect with a synthetically augmented dataset that diversifies utterance-template association, in which slot-filling is a preliminary step. We evaluate TaDSE performance on five downstream benchmark dialogue datasets. The experimental results show that TaDSE achieves significant improvements over previous SOTA methods for dialogue. We further introduce a novel analytic instrument, the semantic compression test, for which we discover a correlation with uniformity and alignment. Our code will be released upon acceptance.
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models NeurIPS 2025
As AI becomes more closely integrated with people's daily activities, socially intelligent AI that can understand and interact seamlessly with humans in daily life is increasingly important. However, current works in AI social reasoning all rely on language-only or language-dominant approaches to benchmark and train models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel data source rich in nonverbal social interactions -- mime videos. Mime refers to the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing ~8 hours of video clips from YouTube and developing a comprehensive video question-answering benchmark comprising 806 carefully annotated and verified question-answer pairs, designed to probe nonverbal social reasoning capabilities. Using MimeQA, we evaluate state-of-the-art video large language models (VideoLLMs) and find that they achieve low accuracy, generally ranging from 20-30%, while humans score 86%. Our analysis reveals that VideoLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. We hope to inspire future work in AI models that embody true social intelligence capable of interpreting nonverbal human interactions.
comment: NeurIPS 2025 Datasets and Benchmarks
Adaptive Margin RLHF via Preference over Preferences
Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences (for example, some preferences are associated with larger margins between responses), or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is, annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies to gather preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.
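To make the role of a per-datapoint margin concrete, the sketch below computes the standard DPO loss for one preference pair with a margin subtracted from the scaled log-ratio difference. This is one common way to add a margin and is an assumption here: the abstract does not give the DPO-PoP objective or say how margins are inferred, and the function name, argument names, and example numbers are hypothetical.

```python
import math


def dpo_margin_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
    margin: float = 0.0,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio) - margin)) for one pair."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio) - margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch on sign for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical sequence log-probabilities for one preference pair.
no_margin = dpo_margin_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1, margin=0.0)
with_margin = dpo_margin_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1, margin=0.5)
```

With `margin=0.0` this reduces to vanilla DPO; a larger margin raises the loss for the same pair, so the policy must separate the two responses further before the pair counts as satisfied.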
Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external "working memory", a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
comment: Project page at https://ajmeek.github.io/cot_monitorability_website/
Human-Computer Interaction
Ethically-Aware Participatory Design of a Productivity Social Robot for College Students
College students often face academic and life stressors affecting productivity, especially students with Attention Deficit Hyperactivity Disorder (ADHD) who experience executive functioning challenges. Conventional productivity tools typically demand sustained self-discipline and consistent use, which many students struggle with, leading to disruptive app-switching behaviors. Socially Assistive Robots (SARs), known for their intuitive and interactive nature, offer promising potential to support productivity in academic environments, having been successfully utilized in domains like education, cognitive development, and mental health. To leverage SARs effectively in addressing student productivity, this study employed a Participatory Design (PD) approach, directly involving college students and a Student Success and Well-Being Coach in the design process. Through interviews and a collaborative workshop, we gathered detailed insights on productivity challenges and identified desirable features for a productivity-focused SAR. Importantly, ethical considerations were integrated from the onset, facilitating responsible and user-aligned design choices. Our contributions include comprehensive insights into student productivity challenges, SAR design preferences, and actionable recommendations for effective robot characteristics. Additionally, we present stakeholder-derived ethical guidelines to inform responsible future implementations of productivity-focused SARs in higher education.
Supporting Productivity Skill Development in College Students through Social Robot Coaching: A Proof-of-Concept
College students often face academic challenges that hamper their productivity and well-being. Although self-help books and productivity apps are popular, they often fall short. Books provide generalized, non-interactive guidance, and apps are not inherently educational and can hinder the development of key organizational skills. Traditional productivity coaching offers personalized support, but is resource-intensive and difficult to scale. In this study, we present a proof-of-concept for a socially assistive robot (SAR) as an educational coach and a potential solution to the limitations of existing productivity tools and coaching approaches. The SAR delivers six different lessons on time management and task prioritization. Users interact via a chat interface, while the SAR responds through speech (with a toggle option). An integrated dashboard monitors progress, mood, engagement, confidence per lesson, and time spent per lesson. It also offers personalized productivity insights to foster reflection and self-awareness. We evaluated the system with 15 college students, achieving a System Usability Score of 79.2 and high ratings for overall experience and engagement. Our findings suggest that SAR-based productivity coaching can offer an effective and scalable solution to improve productivity among college students.
ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages AACL 2025
We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The resulting dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite the models' capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context -- including background information about the languages, translation examples, and guidelines for cultural preservation -- leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
comment: Accepted at AACL 2025 (Main)
Graph Queries from Natural Language using Constrained Language Models and Visual Editing
Querying knowledge bases using ontologies is usually performed using dedicated query languages, question-answering systems, or visual query editors for Knowledge Graphs. We propose a novel approach that enables users to query the knowledge graph by specifying prototype graphs in natural language and visually editing them. This approach enables non-experts to formulate queries without prior knowledge of the ontology and specific query languages. Our approach converts natural language to these prototype graphs by utilizing a two-step constrained language model generation based on semantically similar features within an ontology. The resulting prototype graph serves as the building block for further user refinements within a dedicated visual query builder. Our approach consistently generates a valid SPARQL query within the constraints imposed by the ontology, without requiring any additional corrections to the syntax or to the classes and links used. Related language-model approaches, by contrast, often require multiple iterations to fix invalid syntax, non-existent classes, and non-existent links. We evaluate the performance of our system using graph retrieval on synthetic queries, comparing multiple metrics, models, and ontologies. We further validate our system through a preliminary user study. By utilizing our constrained pipeline, we show that the system can perform efficient and accurate retrieval using more efficient models compared to other approaches.
comment: 8 pages, 3 figures, 3 tables, accepted and presented at ICKG 2025
The AI Attribution Paradox: Transparency as Social Strategy in Open-Source Software Development
AI coding assistants have transformed software development, raising questions about transparency and attribution practices. We examine the "AI attribution paradox": how developers strategically balance acknowledging AI assistance with managing community scrutiny. Analyzing 14,300 GitHub commits across 7,393 repositories from 2023-2025, we investigated attribution strategies and community responses across eight major AI tools. Results reveal widespread AI usage (95.2% of commits) but strategic attribution: only 29.5% employ explicit disclosure, with dramatic tool variation (Claude 80.5% versus Copilot 9.0%). Explicit attribution triggers modest scrutiny (23% more questions and 21% more comments) but tool choice matters 20-30 times more for predicting reception. Community sentiment remains neutral regardless of attribution type, suggesting curiosity rather than hostility. Temporal analyses show rapid norm evolution: explicit attribution increased from near-zero in early 2024 to 40% by late 2025, indicating community adaptation. These findings illuminate attribution as strategic communication rather than simple transparency, advancing understanding of algorithmic accountability and norm formation during technological transitions. We discuss implications for developers navigating disclosure decisions, platforms designing attribution mechanisms, and researchers studying emergent practices in AI-augmented collaborative work.
comment: 23 pages, 7 figures, 9 tables
HAI-Eval: Measuring Human-AI Synergy in Collaborative Coding
LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift. They remain focused on well-defined algorithmic problems, which excludes problems where success depends on human-AI collaboration. Such collaborative problems not only require human reasoning to interpret complex contexts and guide solution strategies, but also demand AI efficiency for implementation. To bridge this gap, we introduce HAI-Eval, a unified benchmark designed to measure the synergy of human-AI partnership in coding. HAI-Eval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for both standalone LLMs and unaided humans, but solvable through effective collaboration. Specifically, HAI-Eval uses 45 templates to dynamically create tasks. It also provides a standardized IDE for human participants and a reproducible toolkit with 450 task instances for LLMs, ensuring an ecologically valid evaluation. We conduct a within-subject study with 45 participants and benchmark their performance against 5 state-of-the-art LLMs under 4 different levels of human intervention. Results show that standalone LLMs and unaided participants achieve poor pass rates (0.67% and 18.89%, respectively), whereas human-AI collaboration significantly improves performance to 31.11%. Our analysis reveals an emerging co-reasoning partnership. This finding challenges the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. HAI-Eval establishes not only a challenging benchmark for next-generation coding agents but also a grounded, scalable framework for assessing core developer competencies in the AI era. Our benchmark and interactive demo will be openly accessible.
ANSR-DT: An Adaptive Neuro-Symbolic Learning and Reasoning Framework for Digital Twins
In this paper, we propose an Adaptive Neuro-Symbolic Learning and Reasoning Framework for digital twin technology called "ANSR-DT." Digital twins in industrial environments often struggle with interpretability, real-time adaptation, and human input integration. Our approach addresses these challenges by combining CNN-LSTM dynamic event detection with reinforcement learning and symbolic reasoning to enable adaptive intelligence with interpretable decision processes. This integration enhances environmental understanding while promoting continuous learning, leading to more effective real-time decision-making in human-machine collaborative applications. We evaluated ANSR-DT on synthetic industrial data, observing significant improvements over traditional approaches, with up to 99.5% accuracy for dynamic pattern recognition. The framework demonstrated superior adaptability with extended reinforcement learning training, improving explained variance from 0.447 to 0.547. Future work aims at scaling to larger datasets to test rule management beyond the current 14 rules. Our open-source implementation promotes reproducibility and establishes a foundation for future research in adaptive, interpretable digital twins for industrial applications.
comment: Code available at https://github.com/sbhakim/ansr-dt
PACEE: Supporting Children's Personal Emotion Education through Parent-AI Collaboration
Emotion education is a crucial lesson for children aged 3 to 6. However, existing technologies primarily focus on promoting emotion education from the child's perspective, often neglecting the central role of parents in guiding early childhood emotion development at home. In this work, we conducted co-design sessions with five experienced kindergarten teachers and five parents to identify parental challenges and the roles that AI can play in family emotion education. Guided by these insights, we developed PACEE, an assistant for supporting parent-AI collaborative emotion education. PACEE enables parents to engage in conversations about common emotional scenarios, with multiple forms of AI support to address parents' challenges. It combines insights from parents and AI to model children's emotional states and delivers personalized, parent-mediated guidance. In a user study involving 16 families, we found that PACEE significantly enhances parent-child engagement, encourages more in-depth emotional communication, and improves the parental experience. Our findings advance emotion coaching guidelines for family education in the era of generative AI, offering valuable insights for designing AI-supported, parent-centered family education systems.
Intents, Techniques, and Components: a Unified Analysis of Interaction Authoring Tasks in Data Visualization
There is a growing interest in designing tools to support interactivity specification and authoring in data visualization. To develop expressive and flexible tools, we need theories and models that describe the task space of interaction authoring. Although multiple taxonomies and frameworks exist for interactive visualization, they primarily focus on how visualizations are used, not how interactivity is composed. To fill this gap, we conduct an analysis of 592 interaction units from 47 real-world visualization applications. Based on the analysis, we present a unified analysis of interaction authoring tasks across three levels of description: intents, representative techniques, and low-level implementation components. We examine our framework's descriptive, evaluative, and generative powers for critiquing existing interactivity authoring tools and informing new tool development.
Embodied Natural Language Interaction (NLI): Speech Input Patterns in Immersive Analytics
Embodiment shapes how users verbally express intent when interacting with data through speech interfaces in immersive analytics. Despite growing interest in Natural Language Interaction (NLI) for visual analytics in immersive environments, users' speech patterns and their use of embodiment cues in speech remain underexplored. Understanding their interplay is crucial to bridging the gap between users' intent and an immersive analytic system. To address this, we report the results from 15 participants in a user study conducted using the Wizard of Oz method. We performed axial coding on 1,280 speech acts derived from 734 utterances, examining how analysis tasks are carried out with embodiment and linguistic features. Next, we measured speech input uncertainty for each analysis task using the semantic entropy of utterances, estimating how uncertain users' speech inputs appear to an analytic system. Through these analyses, we identified five speech input patterns, showing that users dynamically blend embodied and non-embodied speech acts depending on data analysis tasks, phases, and embodiment reliance driven by the counts and types of embodiment cues in each utterance. We then examined how these patterns align with user reflections on factors that challenge speech interaction during the study. Finally, we propose design implications aligned with the five patterns.
Image and Video Processing
Large-field-of-view lensless imaging with miniaturized sensors
Lensless cameras replace bulky optics with thin modulation masks, enabling compact imaging systems. However, existing methods rely on an idealized model that assumes a globally shift-invariant point spread function (PSF) and sufficiently large sensors. In reality, the PSF varies spatially across the field of view (FOV), and finite sensor boundaries truncate modulated light--effects that intensify as sensors shrink, degrading peripheral reconstruction quality and limiting the effective FOV. We address these limitations through a local-to-global hierarchical framework grounded in a locally shift-invariant convolution model that explicitly accounts for PSF variation and sensor truncation. Patch-wise learned deconvolution first adaptively estimates local PSFs and reconstructs regions independently. A hierarchical enhancement network then progressively expands its receptive field--from small patches through intermediate blocks to the full image--integrating fine local details with global contextual information. Experiments on public datasets show that our method achieves superior reconstruction quality over a larger effective FOV with significantly reduced sensor sizes. Under extreme miniaturization--sensors reduced to 8% of the original area--we achieve improvements of 2 dB (PSNR) and 5% (SSIM), with particularly notable gains in structural fidelity. Code is available at https://github.com/KB504-public/l2g_lensless_imaging .
Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement
We study on-device time-series analysis for gait detection in Parkinson's disease (PD) from short windows of triaxial acceleration, targeting resource-constrained wearables and edge nodes. We compare magnitude thresholding to three 1D CNNs for time-series analysis: a literature baseline (separable convolutions) and two ultra-light models - one purely separable and one with residual connections. Using the BioStampRC21 dataset, 2 s windows at 30 Hz, and subject-independent leave-one-subject-out (LOSO) validation on 16 PwPD with chest-worn IMUs, our residual separable model (Model 2, 533 params) attains PR-AUC = 94.5%, F1 = 91.2%, MCC = 89.4%, matching or surpassing the baseline (5,552 params; PR-AUC = 93.7%, F1 = 90.5%, MCC = 88.5%) with approximately 10x fewer parameters. The smallest model (Model 1, 305 params) reaches PR-AUC = 94.0%, F1 = 91.0%, MCC = 89.1%. Thresholding obtains high recall (89.0%) but low precision (76.5%), yielding many false positives and high inter-subject variance. Sensor-position analysis (train-on-all) shows chest and thighs are most reliable; forearms degrade precision/recall due to non-gait arm motion; naive fusion of all sites does not outperform the best single site. Both compact CNNs execute within tight memory/latency budgets on STM32-class MCUs (sub-10 ms on low-power boards), enabling on-sensor gating of transmission/storage. Overall, ultra-light separable CNNs provide a superior accuracy-efficiency-generalization trade-off to fixed thresholds for wearable PD gait detection and underscore the value of tailored time-series models for edge deployment.
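The window length and parameter ratio quoted in this abstract follow from simple arithmetic, and the saving that separable convolutions offer can be illustrated with a parameter count. The sketch below is only an illustration: the layer sizes (8 input channels, 16 output channels, kernel width 5) and the helper names are hypothetical and are not taken from the paper's models.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameter count of a standard 1D convolution layer."""
    return kernel * c_in * c_out + (c_out if bias else 0)


def separable_conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameter count of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Figures stated in the abstract: 2 s windows at 30 Hz, 5,552 vs. 533 parameters.
samples_per_window = int(2 * 30)   # samples per axis in one window
param_ratio = 5552 / 533           # baseline size relative to Model 2

# Hypothetical layer, chosen only to show the effect of separability.
standard = conv1d_params(c_in=8, c_out=16, kernel=5)
separable = separable_conv1d_params(c_in=8, c_out=16, kernel=5)

print(samples_per_window, round(param_ratio, 1), standard, separable)
```

For this hypothetical layer the separable form needs 192 parameters against 656 for the standard convolution, which is the kind of reduction that makes sub-1,000-parameter models plausible.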
MedCondDiff: Lightweight, Robust, Semantically Guided Diffusion for Medical Image Segmentation
We introduce MedCondDiff, a diffusion-based framework for multi-organ medical image segmentation that is efficient and anatomically grounded. The model conditions the denoising process on semantic priors extracted by a Pyramid Vision Transformer (PVT) backbone, yielding a semantically guided and lightweight diffusion architecture. This design improves robustness while reducing both inference time and VRAM usage compared to conventional diffusion models. Experiments on multi-organ, multi-modality datasets demonstrate that MedCondDiff delivers competitive performance across anatomical regions and imaging modalities, underscoring the potential of semantically guided diffusion models as an effective class of architectures for medical imaging tasks.
Comparative Evaluation of Generative AI Models for Chest Radiograph Report Generation in the Emergency Department
Purpose: To benchmark open-source or commercial medical image-specific VLMs against real-world radiologist-written reports. Methods: This retrospective study included adult patients who presented to the emergency department between January 2022 and April 2025 and underwent same-day CXR and CT for febrile or respiratory symptoms. Reports from five VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) and radiologist-written reports were randomly presented and blindly evaluated by three thoracic radiologists using four criteria: RADPEER, clinical acceptability, hallucination, and language clarity. Comparative performance was assessed using generalized linear mixed models, with radiologist-written reports treated as the reference. Finding-level analyses were also performed with CT as the reference. Results: A total of 478 patients (median age, 67 years [interquartile range, 50-78]; 282 men [59.0%]) were included. AIRead demonstrated the lowest RADPEER 3b rate (5.3% [76/1434] vs. radiologists 13.9% [200/1434]; P<.001), whereas other VLMs showed higher disagreement rates (16.8-43.0%; P<.05). Clinical acceptability was highest with AIRead (84.5% [1212/1434] vs. radiologists 74.3% [1065/1434]; P<.001), while other VLMs performed worse (41.1-71.4%; P<.05). Hallucinations were rare with AIRead, comparable to radiologists (0.3% [4/1425] vs. 0.1% [1/1425]; P=.21), but frequent with other models (5.4-17.4%; P<.05). Language clarity was higher with AIRead (82.9% [1189/1434]), Lingshu (88.0% [1262/1434]), and MedVersa (88.4% [1268/1434]) compared with radiologists (78.1% [1120/1434]; P<.05). Sensitivity varied substantially across VLMs for the common findings: AIRead, 15.5-86.7%; Lingshu, 2.4-86.7%; MAIRA-2, 6.0-72.0%; MedGemma, 4.8-76.7%; and MedVersa, 20.2-69.3%. Conclusion: Medical VLMs for CXR report generation exhibited variable performance in report quality and diagnostic measures.
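The percentages in this abstract can be checked against the bracketed counts. The short sketch below recomputes a few of them; the reading of the 1434 denominator as 478 patients times 3 readers is an assumption that matches the numbers but is not stated explicitly in the abstract, and the helper name `pct` is introduced here only for illustration.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * numerator / denominator, 1)


# Assumption: each of the 3 radiologists rated one report per patient.
ratings_per_source = 478 * 3

# (numerator, denominator) pairs copied from the abstract.
reported = {
    "men": (282, 478),
    "AIRead RADPEER 3b": (76, 1434),
    "radiologist RADPEER 3b": (200, 1434),
    "AIRead acceptability": (1212, 1434),
    "radiologist acceptability": (1065, 1434),
    "AIRead hallucination": (4, 1425),
}

recomputed = {name: pct(n, d) for name, (n, d) in reported.items()}
for name, value in recomputed.items():
    print(f"{name}: {value}%")
```

The recomputed values agree with the stated percentages to one decimal place. The abstract does not explain why hallucination was scored on 1425 rather than 1434 ratings.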
Parallel Multi-Circuit Quantum Feature Fusion in Hybrid Quantum-Classical Convolutional Neural Networks for Breast Tumor Classification
Quantum machine learning has emerged as a promising approach to improve feature extraction and classification tasks in high-dimensional data domains such as medical imaging. In this work, we present a hybrid Quantum-Classical Convolutional Neural Network (QCNN) architecture designed for the binary classification of the BreastMNIST dataset, a standardized benchmark for distinguishing between benign and malignant breast tumors. Our architecture integrates classical convolutional feature extraction with two distinct quantum circuits: an amplitude-encoding variational quantum circuit (VQC) and an angle-encoding VQC with circular entanglement, both implemented on four qubits. These circuits generate quantum feature embeddings that are fused with classical features to form a joint feature space, which is subsequently processed by a fully connected classifier. To ensure fairness, the hybrid QCNN is parameter-matched against a baseline classical CNN, allowing us to isolate the contribution of quantum layers. Both models are trained under identical conditions using the Adam optimizer and binary cross-entropy loss. Experimental evaluation in five independent runs demonstrates that the hybrid QCNN achieves statistically significant improvements in classification accuracy compared to the classical CNN, as validated by a one-sided Wilcoxon signed-rank test (p = 0.03125) and supported by a large effect size of Cohen's d = 2.14. Our results indicate that hybrid QCNN architectures can leverage entanglement and quantum feature fusion to enhance medical image classification tasks. This work establishes a statistical validation framework for assessing hybrid quantum models in biomedical applications and highlights pathways for scaling to larger datasets and deployment on near-term quantum hardware.
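The reported p-value has a simple interpretation. With five paired runs, the exact one-sided Wilcoxon signed-rank test can take only values that are multiples of 1/32, and 0.03125 = 1/32 is the smallest of them. The sketch below enumerates the null distribution to show this; it assumes an exact test with no ties or zero differences, which the abstract does not state, and the function name is hypothetical.

```python
from itertools import product


def exact_one_sided_p(n_pairs, w_plus_observed):
    """Exact P(W+ >= observed) under the null of symmetric differences.

    W+ is the sum of the ranks (1..n_pairs) of the positive differences.
    Under the null, every assignment of signs to ranks is equally likely.
    """
    ranks = range(1, n_pairs + 1)
    total = 0
    at_least_as_extreme = 0
    for signs in product((0, 1), repeat=n_pairs):
        w_plus = sum(rank for rank, positive in zip(ranks, signs) if positive)
        total += 1
        if w_plus >= w_plus_observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# All five differences positive gives W+ = 1 + 2 + 3 + 4 + 5 = 15.
p_all_positive = exact_one_sided_p(5, 15)
# The next most extreme outcome: only the smallest difference is negative.
p_next = exact_one_sided_p(5, 14)
print(p_all_positive, p_next)
```

Under these assumptions, p = 0.03125 is consistent with the hybrid model outperforming the classical CNN in all five runs. It is also the floor for n = 5, so the test cannot yield stronger evidence without more runs.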
Cross-Cancer Knowledge Transfer in WSI-based Prognosis Prediction
Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm in which each cancer corresponds to a single model. However, this paradigm naturally struggles to scale to rare tumors and cannot leverage knowledge from other cancers. While multi-task learning frameworks have been explored recently, they often place high demands on computational resources and require extensive training on ultra-large, multi-cancer WSI datasets. To this end, this paper shifts the paradigm to knowledge transfer and presents the first preliminary yet systematic study on cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It comprises three major parts. (1) We curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors). (2) Beyond a simple evaluation merely for benchmarking, we design a range of experiments to gain deeper insights into the underlying mechanism behind transferability. (3) We further show the utility of cross-cancer knowledge transfer, by proposing a routing-based baseline approach (ROUPKT) that could often efficiently utilize the knowledge transferred from off-the-shelf models of other cancers. CROPKT could serve as an inception that lays the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.
comment: 24 pages (11 figures and 10 tables)
Beyond Subspace Isolation: Many-to-Many Transformer for Light Field Image Super-resolution
The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers leads to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opted to decompose the data into a number of lower-dimensional subspaces and perform Transformers in each sub-space individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, explicitly preventing comprehensive optimization on all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT is enabled to comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representation.
comment: Accepted by IEEE Transactions on Multimedia
PANDA -- Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning AAAI 2026
Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA's capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.
comment: Accepted in AAAI 2026 Main Technical Track
Multimedia
Speculating on the Role of Media Architecture in Post-disaster Rebuilding and Recovery: Insights from Architects and Interaction Designers
In post-disaster contexts, design is not only about rebuilding structures but also about reimagining how architecture can become a communicative medium that supports recovery, resilience, and collective memory. While recent studies have expanded the understanding of media architecture from aesthetic urban screens to participatory civic infrastructures, there remains limited empirical research on its potential role in post-disaster contexts. In particular, opportunities exist to explore how architecture and interaction design might speculate on media architecture's role in rebuilding and recovery efforts for post-disaster permanent housing, especially when conceptualizing disasters as active agents that reshape design processes. Following the Kahramanmaras earthquake of February 6, 2023, we conducted two focus groups with architects and interaction designers in the case of Antakya, Turkey, building on affected residents' expectations for post-earthquake permanent housing. Our analysis revealed three critical dimensions of how future media architecture may support post-disaster housing: (1) as a facilitator of individuals' social connections to their community, (2) as an enabler of multispecies participation and collective efforts, and (3) as a mediator of heritage preservation and revival. With novel perspectives, we contribute a three-dimension lens for media architecture in permanent homes; a co-speculative, card-based process bridging residents' insights and expert design; and ten situated speculative design ideas with implications for the design of post-disaster permanent homes.
STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition
Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework enabling natural voice communication at approximately 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, applying tailored compression: context-aware text encoding (approximately 70 bps), sparse prosody transmission via TTS interpolation (less than 14 bps at 0.1-1 Hz), and amortized speaker embedding. Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS greater than 4.26). We also discover a bimodal quality distribution with prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade due to perceptual discontinuities--guiding optimal configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low bandwidth scenarios.
comment: The complete source code and online speech reconstruction demo are publicly available at https://github.com/dywsy21/STCTS
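The reduction factors in the STCTS abstract can be reproduced from the stated bitrates. The sketch below is a back-of-the-envelope check that treats "approximately 80 bps" as exactly 80 bps, which is an assumption; the function and variable names are illustrative only.

```python
def reduction_factor(baseline_bps, target_bps):
    """How many times smaller the target bitrate is than the baseline."""
    return baseline_bps / target_bps


STCTS_BPS = 80  # assumption: "approximately 80 bps" taken as exactly 80

vs_opus = reduction_factor(6_000, STCTS_BPS)      # Opus at 6 kbps
vs_encodec = reduction_factor(1_000, STCTS_BPS)   # EnCodec at 1 kbps

# Upper bound on the two explicitly budgeted streams in the abstract:
# about 70 bps of text plus less than 14 bps of prosody.
component_upper_bound = 70 + 14

print(vs_opus, vs_encodec, component_upper_bound)
```

With these inputs the Opus ratio is exactly 75 and the EnCodec ratio is 12.5, in line with the "75x" and "12x" quoted in the abstract. The two stated components sum to at most about 84 bps, which is compatible with the approximate 80 bps total.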
When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI
Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLM ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.
SizeGS: Size-aware Compression of 3D Gaussian Splatting via Mixed Integer Programming
Recent advances in 3D Gaussian Splatting (3DGS) have greatly improved 3D reconstruction. However, its substantial data size poses a significant challenge for transmission and storage. While many compression techniques have been proposed, they fail to efficiently adapt to fluctuating network bandwidth, leading to resource wastage. We address this issue from the perspective of size-aware compression, where we aim to compress 3DGS to a desired size by quickly searching for suitable hyperparameters. Through a measurement study, we identify key hyperparameters that affect the size -- namely, the reserve ratio of Gaussians and bit-width settings for Gaussian attributes. Then, we formulate this hyperparameter optimization problem as a mixed-integer nonlinear programming (MINLP) problem, with the goal of maximizing visual quality while respecting the size budget constraint. To solve the MINLP, we decouple this problem into two parts: discretely sampling the reserve ratio and determining the bit-width settings using integer linear programming (ILP). To solve the ILP more quickly and accurately, we design a quality loss estimator and a calibrated size estimator, as well as implement a CUDA kernel. Extensive experiments on multiple 3DGS variants demonstrate that our method achieves state-of-the-art performance in post-training compression. Furthermore, our method can achieve comparable quality to leading training-required methods after fine-tuning.
comment: Automatically compressing 3DGS into the desired file size while maximizing the visual quality
Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation WACV 2026
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at https://github.com/ChongQingNoSubway/Prompt-OT
comment: Accepted to WACV 2026
Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.
Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling AAAI 2026
Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
comment: 10 pages, 6 figures, accepted by AAAI 2026
Computation and Language
A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data
Generative artificial intelligence tools, like ChatGPT, are an increasingly utilized resource among computational social scientists. Nevertheless, there remains space for improved understanding of the performance of ChatGPT in complex tasks such as classifying and annotating datasets containing nuanced language. In this paper, we measure the performance of GPT-4 on one such task and compare results to human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given rapid changes in technological advancement of large language models. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Both quantitative and qualitative evaluations of results demonstrate that while including label definitions in prompts may help performance, overall GPT-4 has difficulty classifying nuanced language. Qualitative analysis reveals four specific findings. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.
comment: About 15 pages, draft version of accepted conference full paper. Published paper to follow
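The evaluation metrics named in the abstract above (precision, recall, F1) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the function name `precision_recall_f1` and the label lists are hypothetical and invented for the example.

```python
from collections import Counter


def precision_recall_f1(gold, predicted, positive_label):
    """Return (precision, recall, F1) for one label treated as the positive class."""
    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == positive_label and g == positive_label:
            counts["tp"] += 1  # model and human both chose the label
        elif p == positive_label:
            counts["fp"] += 1  # model chose the label, human did not
        elif g == positive_label:
            counts["fn"] += 1  # human chose the label, model did not
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical annotations, made up purely for illustration.
human = ["support", "oppose", "support", "neutral", "support"]
model = ["support", "support", "oppose", "neutral", "support"]
p, r, f1 = precision_recall_f1(human, model, "support")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Treating the human annotation as the reference is an assumption of this sketch; the abstract only states that model output is compared to human annotators.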
Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs
Large Language Models have rapidly advanced in their ability to interpret and generate natural language. In enterprise settings, they are frequently augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which appear highly credible and escape routine human review. Current mitigation strategies either depend on costly, large-scale gold-standard Q&A curation or rely on secondary model verification, neither of which offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The objective is to provide end users with a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.
Sycophancy Claims about Language Models: The Missing Human-in-the-Loop NeurIPS 2025
Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research.
comment: NeurIPS 2025 Workshop on LLM Evaluation and ICLR 2025 Workshop on Bi-Directional Human-AI Alignment
Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning
The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure highlights a critical gap: the need for more challenging benchmarks and more robust detection architectures. To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built with a mix of open and closed-source models and a curated OOD test set designed specifically to foster the development of truly generalizable detectors. Alongside this benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in a mixed signal, offer a powerful tell-tale sign of synthesis. CLAM is designed to test this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wav2Vec2) to create parallel representations of the audio. These representations are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss which trains the model to distinguish between coherent and artificially mismatched stream pairings, enhancing its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state-of-the-art in synthetic music forensics. It achieves an F1 score of 0.925 on our challenging MoM benchmark.
comment: Accepted at Transactions on Machine Learning Research (TMLR)
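The dual-loss objective described in the abstract above (binary cross-entropy plus a triplet loss over stream pairings) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the CLAM implementation: the Euclidean distance, the margin, the weighting factor `weight`, and the choice of anchor, positive, and negative embeddings are all assumptions, as are the function names.

```python
import math


def binary_cross_entropy(prob_synthetic, label):
    """BCE for one example; label is 1 for synthetic, 0 for human-made."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob_synthetic, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the coherent pairing closer than the mismatched one."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def dual_loss(prob_synthetic, label, anchor, positive, negative, weight=0.5, margin=1.0):
    """Assumed combination: classification loss plus a weighted triplet term."""
    return binary_cross_entropy(prob_synthetic, label) + weight * triplet_loss(
        anchor, positive, negative, margin
    )


# Hypothetical 2-D embeddings: the anchor is one stream, the positive is the
# matching other stream, the negative is an artificially mismatched stream.
loss = dual_loss(0.9, 1, anchor=[0.0, 0.0], positive=[0.1, 0.0], negative=[3.0, 4.0])
print(round(loss, 4))
```

When the mismatched pairing is already farther away than the coherent pairing by more than the margin, the triplet term is zero and only the classification loss remains.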
ART: Adaptive Response Tuning Framework -- A Multi-Agent Tournament-Based Approach to LLM Response Optimization
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, single-model responses often exhibit inconsistencies, hallucinations, and varying quality across different query domains. This paper presents ART (Adaptive Response Tuning), a novel framework that employs tournament-style ELO ranking and multi-agent reasoning to systematically optimize LLM outputs. By enabling multiple LLM agents to compete, critique, and collaborate through structured tournament workflows, ART produces consensus responses that outperform individual model outputs. Our framework introduces configurable tournament parameters, dynamic agent selection, and multiple consensus fusion strategies. Experimental evaluations demonstrate significant improvements in response accuracy, coherence, and reliability compared to baseline single-model approaches. The ART framework provides a scalable, production-ready solution for applications requiring high-quality, vetted LLM responses, achieving an 8.4% improvement in overall quality metrics and R² values exceeding 0.96 in ELO rating convergence.
comment: 8 pages, 5 figures, 5 tables. Conference-style paper
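The tournament-style rating mentioned in the abstract above can be illustrated with the standard Elo update rule. This is a minimal sketch of the classic formula only; the abstract does not specify ART's K-factor, starting ratings, or how match outcomes are judged, so `k=32.0`, the 1500 starting rating, and the function names are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A's response wins, 0.5 for a draw, 0.0 if it loses.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical agents start level; agent A's response is judged better.
agent_a, agent_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(agent_a, agent_b)
```

The update is zero-sum: whatever one agent gains, the other loses, so the total rating in the pool stays constant across matches.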
Prism: A Minimal Compositional Metalanguage for Specifying Agent Behavior
Prism is a small, compositional metalanguage for specifying the behaviour of tool-using software agents. Rather than introducing ad hoc control constructs, Prism is built around a fixed core context, Core1, which provides a minimal background grammar of categories (numbers, strings, user prompts, tools) together with abstract combinators for booleans, predicates, pairs, and lists. Agent policies are written as ordinary expressions using a single abstraction operator, so that conditionals appear as selections between alternatives instead of imperative if-else blocks. Domains extend the core by defining their own context mini-grammars that introduce new categories, predicates, and external tools while reusing the same compositional machinery. We illustrate this with worked examples from thermostat control, home security, e-commerce recommendation, and medical monitoring, showing how natural language decision rules can be mapped to inspectable, executable policies. From a linguistic perspective, Prism enforces a clear separation between a reusable grammar-like core and domain-specific lexicons and treats tools as bridges between internal policy representations and the external world. From an engineering perspective, it offers a compact interface language for agent control, making the space of possible actions explicit and amenable to analysis, verification, and safety constraints.
NarraBench: A Comprehensive Framework for Narrative Benchmarking
We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas -- including narrative events, style, perspective, and revelation -- are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.
Superposition Yields Robust Neural Scaling NeurIPS 2025
The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic's toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling inversely with model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
comment: Best Paper Runner-up at NeurIPS 2025
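The claim in the abstract above that loss scales inversely with model dimension corresponds to a slope of about -1 on a log-log plot of loss against dimension. The sketch below shows how such an exponent could be estimated with an ordinary least-squares fit; the data are synthetic, generated to follow an exact inverse law for illustration, and are not results from the paper. The function name is hypothetical.

```python
import math


def fit_power_law_exponent(dims, losses):
    """Fit loss ~ C * dim**(-alpha) by least squares in log-log space; return alpha."""
    xs = [math.log(d) for d in dims]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope  # the power-law exponent is the negated log-log slope


# Synthetic losses that follow loss = 3 / dim exactly (illustrative only).
dims = [64, 128, 256, 512]
losses = [3.0 / d for d in dims]
alpha = fit_power_law_exponent(dims, losses)
print(round(alpha, 6))
```

An estimated exponent near 1 would be consistent with inverse scaling in model dimension; other values would indicate a different power law.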
PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models
Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt and dilutes the influence of trigger tokens, thereby achieving enhanced robustness. Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with any existing defense, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.
Escaping Collapse: The Strength of Weak Data for Large Language Model Training
Synthetically generated data plays an increasingly large role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even "collapse", after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves. Our analysis is inspired by boosting, a classic machine learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. The approach we analyze subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful, and also suggests opportunities for future improvement. We present experiments that validate our theory, and show that dynamically focusing labeling resources on the most challenging examples -- in much the same way that boosting focuses the efforts of the weak learner -- leads to improved performance.
Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification
The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. Although modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains underexplored. This paper follows the guidance of the central dogma to redesign both the data and model pipeline and offers a comprehensive framework, Life-Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. As for the model, we design a codon tokenizer and a hybrid long-sequence architecture to encode the interactions between coding and non-coding regions through masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
comment: Preprint V3 (10 pages main text)
Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs
Despite autoregressive large language models (arLLMs) being the current dominant paradigm in language modeling, effectively updating these models to incorporate new factual knowledge remains difficult. They resist knowledge injection via fine-tuning due to inherent shortcomings such as the "reversal curse" -- the challenge of answering questions that reverse the original information order in the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and freedom from the "reversal curse" in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e., whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs, evaluating them with forward- and backward-style question answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracies on both forward and backward QAs without paraphrases; adding paraphrases yields only marginal gains. Inspired by the dLLM's performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. This proposed method drastically improves the data efficiency of arLLM fine-tuning, effectively closing its performance gap with dLLMs. We further show that the masked fine-tuning paradigm of arLLMs can be extended to the supervised fine-tuning (SFT) of mathematical capability. Across two models and two datasets, our masked SFT outperforms regular SFT.
Human-Computer Interaction
Speculating on the Role of Media Architecture in Post-disaster Rebuilding and Recovery: Insights from Architects and Interaction Designers
In post-disaster contexts, design is not only about rebuilding structures but also about reimagining how architecture can become a communicative medium that supports recovery, resilience, and collective memory. While recent studies have expanded the understanding of media architecture from aesthetic urban screens to participatory civic infrastructures, there remains limited empirical research on its potential role in post-disaster contexts. In particular, opportunities exist to explore how architecture and interaction design might speculate on media architecture's role in rebuilding and recovery efforts for post-disaster permanent housing, especially when conceptualizing disasters as active agents that reshape design processes. Following the Kahramanmaras earthquake on February 6, 2023, we conducted two focus groups with architects and interaction designers for the case of Antakya, Turkey, building on affected residents' expectations for post-earthquake permanent housing. Our analysis revealed three critical dimensions of how future media architecture may support post-disaster housing: (1) as a facilitator of individuals' social connections to their community, (2) as an enabler of multispecies participation and collective efforts, and (3) as a mediator of heritage preservation and revival. With novel perspectives, we contribute a three-dimensional lens for media architecture in permanent homes; a co-speculative, card-based process bridging residents' insights and expert design; and ten situated speculative design ideas with implications for the design of post-disaster permanent homes.
Chameleon: Automated Color Palette Adaptation for Dark Mode Data Visualizations
Dark mode has gained widespread adoption across mobile platforms due to its benefits in reducing eye strain and conserving battery life. However, when the mobile system switches to dark mode, most visualizations remain designed for light mode, causing visual disruptions. Existing methods, such as manual adjustment or color inversion, are either time-consuming or fail to preserve the semantic meaning of colors in visualizations, making them less effective in dark mode. To address this challenge, we propose Chameleon, an algorithm that automatically transforms light mode visualizations into dark mode while maintaining visual clarity and color semantics. By optimizing for luminance contrast, color consistency, and adjacent color differences, Chameleon ensures that the transformed visualizations are legible and visually coherent. Our evaluation includes a case study, an expert interview, a system evaluation, and a user study, which together demonstrate that Chameleon is effective at translating visualizations for dark mode.
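One common way to quantify the luminance contrast mentioned in the abstract above is the WCAG 2.x contrast ratio between two sRGB colors. The sketch below implements that standard formula; whether Chameleon uses this particular definition is an assumption, and the function names and example colors are hypothetical.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with 8-bit channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio, from 1 (identical luminance) to 21 (black on white)."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


# A hypothetical mark color checked against a light and a dark background.
mark = (31, 119, 180)
print(contrast_ratio(mark, (255, 255, 255)), contrast_ratio(mark, (18, 18, 18)))
```

A palette adaptation method could use a measure like this to check that each mark color remains sufficiently distinct from the dark background after transformation.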
Truck drivers and automation: A methodology for identifying and supporting workforce transition in the Australian road freight sector
Transition to autonomous trucks (ATs) is coming, and is expected to create both challenges and opportunities for the driver workforce. This paper presents a novel methodology for identifying viable occupational transitions for truck drivers as transport automation advances. Unlike traditional workforce transition analyses that focus primarily on skill similarity, wages, and employment demand, this methodology incorporates four integrated components: task-level automation analysis, skill similarity assessment, labour market conditions analysis, and empirical validation using historical transition patterns. Applying this methodology to Australian truck drivers shows that while ATs will automate core driving tasks, many non-driving responsibilities will continue requiring a human, suggesting occupational evolution rather than wholesale displacement. A skill similarity analysis identifies 17 occupations with high transferability, while labour market analysis reveals significant trade-offs between wage levels and job availability across potential transition pathways. Key findings indicate that bus and coach driving, along with earthmoving plant operation, emerge as high-priority transition options, offering comparable wages and positive employment growth. Delivery and forklift driving present medium-priority pathways with abundant opportunities but lower wages. A regression analysis of historical transitions confirms that skill similarity, wage differentials, geographic accessibility, and qualification requirements all significantly influence actual transition patterns, with some viable pathways currently underutilised. The research provides policymakers, industry stakeholders, and educational institutions with evidence-based guidance for supporting workforce adaptation to technological change. The proposed methodology is generalisable beyond trucking to other sectors facing automation.
comment: 26 pages, 1 appendix, 8 tables, 3 figures
An Approach to Joint Hybrid Decision Making between Humans and Artificial Intelligence
Due to the progress in artificial intelligence, it is important to understand how capable artificial agents should be used when interacting with humans, since high-level authority and responsibility often remain with the human agent. However, integrated frameworks that can account for heterogeneous agents and draw on different scientific fields, such as human-factors engineering and artificial intelligence, are lacking. Therefore, joint hybrid intelligence is described as a framework that abstracts humans and artificial intelligence as decision-making agents. A general definition of intelligence is provided on the basis of decision-making competence, which is applicable to agents of different sorts. This framework is used to propose the interrelated design space of joint hybrid intelligence, which aims at integrating the heterogeneous capabilities of humans and artificial intelligence. At the core of this design space lies joint agent engineering, with the goal of integrating the design subspaces of operator training, artificial intelligence engineering, and interface design by developing joint agent patterns. The "extended swarming" approach to human-swarm interaction is discussed as an example of such a pattern.
Significant Other AI: Identity, Memory, and Emotional Regulation as Long-Term Relational Intelligence
Significant Others (SOs) stabilize identity, regulate emotion, and support narrative meaning-making, yet many people today lack access to such relational anchors. Recent advances in large language models and memory-augmented AI raise the question of whether artificial systems could support some of these functions. Existing empathic AIs, however, remain reactive and short-term, lacking autobiographical memory, identity modeling, predictive emotional regulation, and narrative coherence. This manuscript introduces Significant Other Artificial Intelligence (SO-AI) as a new domain of relational AI. It synthesizes psychological and sociological theory to define SO functions and derives requirements for SO-AI, including identity awareness, long-term memory, proactive support, narrative co-construction, and ethical boundary enforcement. A conceptual architecture is proposed, comprising an anthropomorphic interface, a relational cognition layer, and a governance layer. A research agenda outlines methods for evaluating identity stability, longitudinal interaction patterns, narrative development, and sociocultural impact. SO-AI reframes AI-human relationships as long-term, identity-bearing partnerships and provides a foundational blueprint for investigating whether AI can responsibly augment the relational stability many individuals lack today.
Behavioral Indicators of Loneliness: Predicting University Students' Loneliness Scores from Smartphone Sensing Data
Loneliness is a critical mental health issue among university students, yet traditional monitoring methods rely primarily on retrospective self-reports and often lack real-time behavioral context. This study explores the use of passive smartphone sensing data to predict loneliness levels, addressing the limitations of existing approaches in capturing its dynamic nature. We integrate smartphone sensing with machine learning and large language models, respectively, to develop generalized and personalized models. Our Random Forest generalized models achieved mean absolute errors of 3.29 at midterm and 3.98 (out of 32) at the end of semester on the UCLA Loneliness Scale (short form), identifying smartphone screen usage and location mobility as key predictors. The one-shot approach leveraging large language models reduced prediction errors by up to 42% compared to zero-shot inference. The one-shot results from personalized models highlighted screen usage, application usage, battery, and location transitions as salient behavioral indicators. These findings demonstrate the potential of smartphone sensing data for scalable and interpretable loneliness detection in digital mental health.
MILE: A Mechanically Isomorphic Exoskeleton Data Collection System with Fingertip Visuotactile Sensing for Dexterous Manipulation
Imitation learning provides a promising approach to dexterous hand manipulation, but its effectiveness is limited by the lack of large-scale, high-fidelity data. Existing data-collection pipelines suffer from inaccurate motion retargeting, low data-collection efficiency, and missing high-resolution fingertip tactile sensing. We address this gap with MILE, a mechanically isomorphic teleoperation and data-collection system co-designed from human hand to exoskeleton to robotic hand. The exoskeleton is anthropometrically derived from the human hand, and the robotic hand preserves one-to-one joint-position isomorphism, eliminating nonlinear retargeting and enabling precise, natural control. The exoskeleton achieves a multi-joint mean absolute angular error below one degree, while the robotic hand integrates compact fingertip visuotactile modules that provide high-resolution tactile observations. Built on this retargeting-free interface, we teleoperate complex, contact-rich in-hand manipulation and efficiently collect a multimodal dataset comprising high-resolution fingertip visuotactile signals, RGB-D images, and joint positions. The teleoperation pipeline achieves a mean success rate improvement of 64%. Incorporating fingertip tactile observations further increases the success rate by an average of 25% over the vision-only baseline, validating the fidelity and utility of the dataset. Further details are available at: https://sites.google.com/view/mile-system.
Evolving Paradigms in Task-Based Search and Learning: A Comparative Analysis of Traditional Search Engine with LLM-Enhanced Conversational Search System
Large Language Models (LLMs) are rapidly reshaping information retrieval by enabling interactive, generative, and inference-driven search. While traditional keyword-based search remains central to web and academic information access, it often struggles to support multi-step reasoning and exploratory learning tasks. LLM-powered search interfaces, such as ChatGPT and Claude, introduce new capabilities that may influence how users formulate queries, navigate information, and construct knowledge. However, empirical understanding of these effects is still limited. This study compares search behavior and learning outcomes in two environments: a standard search engine and an LLM-powered search system. We investigate (1) how search strategies, query formulation, and evaluation behaviors differ across systems, and (2) how LLM use affects comprehension, knowledge integration, and critical thinking during search-based learning tasks. Findings offer insight into how generative AI shapes information-seeking processes and contribute to ongoing discussions in information retrieval, human-AI interaction, and technology-supported learning.
Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through task-adaptive region-of-interest highlighting and contextual spatial retrieval, the system guides human attention to information-dense areas while supporting human-in-the-loop refinement. The agent dynamically invokes coordinate-aware tools for complex queries (selection, measurement, comparison, and actuation), grounding language understanding in physical operations. The modular architecture supports plug-and-use vision-language models without retraining, establishing AR agents as intermediaries that augment MLLMs with real-world spatial intelligence for interactive scene understanding. We also introduce GroundedAR-Bench, an evaluation framework for language-driven real-world localization and relation grounding across diverse environments.
The Professional Challenges of Industrial Designer in Industry 4.0
Industry 4.0 refers to an industrial ecology that will merge information systems, physical systems, and service systems into an integrated platform. Because industrial designers today either conceive the physical part of products or design the user interfaces of computer systems, the new industrial ecology gives them a chance to redefine their roles in the R&D workflow. In this paper, we discuss the qualities required of industrial designers in the new era, based on an investigation among Chinese enterprises. We also discuss how to promote these qualities through educational programs.
"Why the face?": Exploring Robot Error Detection Using Instrumented Bystander Reactions
How do humans recognize and rectify social missteps? We achieve social competence by looking around at our peers, decoding subtle cues from bystanders - a raised eyebrow, a laugh - to evaluate the environment and our actions. Robots, however, struggle to perceive and make use of these nuanced reactions. By employing a novel neck-mounted device that records facial expressions from the chin region, we explore the potential of previously untapped data to capture and interpret human responses to robot error. First, we develop NeckNet-18, a 3D facial reconstruction model to map the reactions captured through the chin camera onto facial points and head motion. We then use these facial responses to develop a robot error detection model which outperforms standard methodologies such as using OpenFace or video data, generalizing well especially for within-participant data. Through this work, we argue for expanding human-in-the-loop robot sensing, fostering more seamless integration of robots into diverse human environments, pushing the boundaries of social cue detection and opening new avenues for adaptable robotics.
Is General-Purpose AI Reasoning Sensitive to Data-Induced Cognitive Biases? Dynamic Benchmarking on Typical Software Engineering Dilemmas
Human cognitive biases in software engineering can lead to costly errors. While general-purpose AI (GPAI) systems may help mitigate these biases due to their non-human nature, their training on human-generated data raises a critical question: Do GPAI systems themselves exhibit cognitive biases? To investigate this, we present the first dynamic benchmarking framework to evaluate data-induced cognitive biases in GPAI within software engineering workflows. Starting with a seed set of 16 hand-crafted realistic tasks, each featuring one of 8 cognitive biases (e.g., anchoring, framing) and corresponding unbiased variants, we test whether bias-inducing linguistic cues unrelated to task logic can lead GPAI systems from correct to incorrect conclusions. To scale the benchmark and ensure realism, we develop an on-demand augmentation pipeline relying on GPAI systems to generate task variants that preserve bias-inducing cues while varying surface details. This pipeline ensures correctness (88-99% on average, according to human evaluation), promotes diversity, and controls reasoning complexity by leveraging Prolog-based reasoning. We evaluate leading GPAI systems (GPT, LLaMA, DeepSeek) and find a consistent tendency to rely on shallow linguistic heuristics over more complex reasoning. All systems exhibit bias sensitivity (6-35%), which increases with task complexity (up to 49%) and highlights risks in AI-driven software engineering.
Exploring Heart Rate Variability and Heart Rate Dynamics Using Wearables Before, During, and After Speech Activity: Insights from a Controlled Study in a Low-Middle-Income Country
Conventional methods for diagnosing Social Anxiety Disorder (SAD), such as clinical interviews and self-reported questionnaires, often face accessibility barriers and subjective biases, underscoring the need for objective physiological markers. This study investigates heart rate (HR) and heart rate variability (HRV) as potential indicators of SAD by analyzing cardiovascular responses to anxiety-inducing speech tasks across four distinct phases: baseline, anticipation, speech activity, and reflection. In a controlled laboratory setting, we analyzed data from 51 participants and found that HRV decreased and HR increased during the anticipation and speech activity phases compared to baseline, while the reflection phase showed a reversal, with HRV increasing and HR decreasing. Participants with SAD exhibited lower HRV, higher HR, and greater self-reported anxiety than non-SAD participants across all phases. These findings enhance our understanding of the physiological signatures of social anxiety and have implications for developing wearable-based monitoring systems for SAD detection and intervention. To support further research, we also release a dataset capturing multi-phase anxiety responses, advancing physiological-based mental health assessment.
A Neuro-inspired Theory of Joint Human-Swarm Interaction ICRA
Human-swarm interaction (HSI) is an active research challenge in the realms of swarm robotics and human-factors engineering. Here we apply a cognitive systems engineering perspective and introduce a neuro-inspired joint systems theory of HSI. The mindset defines predictions for adaptive, robust and scalable HSI dynamics and therefore has the potential to inform human-swarm loop design.
comment: ICRA Workshop on Human-Swarm Interaction 2020
Using Generative AI to Uncover What Drives Player Enjoyment in PC and VR Games
As video games continue to evolve, understanding what drives player enjoyment remains a key challenge. Player reviews provide valuable insights, but their unstructured nature makes large-scale analysis difficult. This study applies generative AI and machine learning, leveraging the Microsoft Phi-4 small language model (SLM) and Google Cloud, to quantify and analyze game reviews from the Steam and Meta Quest stores. The approach converts qualitative feedback into structured data, enabling comprehensive evaluation of key game design elements, monetization models, and platform-specific trends. The findings reveal distinct patterns in player preferences across PC and VR games, highlighting factors that contribute to higher player enjoyment. By using Google Cloud for large-scale data storage and processing, this study establishes a scalable framework for game review analysis. The study's insights offer actionable guidance for game developers, helping optimize game mechanics, pricing strategies, and player engagement.
comment: The Steam dataset used in this study can be accessed at: https://data.mendeley.com/datasets/jxy85cr3th/2
Computer Vision and Pattern Recognition
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open-sourced.
comment: Video-R2 Technical Report
Video-CoM: Interactive Video Reasoning via Chain of Manipulations
Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still "think about videos", i.e., once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine-grained spatio-temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video-CoM-Instruct, an 18K instruction-tuning dataset curated for multi-step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step-level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large-scale models. Ablation studies demonstrate that reasoning-aware rewards improve both accuracy and interpretability. Code: https://github.com/mbzuai-oryx/Video-CoM
comment: Technical Report
AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
comment: Homepage: https://hkust-c4g.github.io/AnyTalker-homepage
Visual Generation Tuning
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying visual generation capabilities of any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE by aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT shows significant scaling promise and is versatile enough to endow any VLM trained for multimodal understanding with visual generation capabilities, which paves a new avenue for exploring next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
Object-Centric Data Synthesis for Category-level Object Detection
Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capability to new object classes requires large amounts of annotated training data, which is costly and time-consuming to acquire, especially for long-tailed classes with insufficient representation in existing datasets. Here, we introduce the object-centric data setting, in which limited data is available in the form of object-centric data (multi-view images or 3D models), and systematically evaluate the performance of four different data synthesis methods for fine-tuning object detection models on novel object categories in this setting. The approaches are based on simple image processing techniques, 3D rendering, and image diffusion models, and use object-centric data to synthesize realistic, cluttered images with varying contextual coherence and complexity. We assess how these methods enable models to achieve category-level generalization in real-world data, and demonstrate significant performance boosts within this data-constrained experimental setting.
comment: 10 pages, 10 figures
Physics-Informed Neural Networks for Thermophysical Property Retrieval
Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have wide-ranging uses, but a critical application lies in quantifying how building facade renovation reduces thermal transmittance, a key determinant of building energy efficiency. However, solving inverse heat problems with non-invasive data collected in situ is error-prone due to environmental variability or deviations from theoretically assumed conditions. Hence, current methods for measuring thermal conductivity are either invasive, require lengthy observation periods, or are sensitive to environmental and experimental conditions. Here, we present a PINN-based iterative framework to estimate the thermal conductivity k of a wall from a set of thermographs; our framework alternates between solving the forward heat problem with a PINN for a fixed k and optimizing k by comparing the thermographs with the surface temperatures predicted by the PINN, repeating until the estimate of k converges. Using both environmental data captured by a weather station and data generated from Finite-Volume-Method software simulations, we accurately predict k across different environmental conditions and data collection sampling times, provided that the temperature profile of the wall at dawn is close to steady state. Although violating the steady-state assumption impacts the accuracy of the estimate of k, we show that our proposed framework still only exhibits a maximum MAE of 4.0851. Our work demonstrates the potential of PINN-based methods for reliable estimation of material properties in situ and under realistic conditions, without lengthy measurement campaigns. Given the lack of research on using machine learning, and more specifically PINNs, for solving in-situ inverse problems, we expect our work to be a starting point for more research on the topic.
comment: 26 pages, 4 figures, 3 tables
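The alternating scheme described in the abstract above (solve the forward heat problem for a fixed k, then update k from the misfit to the observations) can be illustrated with a toy sketch. This is not the authors' method: an explicit finite-difference solver stands in for the PINN, a golden-section search stands in for their optimizer, and the observation is a synthetic mid-wall temperature rather than a thermograph. The wall geometry, `RHO_C`, boundary temperatures, and all function names are assumptions made for illustration.

```python
import math

RHO_C = 1.5e6  # volumetric heat capacity in J/(m^3 K); assumed value


def forward_mid_temps(k, nx=11, length=0.2, dt=60.0, steps=600):
    """Forward problem for a fixed k: explicit finite differences for 1D
    heat conduction. Returns the mid-wall temperature after each step."""
    dx = length / (nx - 1)
    r = (k / RHO_C) * dt / dx ** 2  # must stay <= 0.5 for stability
    temps = [10.0] * nx
    temps[0] = 30.0  # warm face held at 30, cold face held at 10
    mid = nx // 2
    history = []
    for _ in range(steps):
        new = temps[:]
        for i in range(1, nx - 1):
            new[i] = temps[i] + r * (temps[i - 1] - 2 * temps[i] + temps[i + 1])
        temps = new
        history.append(temps[mid])
    return history


def misfit(k, observed):
    """Sum of squared differences between predicted and observed temperatures."""
    predicted = forward_mid_temps(k)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def estimate_k(observed, lo=0.1, hi=2.0, iters=40):
    """Golden-section search: each iteration runs one forward solve at a
    fixed candidate k, then shrinks the bracket around the best misfit."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - phi * (b - a)
    d = a + phi * (b - a)
    fc, fd = misfit(c, observed), misfit(d, observed)
    for _ in range(iters):
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = misfit(c, observed)
        else:
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = misfit(d, observed)
    return (a + b) / 2
```

Generating `observed = forward_mid_temps(0.8)` and calling `estimate_k(observed)` should recover a value close to 0.8 in this noise-free setting; real thermographic data would add noise and boundary effects that this sketch ignores.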
Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model
Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".
comment: Technical Report, Project page: https://hunyuan-gamecraft-2.github.io/
DisMo: Disentangled Motion Representations for Open-World Motion Transfer NeurIPS 2025
Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity and prompt adherence, overfitting to source structure or drifting from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo
comment: Accepted at NeurIPS 2025
MANTA: Physics-Informed Generalized Underwater Object Tracking WACV 2026
Underwater object tracking is challenging due to wavelength-dependent attenuation and scattering, which severely distort appearance across depths and water conditions. Existing trackers trained on terrestrial data fail to generalize to these physics-driven degradations. We present MANTA, a physics-informed framework integrating representation learning with tracking design for underwater scenarios. We propose a dual-positive contrastive learning strategy coupling temporal consistency with Beer-Lambert augmentations to yield features robust to both temporal and underwater distortions. We further introduce a multi-stage pipeline augmenting motion-based tracking with a physics-informed secondary association algorithm that integrates geometric consistency and appearance similarity for re-identification under occlusion and drift. To complement standard IoU metrics, we propose Center-Scale Consistency (CSC) and Geometric Alignment Score (GAS) to assess geometric fidelity. Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6 percent, while ensuring stable long-term generalized underwater tracking and efficient runtime.
comment: Accepted to the IEEE/CVF WACV 2026
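The MANTA abstract above mentions Beer-Lambert augmentations without giving details. As a rough illustration only, the Beer-Lambert law attenuates intensity exponentially with path length, I_c = J_c * exp(-beta_c * d), with a separate coefficient per color channel. The sketch below applies that law to an RGB image stored as nested lists. The coefficient values in `BETA`, the function name, and the omission of scattering or backscatter terms are all assumptions, not the paper's augmentation.

```python
import math

# Illustrative per-channel attenuation coefficients in 1/m (assumed values,
# not from the paper). Red light is absorbed fastest in water, blue slowest.
BETA = (0.60, 0.12, 0.05)


def beer_lambert_attenuate(image, depth_m, beta=BETA):
    """Apply I_c = J_c * exp(-beta_c * d) to every RGB pixel.

    image: list of rows, each row a list of (r, g, b) tuples in [0, 1].
    depth_m: optical path length d in meters.
    """
    gains = tuple(math.exp(-b * depth_m) for b in beta)
    return [
        [tuple(v * g for v, g in zip(pixel, gains)) for pixel in row]
        for row in image
    ]
```

Sampling `depth_m` at random per training clip would give appearance variants of the same frame that differ only by simulated water attenuation, which is one plausible way to build the second positive in a contrastive pair.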
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing separate encoders for understanding and generation, respectively, or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; then, it jointly optimizes the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, maintaining the ability of multimodal understanding, while yielding discrete tokens that are compatible with generation and fine-grained reconstruction. Besides, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction, with a promising scaling property in the autoregressive paradigm owing to its discrete merits.
comment: 19 pages, 10 figures
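The VQRAE abstract above reports a codebook utilization ratio. As a generic illustration of what that quantity measures, the sketch below performs plain nearest-neighbor vector quantization and counts the fraction of codebook entries that are selected at least once. It is a minimal sketch under the assumption of squared-L2 nearest-neighbor assignment; the function names are hypothetical, and the paper's tokenizer may assign codes differently.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda j: sqdist(f, codebook[j]))
        for f in features
    ]


def utilization_ratio(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size
```

For example, with a three-entry codebook where only two entries are ever the nearest neighbor of some feature, the ratio is 2/3; a 100% ratio means every entry is used by at least one feature in the evaluated set.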
DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline
Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via https://github.com/ymhzyj/DEAL-300K.
comment: 13 pages, 12 figures
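The DEAL-300K abstract above reports pixel-level F1 scores. For readers unfamiliar with the metric, the sketch below computes F1 over binary masks, treating each pixel as one prediction. It is a standard definition rather than the authors' evaluation script; the function name and the choice to return 0.0 when there are no true positives are assumptions.

```python
def pixel_f1(pred_mask, true_mask):
    """F1 score over binary masks given as equally sized lists of rows,
    where 1 marks an edited pixel and 0 an untouched one."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    if tp == 0:
        return 0.0  # convention assumed here for the degenerate case
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Whether scores are averaged per image or pooled over all pixels of a test split changes the reported number, so that choice should be checked against the benchmark's own protocol.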
Optimizing Multimodal Language Models through Attention-based Interpretability
Modern large language models are becoming multimodal, analyzing various data formats like text and images. While fine-tuning is effective for adapting these multimodal language models (MLMs) to downstream tasks, full fine-tuning is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small portion of model weights. However, MLMs are difficult to interpret, making it challenging to identify which components are most effective to train in order to balance efficiency and performance. We propose an attention-based interpretability method for MLMs that analyzes attention scores relative to image tokens. The core idea is to identify attention heads that focus on image key objects. We utilize this information to select optimal model components for PEFT in multimodal models. Our contributions include a method for identifying attention heads associated with image key objects, its application to PEFT for image captioning, and the creation of a new dataset containing images, key object masks, and their textual descriptions. We conducted experiments on MLMs with 2-3 billion parameters to validate the method's effectiveness. By calculating Head Impact (HI) scores, we quantify an attention head's focus on key objects, indicating its significance in image understanding. Our fine-tuning experiments demonstrate that adapting layers with the highest HI scores leads to the most significant shifts in metrics compared to pre-trained, randomly selected, or lowest-HI-score layers. This indicates that fine-tuning a small percentage (around 0.01%) of parameters in these crucial layers can substantially influence image understanding capabilities.
comment: Accepted for ICAI-2025 conference
SimScale: Learning to Drive via Real-World Simulation at Scale
Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code will be released.
comment: Project page: https://opendrivelab.com/SimScale
A Hierarchical Computer Vision Pipeline for Physiological Data Extraction from Bedside Monitors
In many low-resource healthcare settings, bedside monitors remain standalone legacy devices without network connectivity, creating a persistent interoperability gap that prevents seamless integration of physiological data into electronic health record (EHR) systems. To address this challenge without requiring costly hardware replacement, we present a computer vision-based pipeline for the automated capture and digitisation of vital sign data directly from bedside monitor screens. Our method employs a hierarchical detection framework combining YOLOv11 for accurate monitor and region of interest (ROI) localisation with PaddleOCR for robust text extraction. To enhance reliability across variable camera angles and lighting conditions, a geometric rectification module standardises the screen perspective before character recognition. We evaluated the system on a dataset of 6,498 images collected from open-source corpora and real-world intensive care units in Vietnam. The model achieved a mean Average Precision (mAP@50-95) of 99.5% for monitor detection and 91.5% for vital sign ROI localisation. The end-to-end extraction accuracy exceeded 98.9% for core physiological parameters, including heart rate, oxygen saturation (SpO2), and arterial blood pressure. These results demonstrate that a lightweight, camera-based approach can reliably transform unstructured information from screen captures into structured digital data, providing a practical and scalable pathway to improve information accessibility and clinical documentation in low-resource settings.
comment: 11 pages, 3 figures
Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories
Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow (Re-MeanFlow), a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at https://github.com/Xinxi-Zhang/Re-MeanFlow.
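For context on the "average velocity" mentioned in the abstract above, the usual MeanFlow formulation can be written as follows. The symbols are not taken from the abstract: $z_t$ denotes the state at time $t$, $v$ the instantaneous velocity field, and $u$ the average velocity over the interval $[r, t]$; the convention that $t = 1$ is noise and $t = 0$ is data is likewise an assumption here.

$$
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau, \qquad z_r = z_t - (t - r)\, u(z_t, r, t).
$$

The second identity follows from the first because $z_t - z_r = \int_r^t v(z_\tau, \tau)\, d\tau$. Setting $r = 0$ and $t = 1$ gives the one-step sampler $z_0 = z_1 - u(z_1, 0, 1)$. On a perfectly straight trajectory $v$ is constant along the path, so $u = v$, which is consistent with the abstract's point that straighter (rectified) trajectories make the average velocity easier to learn.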
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To this end, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, along with strong zero-shot generalization. Datasets and source code are released at https://github.com/MiliLab/UniGeoSeg.
comment: Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg
A Perceptually Inspired Variational Framework for Color Enhancement
Basic phenomenology of human color vision has been widely taken as an inspiration to devise explicit color correction algorithms. The behavior of these models in terms of significant image features (such as contrast and dispersion) can be difficult to characterize. To cope with this, we propose to use a variational formulation of color contrast enhancement that is inspired by the basic phenomenology of color perception. In particular, we devise a set of basic requirements to be fulfilled by an energy to be considered as 'perceptually inspired', showing that there is an explicit class of functionals satisfying all of them. We single out three explicit functionals that we consider of basic interest, showing similarities and differences with existing models. The minima of such functionals are computed using a gradient descent approach. We also present a general methodology to reduce the computational cost of the algorithms under analysis from ${\cal O}(N^2)$ to ${\cal O}(N\log N)$, where $N$ is the number of input pixels.
Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as the use of mobile phones while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.
comment: Accepted to MMLoSo 2025
FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting
Realistic scene appearance modeling has advanced rapidly with Gaussian Splatting, which enables real-time, high-quality rendering. Recent advances introduced per-primitive textures that incorporate spatial color variations within each Gaussian, improving their expressiveness. However, texture-based Gaussians parameterize appearance with a uniform per-Gaussian sampling grid, allocating equal sampling density regardless of local visual complexity. This leads to inefficient texture space utilization, where high-frequency regions are under-sampled and smooth regions waste capacity, causing blurred appearance and loss of fine structural detail. We introduce FACT-GS, a Frequency-Aligned Complexity-aware Texture Gaussian Splatting framework that allocates texture sampling density according to local visual frequency. Grounded in adaptive sampling theory, FACT-GS reformulates texture parameterization as a differentiable sampling-density allocation problem, replacing the uniform textures with a learnable frequency-aware allocation strategy implemented via a deformation field whose Jacobian modulates local sampling density. Built on 2D Gaussian Splatting, FACT-GS performs non-uniform sampling on fixed-resolution texture grids, preserving real-time performance while recovering sharper high-frequency details under the same parameter budget.
comment: 11 pages, 6 figures, preprint
Machine Learning for Scientific Visualization: Ensemble Data Analysis
Scientific simulations and experimental measurements produce vast amounts of spatio-temporal data, yet extracting meaningful insights remains challenging due to high dimensionality, complex structures, and missing information. Traditional analysis methods often struggle with these issues, motivating the need for more robust, data-driven approaches. This dissertation explores deep learning methodologies to improve the analysis and visualization of spatio-temporal scientific ensembles, focusing on dimensionality reduction, flow estimation, and temporal interpolation. First, we address high-dimensional data representation through autoencoder-based dimensionality reduction for scientific ensembles. We evaluate the stability of projection metrics under partial labeling and introduce a Pareto-efficient selection strategy to identify optimal autoencoder variants, ensuring expressive and reliable low-dimensional embeddings. Next, we present FLINT, a deep learning model for high-quality flow estimation and temporal interpolation in both flow-supervised and flow-unsupervised settings. FLINT reconstructs missing velocity fields and generates high-fidelity temporal interpolants for scalar fields across 2D+time and 3D+time ensembles without domain-specific assumptions or extensive finetuning. To further improve adaptability and generalization, we introduce HyperFLINT, a hypernetwork-based approach that conditions on simulation parameters to estimate flow fields and interpolate scalar data. This parameter-aware adaptation yields more accurate reconstructions across diverse scientific domains, even with sparse or incomplete data. Overall, this dissertation advances deep learning techniques for scientific visualization, providing scalable, adaptable, and high-quality solutions for interpreting complex spatio-temporal ensembles.
comment: PhD thesis, University of Groningen, 2025
Simultaneous Image Quality Improvement and Artefacts Correction in Accelerated MRI
MR data are acquired in the frequency domain, known as k-space. Acquiring high-quality and high-resolution MR images can be time-consuming, posing a significant challenge when multiple sequences providing complementary contrast information are needed or when the patient is unable to remain in the scanner for an extended period of time. Reducing k-space measurements is a strategy to speed up acquisition, but often leads to reduced quality in reconstructed images. Additionally, in real-world MRI, both under-sampled and fully sampled images are prone to artefacts, and correcting these artefacts is crucial for maintaining diagnostic accuracy. Deep learning methods have been proposed to restore image quality from under-sampled data, while others have focused on the correction of artefacts that result from noise or motion. However, no approach proposed so far addresses both acceleration and artefact correction, limiting the performance of these models when these degradation factors occur simultaneously. To address this gap, we present a method for recovering high-quality images from under-sampled data with simultaneous correction of noise and motion artefacts, called USArt (Under-Sampling and Artifact correction model). Customized for 2D brain anatomical images acquired with Cartesian sampling, USArt employs a dual sub-model approach. The results demonstrate a remarkable increase in signal-to-noise ratio (SNR) and contrast in the restored images. Various under-sampling strategies and degradation levels were explored, with the gradient under-sampling strategy yielding the best outcomes. We achieved up to 5x acceleration with simultaneous artefact correction and without significant degradation, showcasing the model's robustness in real-world settings.
Learning to Predict Aboveground Biomass from RGB Images with 3D Synthetic Scenes
Forests play a critical role in global ecosystems by supporting biodiversity and mitigating climate change via carbon sequestration. Accurate aboveground biomass (AGB) estimation is essential for assessing carbon storage and wildfire fuel loads, yet traditional methods rely on labor-intensive field measurements or remote sensing approaches with significant limitations in dense vegetation. In this work, we propose a novel learning-based method for estimating AGB from a single ground-based RGB image. We frame this as a dense prediction task, introducing AGB density maps, where each pixel represents tree biomass normalized by the plot area and each tree's image area. We leverage the recently introduced synthetic 3D SPREAD dataset, which provides realistic forest scenes with per-image tree attributes (height, trunk and canopy diameter) and instance segmentation masks. Using these assets, we compute AGB via allometric equations and train a model to predict AGB density maps, integrating them to recover the AGB estimate for the captured scene. Our approach achieves a median AGB estimation error of 1.22 kg/m^2 on held-out SPREAD data and 1.94 kg/m^2 on a real-image dataset. To our knowledge, this is the first method to estimate aboveground biomass directly from a single RGB image, opening up the possibility for a scalable, interpretable, and cost-effective solution for forest monitoring, while also enabling broader participation through citizen science initiatives.
comment: Presented at STAG 2025
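The AGB density map described in the abstract above can be illustrated with a minimal sketch. It assumes the normalization means "tree biomass divided by plot area and by the tree's pixel count", so that summing the map returns biomass per unit plot area. The function names, the toy mask, and the biomass values are hypothetical and are not taken from the paper or the SPREAD dataset.

```python
def agb_density_map(instance_mask, tree_agb_kg, plot_area_m2):
    """Build a per-pixel AGB density map from an instance mask.

    instance_mask: 2D list of ints, 0 = background, k > 0 = tree id.
    tree_agb_kg:   dict mapping tree id -> biomass in kg (hypothetical values).
    plot_area_m2:  plot area in square metres.
    """
    # Count how many pixels each tree occupies in the image.
    pixel_counts = {}
    for row in instance_mask:
        for tree_id in row:
            if tree_id:
                pixel_counts[tree_id] = pixel_counts.get(tree_id, 0) + 1

    # Each tree pixel carries the tree's biomass divided by the plot area
    # and by the tree's pixel count (assumed reading of the normalization).
    return [
        [
            tree_agb_kg[t] / (plot_area_m2 * pixel_counts[t]) if t else 0.0
            for t in row
        ]
        for row in instance_mask
    ]


def integrate_density(density_map):
    """Sum the map to recover the scene-level AGB in kg/m^2."""
    return sum(sum(row) for row in density_map)


# Toy scene: tree 1 covers four pixels, tree 2 covers three pixels.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 2, 2],
]
agb = {1: 120.0, 2: 60.0}  # kg per tree, made-up numbers
density = agb_density_map(mask, agb, plot_area_m2=100.0)
total = integrate_density(density)  # (120 + 60) kg over 100 m^2
```

Under this reading, the integral of the map does not depend on how many pixels a tree covers, which is what lets a dense per-pixel prediction be collapsed into a single scene-level estimate.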
Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods
Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. Additionally, GenAI methods incur significant time overhead for data generation with no apparent improvement in sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.
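To make the idea of perceptual hashing filtering concrete, here is a minimal sketch. The abstract does not say which hash variant or selection rule is used, so this assumes an average hash over small grayscale thumbnails and keeps a synthetic image only if its hash lies within a Hamming-distance threshold of at least one real reference image. All names, the threshold, and the pixel values are hypothetical.

```python
def average_hash(gray):
    """Average hash of a small grayscale image given as a 2D list of numbers."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    # One bit per pixel: 1 if brighter than the image mean, else 0.
    return tuple(1 if p > mean else 0 for p in pixels)


def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_by_hash(synthetic, real, max_distance):
    """Keep synthetic images whose hash is close to at least one real image."""
    real_hashes = [average_hash(img) for img in real]
    kept = []
    for img in synthetic:
        h = average_hash(img)
        if min(hamming(h, r) for r in real_hashes) <= max_distance:
            kept.append(img)
    return kept


# Toy 2x2 "thumbnails"; a real pipeline would downsample full images first.
real_refs = [[[10, 200], [10, 200]]]
similar = [[20, 180], [30, 220]]    # same dark-left / bright-right layout
different = [[200, 10], [200, 10]]  # mirrored layout
kept = filter_by_hash([similar, different], real_refs, max_distance=1)
```

A filter of this kind touches only pixel statistics, which is consistent with the abstract's point that feature-based methods are far cheaper than diffusion-based generation.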
Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
Language-guided 3D scene synthesis for fine-grained functionality understanding
Functionality understanding in 3D, which aims to identify the functional element in a 3D scene to complete an action (e.g., the correct handle to "Open the second drawer of the cabinet near the bed"), is hindered by the scarcity of real-world data due to the substantial effort needed for its collection and annotation. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given the action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotation, ensuring the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling the inexpensive and large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, thereby providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: github.com/tev-fbk/synthfun3d.
comment: Technical report. 24 pages, 19 figures, 2 tables
PointCNN++: Performant Convolution on Native Points
Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates natively on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly-optimized GPU kernel. Experiments demonstrate that PointCNN++ uses an order of magnitude less memory and is several times faster than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it significantly improves point cloud registration accuracies while proving both more memory-efficient and faster. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency. Our code will be open sourced.
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
DAONet-YOLOv8: An Occlusion-Aware Dual-Attention Network for Tea Leaf Pest and Disease Detection
Accurate detection of tea leaf pests and diseases in real plantations remains challenging due to complex backgrounds, variable illumination, and frequent occlusions among dense branches and leaves. Existing detectors often suffer from missed detections and false positives in such scenarios. To address these issues, we propose DAONet-YOLOv8, an enhanced YOLOv8 variant with three key improvements: (1) a Dual-Attention Fusion Module (DAFM) that combines convolutional local feature extraction with self-attention based global context modeling to focus on subtle lesion regions while suppressing background noise; (2) an occlusion-aware detection head (Detect-OAHead) that learns the relationship between visible and occluded parts to compensate for missing lesion features; and (3) a C2f-DSConv module employing dynamic synthesis convolutions with multiple kernel shapes to better capture irregular lesion boundaries. Experiments on our real-world tea plantation dataset containing six pest and disease categories demonstrate that DAONet-YOLOv8 achieves 92.97% precision, 92.80% recall, 97.10% mAP@50 and 76.90% mAP@50:95, outperforming the YOLOv8n baseline by 2.34, 4.68, 1.40 and 1.80 percentage points respectively, while reducing parameters by 16.7%. Comparative experiments further confirm that DAONet-YOLOv8 achieves superior performance over mainstream detection models.
Robust 3DGS-based SLAM via Adaptive Kernel Smoothing
In this paper, we challenge the conventional notion in 3DGS-SLAM that rendering quality is the primary determinant of tracking accuracy. We argue that, compared to solely pursuing a perfect scene representation, it is more critical to enhance the robustness of the rasterization process against parameter errors to ensure stable camera pose tracking. To address this challenge, we propose a novel approach that leverages a smooth kernel strategy to enhance the robustness of 3DGS-based SLAM. Unlike conventional methods that focus solely on minimizing rendering error, our core insight is to make the rasterization process more resilient to imperfections in the 3DGS parameters. We hypothesize that by allowing each Gaussian to influence a smoother, wider distribution of pixels during rendering, we can mitigate the detrimental effects of parameter noise from outlier Gaussians. This approach intentionally introduces a controlled blur to the rendered image, which acts as a regularization term, stabilizing the subsequent pose optimization. While a complete redesign of the rasterization pipeline is an ideal solution, we propose a practical and effective alternative that is readily integrated into existing 3DGS frameworks. Our method, termed Corrective Blurry KNN (CB-KNN), adaptively modifies the RGB values and locations of the K-nearest neighboring Gaussians within a local region. This dynamic adjustment generates a smoother local rendering, reducing the impact of erroneous GS parameters on the overall image. Experimental results demonstrate that our approach, while maintaining the overall quality of the scene reconstruction (mapping), significantly improves the robustness and accuracy of camera pose tracking.
Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day ICML 2025
Tabular instruction tuning has emerged as a promising research direction for improving LLMs' understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs' tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.
comment: Accepted at the International Conference on Machine Learning (ICML 2025), 1st Workshop on Foundation Models for Structured Data
Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation
Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate detection performance, achieving intersection-over-union (IoU) scores of up to 63.3% compared to ground-truth masks, even when using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
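The IoU score quoted in the abstract above compares a predicted defect mask with a ground-truth mask. A minimal sketch of that metric for binary masks follows; the masks are toy data, and returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks given as 2D lists."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


predicted = [
    [1, 1, 0],
    [0, 1, 0],
]
ground_truth = [
    [1, 0, 0],
    [0, 1, 1],
]
score = mask_iou(predicted, ground_truth)  # 2 shared pixels out of 4 covered
```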
Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings
Pathology foundation models (FMs) have driven significant progress in computational pathology. However, these high-performing models can easily exceed a billion parameters and produce high-dimensional embeddings, thus limiting their applicability for research or clinical use when computing resources are tight. Here, we introduce Pathryoshka, a multi-teacher distillation framework inspired by RADIO distillation and Matryoshka Representation Learning to reduce pathology FM sizes while allowing for adaptable embedding dimensions. We evaluate our framework with a distilled model on ten public pathology benchmarks with varying downstream tasks. Compared to its much larger teachers, Pathryoshka reduces the model size by 86-92% at on-par performance. It outperforms state-of-the-art single-teacher distillation models of comparable size by a median margin of 7.0 in accuracy. By enabling efficient local deployment without sacrificing accuracy or representational richness, Pathryoshka democratizes access to state-of-the-art pathology FMs for the broader research and clinical community.
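The "adaptable embedding dimensions" mentioned above follow the Matryoshka idea that a prefix of the full embedding is itself a usable embedding. A minimal sketch of consuming such nested embeddings follows. It assumes that truncation keeps the leading coordinates and that the prefix is re-normalized before cosine comparison; the abstract does not confirm the re-normalization step. The vectors are made-up values, not model outputs.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-d embeddings of two image tiles.
tile_a = [3.0, 4.0, 12.0, 0.5]
tile_b = [4.0, 3.0, 1.0, 0.0]

# Nested 2-d views of the same embeddings, e.g. for a memory-limited index.
small_a = truncate_embedding(tile_a, 2)
small_b = truncate_embedding(tile_b, 2)
similarity_2d = cosine_similarity(small_a, small_b)
```

The practical appeal is that one distilled model can serve several storage budgets: downstream code picks `dim` at query time instead of retraining a separate model per embedding size.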
Vision Bridge Transformer at Scale
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
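For readers unfamiliar with bridge models, the data-to-data trajectory can be illustrated with the standard Brownian bridge pinned at a source sample $x_0$ and a target sample $x_1$. The noise scale $\sigma$ and the unit time horizon are assumptions made for this illustration; the abstract does not give ViBT's exact parameterization or the form of its variance-stabilized objective.

```latex
% Brownian bridge between x_0 (input) and x_1 (output), for t in [0, 1]:
x_t = (1 - t)\, x_0 + t\, x_1 + \sigma \sqrt{t\,(1 - t)}\;\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I).
% Mean and variance of the intermediate state:
\mathbb{E}[x_t] = (1 - t)\, x_0 + t\, x_1,
\qquad \operatorname{Var}[x_t] = \sigma^2\, t\,(1 - t)\, I .
```

The variance vanishes at both endpoints, so the process starts exactly at the input and ends exactly at the output. This is the contrast with noise-to-data diffusion that the abstract draws.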
GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation
Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models and present our GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we propose to first generate consecutive video frames and then take advantage of the geometry model to provide full-frame geometry features, which contain richer information than single-frame depth maps or camera embeddings used in previous methods, and use these geometry features as geometrical conditions to aid the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss to provide the model with real-world geometric constraints and a geometry adaptation module to ensure the effective utilization of geometry features. Extensive experiments show that our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: https://peaes.github.io/GeoWorld/.
Obstruction reasoning for robotic grasping
Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: https://tev-fbk.github.io/UnoGrasp/.
Fast Multi-view Consistent 3D Editing with Video Priors
Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
PowerCLIP: Powerset Alignment for Contrastive Pre-Training CVPR 2026
Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.
comment: Submitted to CVPR 2026
REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection
With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models, then records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence Analysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, benchmarking a new state of the art for explainable image forensics.
Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models. Our code is available at https://github.com/JINSUBY/RA-RFT.
comment: 19 pages
Cascaded Robust Rectification for Arbitrary Document Images
Document rectification in real-world scenarios poses significant challenges due to extreme variations in camera perspectives and physical distortions. Driven by the insight that complex transformations can be decomposed and resolved progressively, we introduce a novel multi-stage framework that progressively reverses distinct distortion types in a coarse-to-fine manner. Specifically, our framework first performs a global affine transformation to correct perspective distortions arising from the camera's viewpoint, then rectifies geometric deformations resulting from physical paper curling and folding, and finally employs a content-aware iterative process to eliminate fine-grained content distortions. To address limitations in existing evaluation protocols, we also propose two enhanced metrics: layout-aligned OCR metrics (AED/ACER) for a stable assessment that decouples geometric rectification quality from the layout analysis errors of OCR engines, and masked AD/AAD (AD-M/AAD-M) tailored for accurately evaluating geometric distortions in documents with incomplete boundaries. Extensive experiments show that our method establishes new state-of-the-art performance on multiple challenging benchmarks, yielding a substantial reduction of 14.1% to 34.7% in the AAD metric and demonstrating superior efficacy in real-world applications. The code will be publicly available at https://github.com/chaoyunwang/ArbDR.
InstanceV: Instance-Level Video Generation
Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for more comprehensive evaluation of instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.
DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl\-page/
DNA-Prior: Unsupervised Denoise Anything via Dual-Domain Prior
Medical imaging pipelines critically rely on robust denoising to stabilise downstream tasks such as segmentation and reconstruction. However, many existing denoisers depend on large annotated datasets or supervised learning, which restricts their usability in clinical environments with heterogeneous modalities and limited ground-truth data. To address this limitation, we introduce DNA-Prior, a universal unsupervised denoising framework that reconstructs clean images directly from corrupted observations through a mathematically principled hybrid prior. DNA-Prior integrates (i) an implicit architectural prior, enforced through a deep network parameterisation, with (ii) an explicit spectral-spatial prior composed of a frequency-domain fidelity term and a spatial regularisation functional. This dual-domain formulation yields a well-structured optimisation problem that jointly preserves global frequency characteristics and local anatomical structure, without requiring any external training data or modality-specific tuning. Experiments across multiple modalities show that DNA-Prior achieves consistent noise suppression and structural preservation under diverse noise conditions.
Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning
Image emotion classification (IEC) is a longstanding research field that has received increasing attention with the rapid progress of deep learning. Although recent advances have leveraged the knowledge encoded in pre-trained visual models, their effectiveness is constrained by the "affective gap", which limits the applicability of pre-training knowledge for IEC tasks. It has been demonstrated in psychology that language exhibits high variability, encompasses diverse and abundant information, and can effectively eliminate the "affective gap". Inspired by this, we propose a novel Affective Captioning for Image Emotion Classification (ACIEC) method that classifies image emotion based purely on texts that effectively capture the affective information in the image. In our method, a hierarchical multi-level contrastive loss is designed for detecting emotional concepts from images, while an emotional attribute chain-of-thought reasoning is proposed to generate affective sentences. Then, a pre-trained language model is leveraged to synthesize emotional concepts and affective sentences to conduct IEC. Additionally, a contrastive loss based on semantic similarity sampling is designed to solve the problem of large intra-class differences and small inter-class differences in affective datasets. Moreover, we also take images with embedded texts into consideration, which were ignored by previous studies. Extensive experiments illustrate that our method can effectively bridge the affective gap and achieve superior results on multiple benchmarks.
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at https://github.com/thu-nics/db-SP.
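To make the head-level imbalance concrete, the sketch below compares a sparsity-unaware contiguous split of attention heads with a greedy cost-aware split. The abstract does not give the definition of the sparse imbalance ratio or the partitioning algorithm, so everything here is an assumption: the ratio is taken as maximum device load over mean device load, the per-head costs stand for hypothetical dense-block counts, and the balancing rule is the standard longest-processing-time heuristic rather than db-SP's own method.

```python
import heapq


def imbalance_ratio(loads):
    """Max device load over mean device load; 1.0 means perfectly balanced (assumed definition)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def contiguous_partition(head_costs, num_devices):
    """Baseline: equal-sized contiguous groups of heads, ignoring sparsity.

    Assumes len(head_costs) is divisible by num_devices.
    """
    per_device = len(head_costs) // num_devices
    return [sum(head_costs[d * per_device:(d + 1) * per_device])
            for d in range(num_devices)]


def greedy_partition(head_costs, num_devices):
    """Assign the heaviest remaining head to the currently lightest device."""
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    loads = [0] * num_devices
    for cost in sorted(head_costs, reverse=True):
        load, device = heapq.heappop(heap)
        loads[device] = load + cost
        heapq.heappush(heap, (loads[device], device))
    return loads


# Hypothetical dense-block counts per attention head.
head_costs = [40, 38, 35, 30, 8, 6, 5, 2]
naive = contiguous_partition(head_costs, 2)
balanced = greedy_partition(head_costs, 2)
print(naive, round(imbalance_ratio(naive), 3))
print(balanced, round(imbalance_ratio(balanced), 3))
```

With these toy costs the contiguous split puts all dense heads on one device, while the greedy split ends up close to a ratio of 1. A real system would also have to account for communication cost and for the block-level imbalance the abstract mentions.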
MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?
Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors. To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants -- original, hand-drawn, photo-captured -- and a text-only condition for controlled comparison. Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.
comment: 32 pages, 15 figures, 9 tables, includes appendix. Project page: https://cnu-bot-group.github.io/MathSight/
NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing
Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.
comment: 13 pages, 10 figures
Implementation of a Skin Lesion Detection System for Managing Children with Atopic Dermatitis Based on Ensemble Learning
The amendments made to the Data 3 Act and the impact of COVID-19 have fostered the growth of the digital healthcare market and promoted the use of medical data in artificial intelligence in South Korea. Atopic dermatitis, a chronic inflammatory skin disease, is diagnosed via subjective evaluations without using objective diagnostic methods, thereby increasing the risk of misdiagnosis. It is also similar to psoriasis in appearance, further complicating its accurate diagnosis. Existing studies on skin diseases have used high-quality dermoscopic image datasets, but such high-quality images cannot be obtained in actual clinical settings. Moreover, existing systems must ensure accuracy and fast response times. To this end, an ensemble learning-based skin lesion detection system (ENSEL) was proposed herein. ENSEL enhanced diagnostic accuracy by integrating various deep learning models via an ensemble approach. Its performance was verified by conducting skin lesion detection experiments using images of skin lesions taken by actual users. Its accuracy and response time were measured using randomly sampled skin disease images. Results revealed that ENSEL achieved high recall in most images and a processing time of less than 1 s. This study contributes to the objective diagnosis of skin lesions and promotes the advancement of digital healthcare.
comment: 16 pages, 14 figures, 7 tables
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.
comment: Under Peer Review
Buffer replay enhances the robustness of multimodal learning under missing-modality
Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation
Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
comment: 8 pages, 4 figures
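For readers less familiar with the two metrics reported above, here is a small sketch of how MAE and ROC AUC are computed. It uses the standard definitions, with AUC written as the probability that a randomly chosen positive is scored above a randomly chosen negative. The sample values are made up for illustration and are unrelated to the study's data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; tied scores count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: bone ages in months and binary class scores.
print(mean_absolute_error([120, 96, 150], [126, 90, 150]))
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The pairwise form is quadratic in the number of samples, which is fine for a few hundred radiographs. A rank-based formulation gives the same value in O(n log n) for larger sets.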
Image Valuation in NeRF-based 3D reconstruction
Data valuation and monetization are becoming increasingly important across domains such as eXtended Reality (XR) and digital media. In the context of 3D scene reconstruction from a set of images -- whether casually or professionally captured -- not all inputs contribute equally to the final output. Neural Radiance Fields (NeRFs) enable photorealistic 3D reconstruction of scenes by optimizing a volumetric radiance field given a set of images. However, in-the-wild scenes often include image captures of varying quality, occlusions, and transient objects, resulting in uneven utility across inputs. In this paper we propose a method to quantify the individual contribution of each image to NeRF-based reconstructions of in-the-wild image sets. Contribution is assessed through reconstruction quality metrics based on PSNR and MSE. We validate our approach by removing low-contributing images during training and measuring the resulting impact on reconstruction fidelity.
comment: Published In International Conference on Computer Analysis of Images and Patterns (pp. 375-385). Cham: Springer Nature Switzerland
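The PSNR and MSE metrics named above are related by a fixed formula, and a per-image contribution score can be built on top of them. The sketch below shows that relation together with one possible scoring rule. The leave-one-out rule, the function names, and the numbers are assumptions for illustration, since the abstract does not state how contributions are computed.

```python
import math


def mse(image_a, image_b):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)


def psnr(mse_value, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if mse_value == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse_value)


def leave_one_out_contributions(psnr_full, psnr_without):
    """Score each image by the PSNR lost when it is removed from training (assumed rule).

    psnr_full: test PSNR when training on every image.
    psnr_without: mapping from image name to test PSNR with that image left out.
    """
    return {name: psnr_full - value for name, value in psnr_without.items()}


# Hypothetical PSNR values in dB; a negative score means removal helped.
scores = leave_one_out_contributions(30.0, {"img_a": 29.0, "img_b": 30.5})
print(scores)
```

Under this rule an image with a negative score, such as a blurry or occluded capture, is a candidate for removal. Leave-one-out requires one retraining per image, so cheaper approximations are likely needed at scale.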
GOATex: Geometry & Occlusion-Aware Texturing NeurIPS 2025
We present GOATex, a diffusion-based method for 3D mesh texturing that generates high-quality textures for both exterior and interior surfaces. While existing methods perform well on visible regions, they inherently lack mechanisms to handle occluded interiors, resulting in incomplete textures and visible seams. To address this, we introduce an occlusion-aware texturing framework based on the concept of hit levels, which quantify the relative depth of mesh faces via multi-view ray casting. This allows us to partition mesh faces into ordered visibility layers, from outermost to innermost. We then apply a two-stage visibility control strategy that progressively reveals interior regions with structural coherence, followed by texturing each layer using a pretrained diffusion model. To seamlessly merge textures obtained across layers, we propose a soft UV-space blending technique that weighs each texture's contribution based on view-dependent visibility confidence. Empirical results demonstrate that GOATex consistently outperforms existing methods, producing seamless, high-fidelity textures across both visible and occluded surfaces. Unlike prior works, GOATex operates entirely without costly fine-tuning of a pretrained diffusion model and allows separate prompting for exterior and interior mesh regions, enabling fine-grained control over layered appearances. For more qualitative results, please visit our project page: https://goatex3d.github.io/.
comment: Accepted to NeurIPS 2025; Project Page: https://goatex3d.github.io/
Geometry-Consistent 4D Gaussian Splatting for Sparse-Input Dynamic View Synthesis
Gaussian Splatting has been considered as a novel way for view synthesis of dynamic scenes, which shows great potential in AIoT applications such as digital twins. However, recent dynamic Gaussian Splatting methods significantly degrade when only sparse input views are available, limiting their applicability in practice. The issue arises from the incoherent learning of 4D geometry as input views decrease. This paper presents GC-4DGS, a novel framework that infuses geometric consistency into 4D Gaussian Splatting (4DGS), offering real-time and high-quality dynamic scene rendering from sparse input views. While learning-based Multi-View Stereo (MVS) and monocular depth estimators (MDEs) provide geometry priors, directly integrating these with 4DGS yields suboptimal results due to the ill-posed nature of sparse-input 4D geometric optimization. To address these problems, we introduce a dynamic consistency checking strategy to reduce estimation uncertainties of MVS across spacetime. Furthermore, we propose a global-local depth regularization approach to distill spatiotemporal-consistent geometric information from monocular depths, thereby enhancing the coherent geometry and appearance learning within the 4D volume. Extensive experiments on the popular N3DV and Technicolor datasets validate the effectiveness of GC-4DGS in rendering quality without sacrificing efficiency. Notably, our method outperforms RF-DeRF, the latest dynamic radiance field tailored for sparse-input dynamic view synthesis, and the original 4DGS by 2.62dB and 1.58dB in PSNR, respectively, with seamless deployability on resource-constrained IoT edge devices.
From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to "get the right answer for the right visual reason". Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
comment: 19 pages, 15 figures
DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
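The out-of-core idea described above (partition the scene into spatial chunks, keep active ones in fast memory, spill the rest to disk) can be sketched with a small least-recently-used store. This is an illustration of the general pattern, not DiskChunGS's implementation: the class name, the uniform grid, the LRU policy, and the pickle files are all assumptions, and ordinary Python lists stand in for GPU-resident Gaussians.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class ChunkStore:
    """Keeps at most `max_active` chunks in memory; older chunks are written to disk."""

    def __init__(self, directory, max_active, chunk_size=10.0):
        self.directory = directory
        self.max_active = max_active
        self.chunk_size = chunk_size
        self.active = OrderedDict()  # chunk key -> list of points, oldest first

    def key_for(self, position):
        """Map a 3D position to the integer grid cell that contains it."""
        return tuple(int(c // self.chunk_size) for c in position)

    def _path(self, key):
        return os.path.join(self.directory, "chunk_{}_{}_{}.pkl".format(*key))

    def get(self, key):
        """Return the chunk for `key`, loading it from disk or creating it if needed."""
        if key in self.active:
            self.active.move_to_end(key)  # mark as most recently used
            return self.active[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as handle:
                chunk = pickle.load(handle)
        else:
            chunk = []
        self.active[key] = chunk
        self._evict()
        return chunk

    def _evict(self):
        """Write least recently used chunks to disk until the memory budget is met."""
        while len(self.active) > self.max_active:
            old_key, old_chunk = self.active.popitem(last=False)
            with open(self._path(old_key), "wb") as handle:
                pickle.dump(old_chunk, handle)

    def add(self, position):
        self.get(self.key_for(position)).append(position)


with tempfile.TemporaryDirectory() as workdir:
    store = ChunkStore(workdir, max_active=1)
    store.add((1.0, 2.0, 3.0))    # lands in chunk (0, 0, 0)
    store.add((25.0, 0.0, 0.0))   # lands in chunk (2, 0, 0); chunk (0, 0, 0) spills to disk
    print(list(store.active))
    print(store.get((0, 0, 0)))   # reloaded from disk
```

A SLAM system would pick active chunks from the camera frustum rather than from access recency alone, and would need to keep chunk boundaries consistent after loop closure. Both are outside this sketch.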
Geodiffussr: Generative Terrain Texturing with Elevation Fidelity
Large-scale terrain generation remains a labor-intensive task in computer graphics. We introduce Geodiffussr, a flow-matching pipeline that synthesizes text-guided texture maps while strictly adhering to a supplied Digital Elevation Map (DEM). The core mechanism is multi-scale content aggregation (MCA): DEM features from a pretrained encoder are injected into UNet blocks at multiple resolutions to enforce global-to-local elevation consistency. Compared with a non-MCA baseline, MCA markedly improves visual fidelity and strengthens height-appearance coupling (FID $\downarrow$ 49.16%, LPIPS $\downarrow$ 32.33%, $Δ$dCor $\downarrow$ to 0.0016). To train and evaluate Geodiffussr, we assemble a globally distributed, biome- and climate-stratified corpus of triplets pairing SRTM-derived DEMs with Sentinel-2 imagery and vision-grounded natural-language captions that describe visible land cover. We position Geodiffussr as a strong baseline and step toward controllable 2.5D landscape generation for coarse-scale ideation and previz, complementary to physically based terrain and ecosystem simulators.
JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination: text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking: dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity.
comment: 31 pages, 18 figures
MrGS: Multi-modal Radiance Fields with 3D Gaussian Splatting for RGB-Thermal Novel View Synthesis ICRA 2025
Recent advances in Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved considerable performance in RGB scene reconstruction. However, multi-modal rendering that incorporates thermal infrared imagery remains largely underexplored. Existing approaches tend to neglect distinctive thermal characteristics, such as heat conduction and the Lambertian property. In this study, we introduce MrGS, a multi-modal radiance field based on 3DGS that simultaneously reconstructs both RGB and thermal 3D scenes. Specifically, MrGS derives RGB- and thermal-related information from a single appearance feature through orthogonal feature extraction and employs view-dependent or view-independent embedding strategies depending on the degree of Lambertian reflectance exhibited by each modality. Furthermore, we leverage two physics-based principles to effectively model thermal-domain phenomena. First, we integrate Fourier's law of heat conduction prior to alpha blending to model intensity interpolation caused by thermal conduction between neighboring Gaussians. Second, we apply the Stefan-Boltzmann law and the inverse-square law to formulate a depth-aware thermal radiation map that imposes additional geometric constraints on thermal rendering. Experimental results demonstrate that the proposed MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians.
comment: Accepted at Thermal Infrared in Robotics (TIRO) Workshop, ICRA 2025 (Best Poster Award)
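The two physical laws named in the MrGS abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulation, so the function names, the emissivity parameter, and the way the two laws are combined below are assumptions made purely for illustration.

```python
import numpy as np

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant in W m^-2 K^-4

def radiant_exitance(temperature_k, emissivity=1.0):
    """Stefan-Boltzmann law: power radiated per unit area of a grey body."""
    return emissivity * SIGMA * np.asarray(temperature_k, dtype=float) ** 4

def depth_aware_radiation(temperature_k, depth_m, emissivity=1.0):
    """Hypothetical per-pixel map: exitance attenuated by inverse-square falloff.

    This combination is an assumption, not the formulation used in the paper.
    """
    depth = np.asarray(depth_m, dtype=float)
    return radiant_exitance(temperature_k, emissivity) / depth ** 2
```

Under these assumptions, doubling the depth of a surface quarters its contribution, while doubling its absolute temperature multiplies it by sixteen, which is the kind of coupling between geometry and thermal intensity the abstract describes.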
Optimizer Sensitivity in Vision Transformer-based Iris Recognition: AdamW vs SGD vs RMSprop
The security of biometric authentication is increasingly critical as digital identity systems expand. Iris recognition offers high reliability due to its distinctive and stable texture patterns. Recent progress in deep learning, especially Vision Transformers (ViT), has improved visual recognition performance. Yet, the effect of optimizer choice on ViT-based biometric systems remains understudied. This work evaluates how different optimizers influence the accuracy and stability of ViT for iris recognition, providing insights to enhance the robustness of biometric identification models.
comment: 16 pages, 5 figures
Guiding Visual Autoregressive Models through Spectrum Weakening
Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of the spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensure numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
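The idea that an invertible spectral transform preserves information while keeping only part of the spectrum removes some of it can be sketched as follows. The abstract does not specify the transform, the selection rule, or the two renormalization strategies, so the real FFT along the channel axis, the low-frequency selection, and the norm-matching step are all assumptions chosen for illustration.

```python
import numpy as np

def weaken_spectrum(features, keep, renormalize=True):
    """Keep only the first `keep` spectral coefficients along the channel axis.

    features: array of shape (..., C). The choice of a real FFT and of
    low-frequency selection is an assumption, not the paper's method.
    """
    channels = features.shape[-1]
    spec = np.fft.rfft(features, axis=-1)          # invertible transform
    mask = np.zeros(spec.shape[-1], dtype=bool)
    mask[:keep] = True                             # retained subset of the spectrum
    weak = np.fft.irfft(spec * mask, n=channels, axis=-1)
    if renormalize:
        # Hypothetical renormalization: restore the original per-token norm.
        orig_norm = np.linalg.norm(features, axis=-1, keepdims=True)
        weak_norm = np.linalg.norm(weak, axis=-1, keepdims=True)
        weak = weak * orig_norm / np.maximum(weak_norm, 1e-12)
    return weak
```

With `keep` equal to `C // 2 + 1` (all coefficients of the real FFT), the input is recovered exactly; smaller values give a progressively weaker representation that could serve as the weak branch in a guidance scheme.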
MIMM-X: Disentangling Spurious Correlations for Medical Image Analysis
Deep learning models can excel on medical tasks, yet often experience spurious correlations, known as shortcut learning, leading to poor generalization in new environments. Particularly in medical imaging, where multiple spurious correlations can coexist, misclassifications can have severe consequences. We propose MIMM-X, a framework that disentangles causal features from multiple spurious correlations by minimizing their mutual information. It enables predictions based on true underlying causal relationships rather than dataset-specific shortcuts. We evaluate MIMM-X on three datasets (UK Biobank, NAKO, CheXpert) across two imaging modalities (MRI and X-ray). Results demonstrate that MIMM-X effectively mitigates shortcut learning of multiple spurious correlations.
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, the existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances or pointing out model weaknesses under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performances, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana.
comment: Code: https://github.com/matsuolab/multibanana
Convolutional Feature Noise Reduction for 2D Cardiac MR Image Segmentation
Noise reduction is a crucial operation in digital signal processing. However, it is frequently neglected when processing convolutional features in segmentation networks. This oversight could trigger a butterfly effect, impairing subsequent outcomes throughout the entire feature system. To fill this void, we consider convolutional features following Gaussian distributions as feature signal matrices and then present a simple and effective feature filter in this study. The proposed filter is fundamentally a low-amplitude pass filter primarily aimed at minimizing noise in feature signal inputs and is named Convolutional Feature Filter (CFF). We conducted experiments on two established 2D segmentation networks and two public cardiac MR image datasets to validate the effectiveness of the CFF, and the experimental findings demonstrated a decrease in noise within the feature signal matrices. To enable a numerical observation and analysis of this reduction, we developed a binarization equation to calculate the information entropy of feature signals.
Ovis-Image Technical Report
We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
comment: Code is released at https://github.com/AIDC-AI/Ovis-Image
McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning
Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or utilize proxy metrics to predict preference, which lack an understanding of human preference logic. Moreover, they usually directly align T2V models with the overall preference distribution, ignoring potentially conflicting dimensions like motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. Firstly, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Secondly, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting the alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high motion dynamics.
BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
Taming the Light: Illumination-Invariant Semantic 3DGS-SLAM
Extreme exposure degrades both the 3D map reconstruction and semantic segmentation accuracy, which is particularly detrimental to tightly-coupled systems. To achieve illumination invariance, we propose a novel semantic SLAM framework with two designs. First, the Intrinsic Appearance Normalization (IAN) module proactively disentangles the scene's intrinsic properties, such as albedo, from transient lighting. By learning a standardized, illumination-invariant appearance model, it assigns a stable and consistent color representation to each Gaussian primitive. Second, the Dynamic Radiance Balancing Loss (DRB-Loss) reactively handles frames with extreme exposure. It activates only when an image's exposure is poor, operating directly on the radiance field to guide targeted optimization. This prevents error accumulation from extreme lighting without compromising performance under normal conditions. The synergy between IAN's proactive invariance and DRB-Loss's reactive correction endows our system with unprecedented robustness. Evaluations on public datasets demonstrate state-of-the-art performance in camera tracking, map quality, and semantic and geometric accuracy.
HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records
Deep learning has revolutionized solar image analysis, yet most approaches train task-specific encoders from scratch or rely on natural-image pretraining that ignores the unique characteristics of Solar Dynamics Observatory (SDO) data. We introduce SolarCHIP, a family of contrastively pretrained visual backbones tailored to multi-instrument SDO observations. SolarCHIP addresses three key challenges in solar imaging: multimodal sensing across AIA and HMI instruments, weak inter-class separability due to slow temporal evolution, and strong intra-class variability with sparse activity signals. Our pretraining framework employs a multi-granularity contrastive objective that jointly aligns (1) global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination, (2) local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features, and (3) intra-sample patches across different spatial locations to preserve fine-grained spatial structure. We train both CNN- and Vision Transformer-based autoencoders and demonstrate their effectiveness on two downstream tasks: cross-modal translation between HMI and AIA passbands via ControlNet, and full-disk flare classification. Experimental results show that SolarCHIP achieves state-of-the-art performance across both tasks, with particularly strong gains in low-resource settings where labeled data is limited. Ablation studies confirm that each contrastive component contributes essential discriminative capacity at different granularities. By publicly releasing pretrained weights and training code, we provide the heliophysics community with a practical, plug-and-play feature extractor that reduces computational requirements, improves label efficiency, and establishes a reusable foundation for diverse solar imaging applications.
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.
comment: Project page: https://github.com/showlab/RobotSeg
Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation AAAI 2026
Domain generalization in semantic segmentation faces challenges from domain shifts, particularly under adverse conditions. While diffusion-based data generation methods show promise, they introduce inherent misalignment between generated images and semantic masks. This paper presents FLEX-Seg (FLexible Edge eXploitation for Segmentation), a framework that transforms this limitation into an opportunity for robust learning. FLEX-Seg comprises three key components: (1) Granular Adaptive Prototypes that capture boundary characteristics across multiple scales, (2) Uncertainty Boundary Emphasis that dynamically adjusts learning emphasis based on prediction entropy, and (3) Hardness-Aware Sampling that progressively focuses on challenging examples. By leveraging inherent misalignment rather than enforcing strict alignment, FLEX-Seg learns robust representations while capturing rich stylistic variations. Experiments across five real-world datasets demonstrate consistent improvements over state-of-the-art methods, achieving 2.44% and 2.63% mIoU gains on ACDC and Dark Zurich. Our findings validate that adaptive strategies for handling imperfect synthetic data lead to superior domain generalization. Code is available at https://github.com/VisualScienceLab-KHU/FLEX-Seg.
comment: Accepted by AAAI 2026
Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework ICASSP 2026
We study idiom-based visual puns--images that align an idiom's literal and figurative meanings--and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.
comment: Submitted to ICASSP 2026 (under review)
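The four-step loop described in the visual pun abstract (generate a prompt, synthesize an image, infer the idiom, refine until recognition succeeds or a step limit is reached) can be sketched as below. The three callables stand in for the LLM, the T2IM, and the MLLM; their signatures, the exact-match success test, and the use of the last guess as feedback are hypothetical choices, not details from the paper.

```python
from typing import Any, Callable, Optional, Tuple

def generate_visual_pun(
    idiom: str,
    make_prompt: Callable[[str, Optional[str]], str],  # stand-in for the LLM
    synthesize: Callable[[str], Any],                   # stand-in for the T2IM
    infer_idiom: Callable[[Any], str],                  # stand-in for the MLLM
    max_steps: int = 5,
) -> Tuple[str, Any, int, bool]:
    """Return (prompt, image, steps_used, recognized)."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    feedback: Optional[str] = None
    for step in range(1, max_steps + 1):
        prompt = make_prompt(idiom, feedback)   # (i) generate or (iv) refine
        image = synthesize(prompt)              # (ii) synthesize an image
        guess = infer_idiom(image)              # (iii) infer the idiom
        if guess.strip().lower() == idiom.strip().lower():
            return prompt, image, step, True
        feedback = guess                        # wrong guess informs the next prompt
    return prompt, image, max_steps, False
```

Passing the models in as plain callables keeps the loop independent of any particular API, which also makes it easy to swap LLMs and MLLMs when benchmarking combinations as the abstract describes.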
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available at https://github.com/ssj9596/One-to-All-Animation.
comment: Project Page: https://ssj9596.github.io/one-to-all-animation-project/
DenoiseGS: Gaussian Reconstruction Model for Burst Denoising
Burst denoising methods are crucial for enhancing images captured on handheld devices, but they often struggle with large motion or suffer from prohibitive computational costs. In this paper, we propose DenoiseGS, the first framework to leverage the efficiency of 3D Gaussian Splatting for burst denoising. Our approach addresses two key challenges when applying a feedforward Gaussian reconstruction model to noisy inputs: the degradation of Gaussian point clouds and the loss of fine details. To this end, we propose a Gaussian self-consistency (GSC) loss, which regularizes the geometry predicted from noisy inputs with high-quality Gaussian point clouds. These point clouds are generated from clean inputs by the same model that we are training, thereby alleviating potential bias or domain gaps. Additionally, we introduce a log-weighted frequency (LWF) loss to strengthen supervision within the spectral domain, effectively preserving fine-grained details. The LWF loss adaptively weights frequency discrepancies in a logarithmic manner, emphasizing challenging high-frequency details. Extensive experiments demonstrate that DenoiseGS significantly exceeds the state-of-the-art NeRF-based methods on both burst denoising and novel view synthesis under noisy conditions, while achieving 250× faster inference speed. Code and models are released at https://github.com/yscheng04/DenoiseGS.
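One plausible reading of a loss that weights frequency discrepancies logarithmically is sketched below. The abstract gives no formula, so the 2D FFT, the `log1p` weighting, and the mean reduction are assumptions, and the function name is hypothetical; the released code should be consulted for the actual LWF definition.

```python
import numpy as np

def log_weighted_frequency_loss(pred, target):
    """Mean spectral discrepancy, weighted by log(1 + discrepancy).

    pred, target: real-valued 2D arrays of the same shape.
    The weighting scheme is an illustrative assumption.
    """
    pred_spec = np.fft.fft2(pred, norm="ortho")
    target_spec = np.fft.fft2(target, norm="ortho")
    diff = np.abs(pred_spec - target_spec)   # per-frequency discrepancy
    weight = np.log1p(diff)                  # grows with error, but only logarithmically
    return float(np.mean(weight * diff))
```

In a training framework the weight would typically be treated as a constant with respect to gradients, so that it only rescales the per-frequency error rather than contributing its own gradient term.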
Barcode and QR Code Object Detection: An Experimental Study on YOLOv8 Models
This work presents an in-depth evaluation of the YOLOv8 (You Only Look Once) algorithm's efficiency in object detection, focusing specifically on barcode and QR code recognition. Leveraging the real-time detection capabilities of YOLOv8, we conducted a study aimed at improving its ability to identify objects quickly and accurately. Through extensive training and fine-tuning on Kaggle datasets tailored for barcode and QR code detection, our goal was to optimize YOLOv8's overall performance across a range of situations and environments. The study assesses three YOLOv8 model variants, Nano, Small, and Medium, with close attention to precision, recall, and F1 metrics. The results show substantial improvements in object detection accuracy with each successive model refinement. Specifically, we achieved an accuracy of 88.95% for the Nano model, 97.10% for the Small model, and 94.10% for the Medium model, showcasing the incremental improvements achieved through model scaling. Our findings highlight the strides made by YOLOv8 in advancing computer vision and its role as a milestone in the field of object detection. This study sheds light on how model scaling affects object recognition, advancing the understanding of deep learning-based computer vision techniques.
comment: 7 Pages, 16 figures, Presented at 2024 International Conference on Emerging Innovations and Advanced Computing (INNOCOMP) Conference
Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling
The rapid growth of Artificial Intelligence-Generated Content (AIGC) raises concerns about the authenticity of digital media. In this context, image self-recovery, reconstructing original content from its manipulated version, offers a practical solution for understanding the attacker's intent and restoring trustworthy data. However, existing methods often fail to accurately recover tampered regions, falling short of the primary goal of self-recovery. To address this challenge, we propose ReImage, a neural watermarking-based self-recovery framework that embeds a shuffled version of the target image into itself as a watermark. We design a generator that produces watermarks optimized for neural watermarking and introduce an image enhancement module to refine the recovered image. We further analyze and resolve key limitations of shuffled watermarking, enabling its effective use in self-recovery. We demonstrate that ReImage achieves state-of-the-art performance across diverse tampering scenarios, consistently producing high-quality recovered images. The code and pretrained models will be released upon publication.
comment: 22 pages, 12 figures, 14 tables
NeuMatC: A General Neural Framework for Fast Parametric Matrix Operation
Matrix operations (e.g., inversion and singular value decomposition (SVD)) are fundamental in science and engineering. In many emerging real-world applications (such as wireless communication and signal processing), these operations must be performed repeatedly over matrices with parameters varying continuously. However, conventional methods tackle each matrix operation independently, underexploring the inherent low-rankness and continuity along the parameter dimension, resulting in significantly redundant computation. To address this challenge, we propose the Neural Matrix Computation Framework (NeuMatC), which elegantly tackles general parametric matrix operation tasks by leveraging the underlying low-rankness and continuity along the parameter dimension. Specifically, NeuMatC learns, without supervision, a low-rank and continuous mapping from parameters to their corresponding matrix operation results. Once trained, NeuMatC enables efficient computations at arbitrary parameters using only a few basic operations (e.g., matrix multiplications and nonlinear activations), significantly reducing redundant computations. Experimental results on both synthetic and real-world datasets demonstrate the promising performance of NeuMatC, exemplified by over 3× speedup in parametric inversion and 10× speedup in parametric SVD compared to the widely used NumPy baseline in wireless communication, while maintaining acceptable accuracy.
Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols AACL
Emotions are a fundamental aspect of artistic expression. Due to their abstract nature, there is a broad spectrum of emotion realization in artworks. These are subject to historical change and their analysis requires expertise in art history. In this article, we investigate which aspects of emotional expression can be detected by current (2025) vision language models (VLMs). We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask these models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well and often also which emotions they depict and how they are expressed. The models perform best for concrete images but fail for highly abstract or highly symbolic images. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of providing inconsistent answers to related questions.
comment: Accepted for publication at the IJCNLP-AACL workshop on Multimodal Models for Low-Resource Contexts and Social Impact
MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT Images
Orthopantomograms (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution for this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, which includes 2,380 OPG images and 330 CBCT scans, all featuring detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge confirms the substantial benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundational models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants' submitted code have been made publicly available on GitHub (https://github.com/ricoleehduu/STS-Challenge-2024), ensuring transparency and reproducibility.
ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance WACV 2026
Point cloud registration is a fundamental task in 3D vision. Most existing methods only use geometric information for registration. Recently proposed RGB-D registration methods primarily focus on feature fusion or improving feature learning, which limits their ability to exploit image information and hinders their practical applicability. In this paper, we propose ViGG, a robust RGB-D registration method using mutual guidance. First, we solve clique alignment in a visual-geometric combination form, employing a geometric guidance design to suppress ambiguous cliques. Second, to mitigate accuracy degradation caused by noise in visual matches, we propose a visual-guided geometric matching method that utilizes visual priors to determine the search space, enabling the extraction of high-quality, noise-insensitive correspondences. This mutual guidance strategy brings our method superior robustness, making it applicable for various RGB-D registration tasks. The experiments on 3DMatch, ScanNet and KITTI datasets show that our method outperforms recent state-of-the-art methods in both learning-free and learning-based settings. Code is available at https://github.com/ccjccjccj/ViGG.
comment: Accepted by WACV 2026
See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection
Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in video clips. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: https://github.com/VisualAIKHU/SRF.
Leveraging Textual Compositional Reasoning for Robust Change Captioning AAAI 2026
Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image-text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.
comment: Accepted at AAAI 2026
From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts
Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at https://vranlee.github.io/P2C/.
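For readers unfamiliar with the metric, the harmonic mean reported on base-to-novel benchmarks combines base-class and novel-class accuracy so that a model cannot score well by excelling on only one of the two. The sketch below illustrates this; the function name and the accuracy values are hypothetical placeholders, not results from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies with the same arithmetic mean (80), chosen only
# to show that the harmonic mean penalizes an imbalance between the two.
balanced = harmonic_mean(80.0, 80.0)
lopsided = harmonic_mean(95.0, 65.0)
print(f"balanced: {balanced:.2f}, lopsided: {lopsided:.2f}")
```

The lopsided pair scores lower than the balanced pair even though both average to 80, which is why the metric is commonly used to summarize the base/novel trade-off.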
DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking
Multi-object tracking (MOT) is a fundamental task in computer vision with critical applications in autonomous driving and robotics. Multimodal MOT that integrates visible light and thermal infrared information is particularly essential for robust autonomous driving systems. However, effectively fusing these heterogeneous modalities is challenging. Simple strategies like concatenation or addition often fail to bridge the significant non-linear distribution gap between their feature representations, which can lead to modality conflicts and degrade tracking accuracy. Drawing inspiration from the connection between multimodal MOT and the iterative refinement in diffusion models, this paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process to generate accurate and temporally coherent object trajectories. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. In this process, features from both modalities provide mutual guidance, iteratively projecting them onto a shared, consistent feature manifold. This enables the learning of complementary information and achieves deeper fusion compared to conventional methods. Additionally, we introduce a plug-and-play Diffusion Refiner (DR) to enhance and refine the unified feature representation. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation. DM$^3$T unifies object detection, state estimation, and data association into a comprehensive online tracking framework without complex post-processing. Extensive experiments on the VT-MOT benchmark demonstrate that our method achieves 41.7 HOTA, representing a 1.54% relative improvement over existing state-of-the-art methods. The code and models are available at https://vranlee.github.io/DM-3-T/.
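To make the reported gain concrete, the following sketch back-computes the baseline score implied by the two numbers in the abstract. It assumes the usual definition of relative improvement, (new - old) / old; the abstract does not state the definition, and the helper name is hypothetical.

```python
def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Baseline implied by a score and its relative gain, assuming gain = (new - old) / old."""
    return score / (1.0 + relative_gain_pct / 100.0)


hota = 41.7            # HOTA reported in the abstract
relative_gain = 1.54   # relative improvement in percent, reported in the abstract
baseline = implied_baseline(hota, relative_gain)
absolute_gain = hota - baseline
print(f"implied baseline: {baseline:.2f} HOTA, absolute gain: {absolute_gain:.2f} HOTA")
```

Under that assumption, a 1.54% relative improvement at 41.7 HOTA corresponds to an absolute gain of well under one HOTA point.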
ClearGCD: Mitigating Shortcut Learning For Robust Generalized Category Discovery
In open-world scenarios, Generalized Category Discovery (GCD) requires identifying both known and novel categories within unlabeled data. However, existing methods often suffer from prototype confusion caused by shortcut learning, which undermines generalization and leads to forgetting of known classes. We propose ClearGCD, a framework designed to mitigate reliance on non-semantic cues through two complementary mechanisms. First, Semantic View Alignment (SVA) generates strong augmentations via cross-class patch replacement and enforces semantic consistency using weak augmentations. Second, Shortcut Suppression Regularization (SSR) maintains an adaptive prototype bank that aligns known classes while encouraging separation of potential novel ones. ClearGCD can be seamlessly integrated into parametric GCD approaches and consistently outperforms state-of-the-art methods across multiple benchmarks.
comment: 5 pages, 4 figures
CNN-Based Framework for Pedestrian Age and Gender Classification Using Far-View Surveillance in Mixed-Traffic Intersections
Pedestrian safety remains a pressing concern in congested urban intersections, particularly in low- and middle-income countries where traffic is multimodal, and infrastructure often lacks formal control. Demographic factors like age and gender significantly influence pedestrian vulnerability, yet real-time monitoring systems rarely capture this information. To address this gap, this study proposes a deep learning framework that classifies pedestrian age group and gender from far-view intersection footage using convolutional neural networks (CNNs), without relying on facial recognition or high-resolution imagery. The classification is structured as a unified six-class problem, distinguishing adult, teenager, and child pedestrians for both males and females, based on full-body visual cues. Video data was collected from three high-risk intersections in Dhaka, Bangladesh. Two CNN architectures were implemented: ResNet50, a deep convolutional neural network pretrained on ImageNet, and a custom lightweight CNN optimized for computational efficiency. Eight model variants explored combinations of pooling strategies and optimizers. ResNet50 with Max Pooling and SGD achieved the highest accuracy (86.19%), while the custom CNN performed comparably (84.15%) with fewer parameters and faster training. The model's efficient design enables real-time inference on standard surveillance feeds. For practitioners, this system provides a scalable, cost-effective tool to monitor pedestrian demographics at intersections using existing camera infrastructure. Its outputs can shape intersection design, optimize signal timing, and enable targeted safety interventions for vulnerable groups such as children or the elderly. By offering demographic insights often missing in conventional traffic data, the framework supports more inclusive, data-driven planning in mixed-traffic environments.
comment: Accepted for poster presentation at the 105th Annual Meeting of the Transportation Research Board
Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis NeurIPS 2025
Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging due to the high-dimensional, heterogeneous BOLD dynamics across subjects/acquisitions and the lack of neuroscience-grounded validation. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching task-evoked map correlation of 0.83 and RSA of 0.98, consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.
comment: Accepted at NeurIPS 2025 Workshop: Foundation Models for the Brain and Body. 13 pages, 6 figures, 4 tables
SUPER-AD: Semantic Uncertainty-aware Planning for End-to-End Robust Autonomous Driving
End-to-End (E2E) planning has become a powerful paradigm for autonomous driving, yet current systems remain fundamentally uncertainty-blind. They assume perception outputs are fully reliable, even in ambiguous or poorly observed scenes, leaving the planner without an explicit measure of uncertainty. To address this limitation, we propose a camera-only E2E framework that estimates aleatoric uncertainty directly in BEV space and incorporates it into planning. Our method produces a dense, uncertainty-aware drivability map that captures both semantic structure and geometric layout at pixel-level resolution. To further promote safe and rule-compliant behavior, we introduce a lane-following regularization that encodes lane structure and traffic norms. This prior stabilizes trajectory planning under normal conditions while preserving the flexibility needed for maneuvers such as overtaking or lane changes. Together, these components enable robust and interpretable trajectory planning, even under challenging uncertainty conditions. Evaluated on the NAVSIM benchmark, our method achieves state-of-the-art performance, delivering substantial gains on both the challenging NAVHARD and NAVSAFE subsets. These results demonstrate that our principled aleatoric uncertainty modeling combined with driving priors significantly advances the safety and reliability of camera-only E2E autonomous driving.
CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.
Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation AAAI 2026
Test-time adaptation (TTA) enables online model adaptation using only unlabeled test data, aiming to bridge the gap between source and target distributions. However, in multimodal scenarios, varying degrees of distribution shift across different modalities give rise to a complex coupling effect of unimodal shallow feature shift and cross-modal high-level semantic misalignment, posing a major obstacle to extending existing TTA methods to the multimodal field. To address this challenge, we propose a novel multimodal test-time adaptation (MMTTA) framework, termed Bridging Modalities via Progressive Re-alignment (BriMPR). BriMPR, consisting of two progressively enhanced modules, tackles the coupling effect with a divide-and-conquer strategy. Specifically, we first decompose MMTTA into multiple unimodal feature alignment sub-problems. By leveraging the strong function approximation ability of prompt tuning, we calibrate the unimodal global feature distributions to their respective source distributions, so as to achieve the initial semantic re-alignment across modalities. Subsequently, we assign credible pseudo-labels to combinations of masked and complete modalities, and introduce inter-modal instance-wise contrastive learning to further enhance the information interaction among modalities and refine the alignment. Extensive experiments on MMTTA tasks, including both corruption-based and real-world domain shift benchmarks, demonstrate the superiority of our method. Our source code is available at https://github.com/Luchicken/BriMPR.
comment: Accepted by AAAI 2026 (Oral)
MARVO: Marine-Adaptive Radiance-aware Visual Odometry CVPR2026
Underwater visual localization remains challenging due to wavelength-dependent attenuation, poor texture, and non-Gaussian sensor noise. We introduce MARVO, a physics-aware, learning-integrated odometry framework that fuses underwater image formation modeling, differentiable matching, and reinforcement-learning optimization. At the front-end, we extend a transformer-based feature matcher with a Physics-Aware Radiance Adapter that compensates for color channel attenuation and contrast loss, yielding geometrically consistent feature correspondences under turbidity. These semi-dense matches are combined with inertial and pressure measurements inside a factor-graph backend, where we formulate a keyframe-based visual-inertial-barometric estimator using the GTSAM library. Each keyframe introduces (i) pre-integrated IMU motion factors, (ii) MARVO-derived visual pose factors, and (iii) barometric depth priors, giving a full-state MAP estimate in real time. Lastly, we introduce a Reinforcement-Learning-based Pose-Graph Optimizer that refines global trajectories beyond local minima of classical least-squares solvers by learning optimal retraction actions on SE(2).
comment: 10 pages, 5 figures, 3 tables, Submitted to CVPR2026
GLOW: Global Illumination-Aware Inverse Rendering of Indoor Scenes Captured with Dynamic Co-Located Light & Camera
Inverse rendering of indoor scenes remains challenging due to the ambiguity between reflectance and lighting, exacerbated by inter-reflections among multiple objects. While natural illumination-based methods struggle to resolve this ambiguity, co-located light-camera setups offer better disentanglement as lighting can be easily calibrated via Structure-from-Motion. However, such setups introduce additional complexities like strong inter-reflections, dynamic shadows, near-field lighting, and moving specular highlights, which existing approaches fail to handle. We present GLOW, a Global Illumination-aware Inverse Rendering framework designed to address these challenges. GLOW integrates a neural implicit surface representation with a neural radiance cache to approximate global illumination, jointly optimizing geometry and reflectance through carefully designed regularization and initialization. We then introduce a dynamic radiance cache that adapts to sharp lighting discontinuities from near-field motion, and a surface-angle-weighted radiometric loss to suppress specular artifacts common in flashlight captures. Experiments show that GLOW substantially outperforms prior methods in material reflectance estimation under both natural and co-located illumination.
Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
Document understanding is a long-standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.
Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench, comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-source MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multimodal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.
Captain Safari: A World Engine
World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
DINO-Foresight: Looking into the Future with DINO NeurIPS 2025
Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments show the strong performance, robustness, and scalability of our framework. Project page and code at https://dino-foresight.github.io/.
comment: NeurIPS 2025
INQUIRE-Search: A Framework for Interactive Discovery in Large-Scale Biodiversity Databases
Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and habitat. Yet most ecological workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible at scale. We introduce INQUIRE-Search, an open-source system that enables scientists to rapidly and interactively search within an ecological image database for specific concepts using natural language, verify and export relevant observations, and utilize this discovered data for novel scientific analysis. Compared to traditional methods, INQUIRE-Search takes a fraction of the time, opening up new possibilities for scientific questions that can be explored. Through five case studies, we show the diversity of scientific applications that a tool like INQUIRE-Search can support, from seasonal variation in behavior across species to forest regrowth after wildfires. These examples demonstrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we emphasize that using such AI-enabled discovery tools for science calls for experts to reframe the priorities of the scientific process and develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.
comment: EV, JC, RKV contributed equally
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx
We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx contains 39,760 camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 319 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: southwest Bohemia and the Western Carpathians. In addition to the real camera trap data, we provide a large complementary set of photorealistic synthetic images and a Unity-based generation pipeline with diffusion-based text-to-texture modeling, capable of producing arbitrarily large amounts of synthetic data spanning diverse environments, poses, and coat-pattern variations. To enable systematic testing across realistic ecological scenarios, we define three complementary evaluation protocols: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set, covering cross-regional and long-term monitoring settings. With the provided resources, CzechLynx offers a unique, flexible benchmark for robust evaluation of computer vision and machine learning models across realistic ecological scenarios.
Configurable Fairness: Direct Optimization of Parity Metrics via Vision-Language Models
Performance disparities of image recognition across demographic groups are known to exist in deep learning-based models, due to imbalanced group representations or spurious correlation between group and target labels. Previous work has addressed such challenges without relying on expensive group labels, typically by upweighting high-loss samples or balancing discovered clusters. However, these heuristic strategies lack direct connection to specific fairness metrics and cannot guarantee optimization of parity-based criteria like equal opportunity, which ensures equal chance to receive positive outcomes across groups. In this work, we propose a novel paradigm that directly optimizes parity-based fairness metrics through specifically designed training objectives, without requiring group labels. We leverage vision-language models to analyze sensitive attribute relevancy for individual samples, then formulate loss functions that mathematically connect to each target fairness metric. This enables flexible optimization of different fairness criteria based on application needs. Experiments on multiple image classification datasets show that our metric-specific approach significantly improves parity-based fairness criteria and outperforms existing methods.
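As a point of reference for the parity criterion named above, equal opportunity is commonly evaluated as the gap in true-positive rate between groups. The sketch below computes that evaluation quantity on toy data; it is not the paper's training objective (which avoids group labels), and the function names and data are hypothetical.

```python
from collections import defaultdict


def true_positive_rates(y_true, y_pred, groups):
    """Per-group true-positive rate: P(pred = 1 | label = 1, group)."""
    positives = defaultdict(int)
    hits = defaultdict(int)
    for label, pred, group in zip(y_true, y_pred, groups):
        if label == 1:
            positives[group] += 1
            hits[group] += int(pred == 1)
    return {g: hits[g] / positives[g] for g in positives}


def equal_opportunity_gap(y_true, y_pred, groups):
    """Largest difference in true-positive rate between any two groups."""
    rates = true_positive_rates(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy data: two groups with four positive samples each (illustrative only).
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
groups = ["a"] * 5 + ["b"] * 5

rates = true_positive_rates(y_true, y_pred, groups)
gap = equal_opportunity_gap(y_true, y_pred, groups)
print(rates, gap)
```

A gap of zero means positive samples are recognized at the same rate in every group; group labels are needed here only for evaluation.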
Group Relative Attention Guidance for Image Editing
Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance (GRAG), a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
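The idea of reweighting each token's delta from a shared bias can be sketched in a few lines. This is a rough illustration under stated assumptions, not the released implementation: the bias is approximated here as the mean over a group of tokens, only the delta is scaled, and the names `reweight_deltas` and `delta_scale` are hypothetical.

```python
def reweight_deltas(tokens, delta_scale):
    """Scale each token's deviation from the group's shared bias vector.

    The bias is estimated as the per-dimension mean over the tokens
    (an assumption made for this sketch).
    """
    dim = len(tokens[0])
    bias = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [
        [bias[d] + delta_scale * (tok[d] - bias[d]) for d in range(dim)]
        for tok in tokens
    ]


# Two toy key tokens of dimension 2 (illustrative values only).
keys = [[1.0, 2.0], [3.0, 6.0]]
unchanged = reweight_deltas(keys, 1.0)   # scale 1 leaves the tokens as they are
amplified = reweight_deltas(keys, 2.0)   # scale > 1 pushes tokens away from the bias
print(unchanged, amplified)
```

Varying `delta_scale` continuously gives a single knob for how strongly the content-specific part of each token is weighted, which is the kind of control the abstract describes.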
SARD: Segmentation-Aware Anomaly Synthesis via Region-Constrained Diffusion with Discriminative Mask Guidance
Synthesizing realistic and spatially precise anomalies is essential for enhancing the robustness of industrial anomaly detection systems. While recent diffusion-based methods have demonstrated strong capabilities in modeling complex defect patterns, they often struggle with spatial controllability and fail to maintain fine-grained regional fidelity. To overcome these limitations, we propose SARD (Segmentation-Aware anomaly synthesis via Region-constrained Diffusion with discriminative mask Guidance), a novel diffusion-based framework specifically designed for anomaly generation. Our approach introduces a Region-Constrained Diffusion (RCD) process that preserves the background by freezing it and selectively updating only the foreground anomaly regions during the reverse denoising phase, thereby effectively reducing background artifacts. Additionally, we incorporate a Discriminative Mask Guidance (DMG) module into the discriminator, enabling joint evaluation of both global realism and local anomaly fidelity, guided by pixel-level masks. Extensive experiments on the MVTec-AD and BTAD datasets show that SARD surpasses existing methods in segmentation accuracy and visual quality, setting a new state-of-the-art for pixel-level anomaly synthesis.
comment: Accepted by The 2025 International Conference on Machine Intelligence and Nature-InspireD Computing (MIND)
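The region-constrained update described in the SARD abstract, freezing the background and updating only the masked foreground at each reverse step, can be sketched as a simple mask blend. This is a minimal illustration on flat pixel lists, not the authors' code; the function name and values are hypothetical, and in a full sampler the frozen background would typically come from the original image at the matching noise level.

```python
def region_constrained_step(denoised, background, mask):
    """Accept the denoiser's output only where mask == 1; keep the background elsewhere."""
    return [m * d + (1 - m) * b for d, b, m in zip(denoised, background, mask)]


background = [0.2, 0.2, 0.2, 0.2]   # frozen background values (hypothetical)
denoised = [0.9, 0.8, 0.7, 0.6]     # output of one reverse denoising step (hypothetical)
mask = [0, 1, 1, 0]                 # 1 marks the foreground anomaly region

blended = region_constrained_step(denoised, background, mask)
print(blended)
```

Because pixels outside the mask never take the denoiser's output, changes stay confined to the anomaly region.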
FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.
A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation IROS 2025
The recent development of foundation models for monocular depth estimation, such as Depth Anything, paved the way to zero-shot monocular depth estimation. Since Depth Anything returns an affine-invariant disparity map, the favored technique to recover metric depth consists in fine-tuning the model. However, this stage is not straightforward: it can be costly and time-consuming because of the training and the creation of the dataset. The latter must contain images captured by the camera that will be used at test time and the corresponding ground truth. Moreover, fine-tuning may also degrade the generalization capacity of the original model. Instead, we propose in this paper a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU. This approach avoids fine-tuning and preserves the generalization power of the original depth estimation model while being robust to noise in the sparse depth, in the camera-LiDAR calibration, or in the depth model. Our experiments highlight improvements relative to zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches, and better robustness than depth completion approaches. Code available at github.com/ENSTA-U2IS-AI/depth-rescaling.
comment: Published at IROS 2025 https://ieeexplore.ieee.org/document/11247168
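To illustrate the rescaling idea in the abstract above, here is a minimal sketch that fits a scale and shift mapping predicted disparity to inverse metric depth at the sparse 3D points. It uses plain least squares, which is an assumption: the abstract says the authors' method is robust to noise, and this sketch is not. The names `fit_scale_shift` and `to_metric_depth` are hypothetical.

```python
def fit_scale_shift(pred_disparity, sparse_depth):
    """Least-squares fit of (scale, shift) so that scale * d + shift ~ 1 / z.

    pred_disparity : affine-invariant disparities at the sparse point locations
    sparse_depth   : metric depths (> 0) of the same points, e.g. from LiDAR
    """
    targets = [1.0 / z for z in sparse_depth]
    n = len(pred_disparity)
    mean_d = sum(pred_disparity) / n
    mean_t = sum(targets) / n
    var_d = sum((d - mean_d) ** 2 for d in pred_disparity)
    if var_d == 0.0:
        raise ValueError("need at least two distinct disparity values")
    cov = sum((d - mean_d) * (t - mean_t)
              for d, t in zip(pred_disparity, targets))
    scale = cov / var_d
    shift = mean_t - scale * mean_d
    return scale, shift


def to_metric_depth(pred_disparity, scale, shift, eps=1e-6):
    """Convert disparities to metric depth with the fitted affine map."""
    return [1.0 / max(scale * d + shift, eps) for d in pred_disparity]
```

In practice a robust estimator would replace the closed-form fit when the sparse points contain outliers; the closed form is kept here only for brevity.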
Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin
3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.
comment: 8 pages, 5 figures. Submitted to IEEE Intelligent Vehicles Symposium (IV) 2026 for possible publication. Revised version (v2) to correct author order
Entropy Rectifying Guidance for Diffusion and Flow Models NeurIPS 2025
Guidance techniques are commonly used in diffusion and flow models to improve image quality and input consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) is the most widely adopted guidance technique. It results, however, in trade-offs across quality, diversity and consistency: improving some at the expense of others. While recent work has shown that it is possible to disentangle these factors to some extent, such methods come with an overhead of requiring an additional (weaker) model, or require more forward passes per sampling step. In this paper, we propose Entropy Rectifying Guidance (ERG), a simple and effective guidance method based on inference-time changes in the attention mechanism of state-of-the-art diffusion transformer architectures, which allows for simultaneous improvements over image quality, diversity and prompt consistency. ERG is more general than CFG and similar guidance techniques, as it extends to unconditional sampling. We show that ERG results in significant improvements in various tasks, including text-to-image, class-conditional and unconditional image generation. We also show that ERG can be seamlessly combined with other recent guidance methods such as CADS and APG, further improving generation results.
comment: NeurIPS 2025
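For reference, the standard classifier-free guidance rule that the abstract above uses as its point of comparison can be sketched as follows. This is the usual CFG combination of a conditional and an unconditional prediction, not ERG itself; `cfg_combine` and the flat-list noise predictions are illustrative names.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    larger values push further along the conditional direction.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]
```

Because the rule needs a conditional prediction to contrast against, it has no direct analogue for unconditional sampling, which is the setting the abstract says ERG extends to.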
PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit WACV 2026
Human pose estimators are typically retrained from scratch or naively fine-tuned whenever keypoint sets, sensing modalities, or deployment domains change--an inefficient, compute-intensive practice that rarely matches field constraints. We present PoseAdapt, an open-source framework and benchmark suite for continual pose model adaptation. PoseAdapt defines domain-incremental and class-incremental tracks that simulate realistic changes in density, lighting, and sensing modality, as well as skeleton growth. The toolkit supports two workflows: (i) Strategy Benchmarking, which lets researchers implement continual learning (CL) methods as plugins and evaluate them under standardized protocols; and (ii) Model Adaptation, which allows practitioners to adapt strong pretrained models to new tasks with minimal supervision. We evaluate representative regularization-based methods in single-step and sequential settings. Benchmarks enforce a fixed lightweight backbone, no access to past data, and tight per-step budgets. This isolates adaptation strategy effects, highlighting the difficulty of maintaining accuracy under strict resource limits. PoseAdapt connects modern CL techniques with practical pose estimation needs, enabling adaptable models that improve over time without repeated full retraining.
comment: Accepted in WACV 2026 Applications Track
Network Inversion for Uncertainty-Aware Out-of-Distribution Detection
Out-of-distribution (OOD) detection and uncertainty estimation (UE) are critical components for building safe machine learning systems, especially in real-world scenarios where unexpected inputs are inevitable. However, until recently the two problems have been addressed separately. In this work, we propose a novel framework that combines network inversion with classifier training to simultaneously address both OOD detection and uncertainty estimation. For a standard n-class classification task, we extend the classifier to an (n+1)-class model by introducing a "garbage" class, initially populated with random Gaussian noise to represent outlier inputs. After each training epoch, we use network inversion to reconstruct input images corresponding to all output classes; these initially appear noisy and incoherent and are therefore assigned to the garbage class for retraining the classifier. This cycle of training, inversion, and exclusion continues iteratively until the inverted samples begin to resemble the in-distribution data more closely, with a significant drop in the uncertainty, suggesting that the classifier has learned to carve out meaningful decision boundaries while sanitising the class manifolds by pushing OOD content into the garbage class. During inference, this training scheme enables the model to effectively detect and reject OOD samples by classifying them into the garbage class. Furthermore, the confidence scores associated with each prediction can be used to estimate uncertainty for both in-distribution and OOD inputs. Our approach is scalable, interpretable, and does not require access to external OOD datasets or post-hoc calibration techniques while providing a unified solution to the dual challenges of OOD detection and uncertainty estimation.
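The inference-time behaviour described above, rejecting inputs that fall into the garbage class and reading confidence off the prediction, can be sketched as below. The sketch assumes the garbage class is the last of the n+1 outputs and that confidence is the softmax probability of the predicted class; both are assumptions, and `predict_with_rejection` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_rejection(logits):
    """Return (label, confidence, is_ood) for an (n+1)-way classifier.

    The last index is assumed to be the garbage class; a prediction
    landing there is treated as out-of-distribution and rejected.
    """
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    garbage_index = len(probs) - 1
    return label, probs[label], label == garbage_index
```

No external OOD dataset or separate calibration step appears in this path, which matches the claim in the abstract; how well the confidence value tracks true uncertainty depends on the training cycle, which the sketch does not model.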
Source-free Video Domain Adaptation by Learning from Noisy Labels
Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts between source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher-student (TS) framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed CleanAdapt and CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available at https://avijit9.github.io/CleanAdapt.
comment: Our extended ICVGIP paper is now accepted in Pattern Recognition
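The small-loss selection step in the abstract above can be sketched as follows: compute the cross-entropy of each pseudo-label under the model's predicted probabilities and keep the lowest-loss fraction for fine-tuning. The fixed `keep_ratio` and the name `select_small_loss` are assumptions made for illustration; the abstract does not state how the cutoff is chosen.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one sample given class probabilities and a label."""
    return -math.log(max(probs[label], 1e-12))


def select_small_loss(probabilities, pseudo_labels, keep_ratio=0.5):
    """Return sorted indices of the samples with the smallest loss.

    probabilities : per-sample class probability lists from the model
    pseudo_labels : per-sample labels produced by the source model
    keep_ratio    : fraction of samples to keep (an assumed hyperparameter)
    """
    losses = [cross_entropy(p, y)
              for p, y in zip(probabilities, pseudo_labels)]
    num_keep = max(1, int(len(losses) * keep_ratio))
    order = sorted(range(len(losses)), key=losses.__getitem__)
    return sorted(order[:num_keep])
```

The selected indices would then define the subset of target videos used in the next fine-tuning round; the teacher-student variant is not modelled here.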
Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes
With the rapid advancement of intelligent transportation systems, text-driven image generation and editing techniques have demonstrated significant potential in providing rich, controllable visual scene data for applications such as traffic monitoring and autonomous driving. However, several challenges remain, including insufficient semantic richness of generated traffic elements, limited camera viewpoints, low visual fidelity of synthesized images, and poor alignment between textual descriptions and generated content. To address these issues, we propose a unified text-driven framework for both image generation and editing, leveraging a controllable mask mechanism to seamlessly integrate the two tasks. Furthermore, we incorporate both vehicle-side and roadside multi-view data to enhance the geometric diversity of traffic scenes. Our training strategy follows a two-stage paradigm: first, we perform conceptual learning using large-scale coarse-grained text-image data; then, we fine-tune with fine-grained descriptive data to enhance text-image alignment and detail quality. Additionally, we introduce a mask-region-weighted loss that dynamically emphasizes small yet critical regions during training, thereby substantially enhancing the generation fidelity of small-scale traffic elements. Extensive experiments demonstrate that our method achieves leading performance in text-based image generation and editing within traffic scenes.
TEFormer: Texture-Aware and Edge-Guided Transformer for Semantic Segmentation of Urban Remote Sensing Images
Accurate semantic segmentation of urban remote sensing images (URSIs) is essential for urban planning and environmental monitoring. However, it remains challenging due to the subtle texture differences and similar spatial structures among geospatial objects, which cause semantic ambiguity and misclassification. Additional complexities arise from irregular object shapes, blurred boundaries, and overlapping spatial distributions of objects, resulting in diverse and intricate edge morphologies. To address these issues, we propose TEFormer, a texture-aware and edge-guided Transformer. Our model features a texture-aware module (TaM) in the encoder to capture fine-grained texture distinctions between visually similar categories, thereby enhancing semantic discrimination. The decoder incorporates an edge-guided tri-branch decoder (Eg3Head) to preserve local edges and details while maintaining multiscale context-awareness. Finally, an edge-guided feature fusion module (EgFFM) effectively integrates contextual, detail, and edge information to achieve refined semantic segmentation. Extensive evaluation demonstrates that TEFormer yields mIoU scores of 88.57% on Potsdam and 81.46% on Vaihingen, exceeding the next best methods by 0.73% and 0.22%. On the LoveDA dataset, it secures the second position with an overall mIoU of 53.55%, trailing the optimal performance by a narrow margin of 0.19%.
comment: Accepted by IEEE GRSL
Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20% to 44% in speed.
comment: 17 pages, 8 figures
Activation Quantization of Vision Encoders Needs Prefixing Registers
Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose RegCache, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
comment: 19 pages, 8 figures
SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders
Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. SAEmnesia reduces hyperparameter search by 96.7% and achieves a 9.2% improvement over the state-of-the-art on the UnlearnCanvas benchmark. Our method also demonstrates superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a new standard for precise and controllable concept erasure. Moreover, SAEmnesia mitigates the possibility of generating unwanted content under adversarial attack and effectively removes nudity when evaluated with I2P.
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.
comment: 22 pages, 16 figures
Learning Contrastive Feature Representations for Facial Action Unit Detection
For the Facial Action Unit (AU) detection task, accurately capturing the subtle facial differences between distinct AUs is essential for reliable detection. Additionally, AU detection faces challenges from class imbalance and the presence of noisy or false labels, which undermine detection accuracy. In this paper, we introduce a novel contrastive learning framework for AU detection that incorporates both self-supervised and supervised signals, thereby enhancing the learning of discriminative features for accurate AU detection. To tackle the class imbalance issue, we employ a negative sample re-weighting strategy that adjusts the step size of updating parameters for minority and majority class samples. Moreover, to address the challenges posed by noisy and false AU labels, we employ a sampling technique that encompasses three distinct types of positive sample pairs. This enables us to inject self-supervised signals into the supervised signal, effectively mitigating the adverse effects of noisy labels. Our experimental assessments, conducted on five widely used benchmark datasets (BP4D, DISFA, BP4D+, GFT and Aff-Wild2), underscore the superior performance of our approach compared to state-of-the-art methods of AU detection. Our code is available at https://github.com/Ziqiao-Shang/AUNCE.
comment: This paper has been accepted by Pattern Recognition
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.
comment: Code: https://github.com/Kwai-Kolors/CoTyle Demo: https://huggingface.co/spaces/Kwai-Kolors/CoTyle Homepage: https://kwai-kolors.github.io/CoTyle/
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. As 3D training data is generally not as abundant as 2D training images, SegDINO3D is designed to fully leverage 2D representations from a pre-trained 2D detection model, including both image-level and object-level features, to improve the 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in memory while faithfully preserving the knowledge of the pre-trained 2D model. The introduction of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.6 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.
Autoregressive Styled Text Image Generation, but Make it Reliable WACV2026
Generating faithful and readable styled text images (especially for Styled Handwritten Text generation - HTG) is an open problem with several possible applications across graphic design, document understanding, and image editing. A lot of research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, with promising results in terms of style fidelity and generalization achieved by the recently proposed Autoregressive Transformer paradigm for HTG. However, this method requires additional inputs, lacks a proper stop mechanism, and might end up in repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, compared to previous solutions requires fewer inputs, generalizes better to unseen styles, and follows more faithfully the textual prompt, improving content adherence.
comment: Accepted at WACV2026
Infrared and Visible Image Fusion with Language-Driven Loss in CLIP Embedding Space ACM MM 2025
Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly complementary properties of the two image modalities. Due to the lack of ground-truth fused images, the fusion output of current deep-learning based methods heavily depends on mathematically defined loss functions. Because it is hard to define the fused image mathematically without ground truth, the performance of existing fusion methods is limited. In this paper, we propose to use natural language to express the objective of IVIF, which avoids the explicit mathematical modeling of the fusion output in current losses and makes full use of the advantages of language expression to improve fusion performance. For this purpose, we present a comprehensive language-expressed fusion objective, and encode relevant texts into the multi-modal embedding space using CLIP. A language-driven fusion model is then constructed in the embedding space by establishing the relationship among the embedded vectors representing the fusion objective and the input image modalities. Finally, a language-driven loss is derived to align the actual IVIF with the embedded language-driven fusion model via supervised training. Experiments show that our method can obtain much better fusion results than existing techniques. The code is available at https://github.com/wyhlaowang/LDFusion.
comment: Accepted by ACM MM 2025
Hybrid Rendering for Multimodal Autonomous Driving: Merging Neural and Physics-Based Simulation
Neural reconstruction models for autonomous driving simulation have made significant strides in recent years, with dynamic models becoming increasingly prevalent. However, these models are typically limited to handling in-domain objects closely following their original trajectories. We introduce a hybrid approach that combines the strengths of neural reconstruction with physics-based rendering. This method enables the virtual placement of traditional mesh-based dynamic agents at arbitrary locations, adjustments to environmental conditions, and rendering from novel camera viewpoints. Our approach significantly enhances novel view synthesis quality -- especially for road surfaces and lane markings -- while maintaining interactive frame rates through our novel training method, NeRF2GS. This technique leverages the superior generalization capabilities of NeRF-based methods and the real-time rendering speed of 3D Gaussian Splatting (3DGS). We achieve this by training a customized NeRF model on the original images with depth regularization derived from a noisy LiDAR point cloud, then using it as a teacher model for 3DGS training. This process ensures accurate depth, surface normals, and camera appearance modeling as supervision. With our block-based training parallelization, the method can handle large-scale reconstructions (greater than or equal to 100,000 square meters) and predict segmentation masks, surface normals, and depth maps. During simulation, it supports a rasterization-based rendering backend with depth-based composition and multiple camera models for real-time camera simulation, as well as a ray-traced backend for precise LiDAR simulation.
TRACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency MICCAI 2025
3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: https://github.com/VinyehShaw/TRACE.
comment: Accepted to MICCAI 2025 (this version is not peer-reviewed; it is the extended version)
Predicting Video Slot Attention Queries from Random Slot-Feature Pairs AAAI 2026
Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and understanding as we humans do. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner converts the current slots into queries for the next frame. This is an effective architecture, but all existing implementations both (i1) neglect to incorporate next-frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) we design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (t2) we train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state of the art. Such superiority also benefits downstream tasks like scene understanding. Source Code, Model Checkpoints, Training Logs: https://github.com/Genera1Z/RandSF.Q
comment: Accepted to AAAI 2026
HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models
Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.
FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision
Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.
AgriPotential: A Novel Multi-Spectral and Multi-Temporal Remote Sensing Dataset for Agricultural Potentials
Remote sensing has emerged as a critical tool for large-scale Earth monitoring and land management. In this paper, we introduce AgriPotential, a novel benchmark dataset composed of Sentinel-2 satellite imagery captured over multiple months. The dataset provides pixel-level annotations of agricultural potentials for three major crop types - viticulture, market gardening, and field crops - across five ordinal classes. AgriPotential supports a broad range of machine learning tasks, including ordinal regression, multi-label classification, and spatio-temporal modeling. The data cover diverse areas in Southern France, offering rich spectral information. AgriPotential is the first public dataset designed specifically for agricultural potential prediction, aiming to improve data-driven approaches to sustainable land use planning. The dataset and the code are freely accessible at: https://zenodo.org/records/15551829
comment: Accepted at CBMI 2025
OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild
A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current state-of-the-art methods learn a single, entangled forgery representation, conflating content-dependent flaws with content-agnostic artifacts, and are further constrained by outdated benchmarks. To overcome these limitations, we propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture. The core of our method is a hybrid expert system designed to decouple: (1) semantic flaws across distinct content domains, and (2) content-dependent flaws from content-agnostic universal artifacts. This system employs a set of Routable Specialized Semantic Experts, each for a distinct domain (e.g., human, animal), complemented by a Fixed Universal Artifact Expert. This architecture is trained using a novel two-stage strategy: we first train the experts independently with domain-specific hard-sampling to ensure specialization, and subsequently train a lightweight gating network for effective input routing. By explicitly decoupling "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), OmniAID achieves robust generalization. To address outdated benchmarks and validate real-world applicability, we introduce Mirage, a new large-scale, contemporary dataset. Extensive experiments, using both traditional benchmarks and our Mirage dataset, demonstrate our model surpasses existing monolithic detectors, establishing a new and robust standard for AIGI authentication against modern, in-the-wild threats.
comment: 19 pages, 10 figures, 19 tables
DiffFuSR: Super-Resolution of all Sentinel-2 Multispectral Bands using Diffusion Models SC
This paper presents DiffFuSR, a modular pipeline for super-resolving all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters. The pipeline comprises two stages: (i) a diffusion-based super-resolution (SR) model trained on high-resolution RGB imagery from the NAIP and WorldStrat datasets, harmonized to simulate Sentinel-2 characteristics; and (ii) a learned fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. We introduce a robust degradation model and contrastive degradation encoder to support blind SR. Extensive evaluations of the proposed SR pipeline on the OpenSR benchmark demonstrate that the proposed method outperforms current SOTA baselines in terms of reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Furthermore, the fusion network significantly outperforms classical and learned pansharpening approaches, enabling accurate enhancement of Sentinel-2's 20 m and 60 m bands. This work proposes a novel modular framework for Sentinel-2 SR that utilizes harmonized learning with diffusion models and fusion strategies. Our code and models can be found at https://github.com/NorskRegnesentral/DiffFuSR.
comment: Accepted for Publication at IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (TGRS)
Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code: https://github.com/YkiWu/Point3R.
comment: Code is available at: https://github.com/YkiWu/Point3R
Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion
Multimodal learning enhances the performance of various machine learning tasks by leveraging complementary information across different modalities. However, existing methods often learn multimodal representations that retain substantial inter-class confusion, making it difficult to achieve high-confidence predictions, particularly in real-world scenarios with low-quality or noisy data. To address this challenge, we propose Multi-Level Adaptive DeConfusion (MLAD), which eliminates inter-class confusion in multimodal data at both global and sample levels, significantly enhancing the classification reliability of multimodal models. Specifically, MLAD first learns class-wise latent distributions with global-level confusion removed via dynamic-exit modality encoders that adapt to the varying discrimination difficulty of each class and a cross-class residual reconstruction mechanism. Subsequently, MLAD further removes sample-specific confusion through sample-adaptive cross-modality rectification guided by confusion-free modality priors. These priors are constructed from low-confusion modality features, identified by evaluating feature confusion using the learned class-wise latent distributions and selecting those with low confusion via a Gaussian mixture model. Experiments demonstrate that MLAD outperforms state-of-the-art methods across multiple benchmarks and exhibits superior reliability.
comment: 15 pages, 10 figures
PhysX-3D: Physical-Grounded 3D Asset Generation NeurIPS 2025
3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose PhysX-3D, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet, the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. 2) Furthermore, we propose PhysXGen, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.
comment: Accepted by NeurIPS 2025, Spotlight. Project page: https://physx-3d.github.io/
ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation
We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.
comment: 16 pages, 5 figures, under review
When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking AAAI 2026
Multiple object tracking (MOT) technology has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. In this paper, we present Multiple Fish Tracking Dataset 2025 (MFT25), a comprehensive dataset specifically designed for underwater multiple fish tracking, featuring 15 diverse video sequences with 408,578 meticulously annotated bounding boxes across 48,066 frames. Our dataset captures various underwater environments, fish species, and challenging conditions including occlusions, similar appearances, and erratic motion patterns. Additionally, we introduce Scale-aware and Unscented Tracker (SU-T), a specialized tracking framework featuring an Unscented Kalman Filter (UKF) optimized for non-linear swimming patterns of fish and a novel Fish-Intersection-over-Union (FishIoU) matching that accounts for the unique morphological characteristics of aquatic species. Extensive experiments demonstrate that our SU-T baseline achieves state-of-the-art performance on MFT25, with 34.1 HOTA and 44.6 IDF1, while revealing fundamental differences between fish tracking and terrestrial object tracking scenarios. The dataset and codes are released at https://vranlee.github.io/SU-T/.
comment: Accepted by AAAI 2026 (Oral)
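For context on the matching step mentioned in the MFT25 abstract, the sketch below computes the standard axis-aligned Intersection-over-Union between two boxes. This is the generic IoU that tracking-by-detection pipelines commonly use for association; it is not the paper's FishIoU, whose morphology-aware formulation is not given in the abstract. The `(x1, y1, x2, y2)` box layout and the function name are assumptions made for illustration.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

A tracker would typically evaluate such a score for every track-detection pair and feed the resulting cost matrix to an assignment solver.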
Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction
Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant moment features and distinguishing them from query-relevant moments, encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, enhancing discrimination of fine-grained moment-level semantics by training the model to predict the original temporal order of randomly shuffled video sequences. Extensive experiments demonstrate the superiority of our method, achieving state-of-the-art results.
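The Temporal Coherence Prediction objective in the entry above trains a model to recover the original order of shuffled video moments. A minimal sketch of how such training pairs could be built is given below; the function names, the use of a seeded `random.Random`, and the list-based representation of moments are assumptions for illustration, not the authors' implementation.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_moments(
    moments: Sequence[T], seed: Optional[int] = None
) -> Tuple[List[T], List[int]]:
    """Return (shuffled moments, target order).

    target[j] is the original temporal index of shuffled[j]; this is the
    label an order-prediction head would be trained to output.
    """
    rng = random.Random(seed)
    target = list(range(len(moments)))
    rng.shuffle(target)
    shuffled = [moments[i] for i in target]
    return shuffled, target


def restore_order(shuffled: Sequence[T], predicted: Sequence[int]) -> List[T]:
    """Place each shuffled moment back at its predicted original index."""
    restored: List[Optional[T]] = [None] * len(shuffled)
    for item, position in zip(shuffled, predicted):
        restored[position] = item
    return restored  # type: ignore[return-value]


clip = ["m0", "m1", "m2", "m3", "m4", "m5"]
shuffled_clip, order_labels = shuffle_moments(clip, seed=0)
recovered = restore_order(shuffled_clip, order_labels)
```

With perfect predictions, `restore_order` reproduces the original clip, so the labels fully determine the temporal structure the model has to learn.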
Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective AAAI 2026
As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed) visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
comment: Accepted at AAAI 2026
SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object-relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention-based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object-relation-centric robotic manipulation.
comment: under review
UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning WACV 2026
Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
comment: 11 pages, 7 figures. Accepted at WACV 2026
ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization
Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP at IoU 0.50 to 0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA. Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems.
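The textual metric reported for ARIAL, ANLS (Average Normalized Levenshtein Similarity), can be sketched as follows. The sketch follows the commonly used definition: per question, take the best score over all reference answers, where a score is one minus the normalized edit distance and is set to zero once that distance reaches a threshold of 0.5. The lower-casing and whitespace stripping are assumptions about preprocessing, and the result is on a 0 to 1 scale, whereas the abstract reports percentages.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         tau: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


exact = anls(["invoice"], [["Invoice"]])
partial = anls(["kitten"], [["sitting"]])
```

The threshold means that answers more than half wrong by edit distance earn no credit, which separates OCR-level slips from genuinely incorrect answers.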
Reverberation: Learning the Latencies Before Forecasting Trajectories
Bridging the past to the future, connecting agents both spatially and temporally, lies at the core of the trajectory prediction task. Despite great efforts, it remains challenging to explicitly learn and predict latencies, i.e., response intervals or temporal delays with which agents respond to various trajectory-changing events and adjust their future paths, whether on their own or interactively. Different agents may exhibit distinct latency preferences for noticing, processing, and reacting to a specific trajectory-changing event. The lack of consideration of such latencies may undermine the causal continuity of forecasting systems, leading to implausible or unintended trajectories. Inspired by reverberation in acoustics, we propose a new reverberation transform and the corresponding Reverberation (Rev for short) trajectory prediction model, which predicts both individual latency preferences and their stochastic variations by using two explicit and learnable reverberation kernels, enabling latency-conditioned and controllable trajectory prediction of both non-interactive and social latencies. Experiments on multiple datasets, covering both pedestrians and vehicles, demonstrate that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses further verify the properties of the reverberation transform, highlighting its potential as a general latency modeling approach.
AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
comment: Project Page: https://dahlian00.github.io/AlignBench/
DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning
While diffusion models excel at image synthesis, useful representations have been shown to emerge from generative pre-training, suggesting a path towards unified generative and discriminative learning. However, suboptimal semantic flow within current architectures can hinder this potential: features encoding the richest high-level semantics are underutilized and diluted when propagating through decoding layers, impeding the formation of an explicit semantic bottleneck layer. To address this, we introduce self-conditioning, a lightweight mechanism that reshapes the model's layer-wise semantic hierarchy without external guidance. By aggregating and rerouting intermediate features to guide subsequent decoding layers, our method concentrates more high-level semantics, concurrently strengthening global generative guidance and forming more discriminative representations. This simple approach yields a dual-improvement trend across pixel-space UNet, UViT and latent-space DiT models with minimal overhead. Crucially, it creates an architectural semantic bridge that propagates discriminative improvements into generation and accommodates further techniques such as contrastive self-distillation. Experiments show that our enhanced models, especially self-conditioned DiT, are powerful dual learners that yield strong and transferable representations on image and dense classification tasks, surpassing various generative self-supervised models in linear probing while also improving or maintaining high generation quality.
comment: Updated version. Code available at https://github.com/FutureXiang/ddae_plus_plus
OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile and multimodal control. Because dance is a highly dynamic and complex form of human motion covering various styles and genres, dance generation must satisfy diverse conditions beyond just music (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale and richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, including a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions. Project Page: https://open-dance.github.io
Axial-UNet: A Neural Weather Model for Precipitation Nowcasting
Accurately predicting short-term precipitation is critical for weather-sensitive applications such as disaster management, aviation, and urban planning. Traditional numerical weather prediction can be computationally intensive at high resolution and short lead times. In this work, we propose a lightweight UNet-based encoder-decoder augmented with axial-attention blocks that attend along image rows and columns to capture long-range spatial interactions, while temporal context is provided by conditioning on multiple past radar frames. Our hybrid architecture captures both local and long-range spatio-temporal dependencies from radar image sequences, enabling fixed lead-time precipitation nowcasting with modest compute. Experimental results on a preprocessed subset of the HKO-7 radar dataset demonstrate that our model outperforms ConvLSTM, pix2pix-style cGANs, and a plain UNet in pixel-fidelity metrics, reaching PSNR 47.67 and SSIM 0.9943. We report PSNR/SSIM here; extending evaluation to meteorology-oriented skill measures (e.g., CSI/FSS) is left to future work. The approach is simple, scalable, and effective for resource-constrained, real-time forecasting scenarios.
comment: 16 pages, 3 figures. Accepted at the International Conference on Distributed Computing and Intelligent Technology (ICDCIT 2026), to appear in Springer LNCS
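Since the Axial-UNet abstract reports pixel-fidelity numbers such as PSNR 47.67, a short sketch of how PSNR is computed may help when reading them. It uses the standard definition, 10 * log10(MAX^2 / MSE), on flattened pixel lists; the `max_val` default of 1.0 assumes images scaled to [0, 1], which the abstract does not state.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         estimate: Sequence[float],
         max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both inputs are flattened pixel sequences; max_val is the largest
    possible pixel value (1.0 for normalized images, 255 for 8-bit).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


value = psnr([0.0, 1.0], [0.0, 0.9])  # MSE = 0.005
```

Because PSNR depends only on the mean squared error, it rewards smooth predictions, which is one reason the authors point to CSI/FSS as complementary meteorological skill measures.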
Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos AAAI26
Multi-object tracking (MOT) aims to track multiple objects while maintaining consistent identities across frames of a given video. In unmanned aerial vehicle (UAV) recorded videos, frequent viewpoint changes and complex UAV-ground relative motion dynamics pose significant challenges, which often lead to unstable affinity measurement and ambiguous association. Existing methods typically model motion and appearance cues separately, overlooking their spatio-temporal interplay and resulting in suboptimal tracking performance. In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. Specifically, the AMC matrix computes bi-directional spatial consistency under the guidance of appearance features, enabling more reliable and context-aware identity association. The MTC module complements AMC by reactivating unmatched tracks through appearance-guided predictions that align with Kalman-based predictions, thereby reducing broken trajectories caused by missed detections. Extensive experiments on three UAV benchmarks, including VisDrone2019, UAVDT, and VT-MOT-UAV, demonstrate that our AMOT outperforms current state-of-the-art methods and generalizes well in a plug-and-play and training-free manner.
comment: Accepted by the AAAI26 Conference Main Track
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%. Code and models will be released publicly.
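The adaptive scoring idea in the ReGATE abstract (teacher guidance losses fused with an exponential moving average of student difficulty, then used to keep only informative tokens) can be sketched as below. The abstract does not give the fusion rule, so the convex combination with weight `alpha`, the EMA decay `beta`, and the top-k selection by `keep_ratio` are all hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def update_ema(ema: Sequence[float], student_loss: Sequence[float],
               beta: float = 0.9) -> List[float]:
    """Exponential moving average of per-token student difficulty."""
    return [beta * e + (1.0 - beta) * s for e, s in zip(ema, student_loss)]


def fuse_scores(teacher_loss: Sequence[float], ema: Sequence[float],
                alpha: float = 0.5) -> List[float]:
    """Hypothetical fusion: convex mix of teacher loss and student EMA."""
    return [alpha * t + (1.0 - alpha) * e for t, e in zip(teacher_loss, ema)]


def select_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Indices of the highest-scoring tokens, returned in sequence order."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


ema_state = update_ema([0.0, 0.0, 0.0, 0.0], [0.2, 1.8, 1.0, 0.6], beta=0.5)
token_scores = fuse_scores([0.1, 0.9, 0.5, 0.3], ema_state, alpha=0.5)
kept = select_tokens(token_scores, keep_ratio=0.5)
```

Only the tokens in `kept` would be passed through the student's forward pass in such a scheme, which is where the training-time savings would come from.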
Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models WACV 2026
Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.
comment: WACV 2026. Project page: https://hecoding.github.io/colorwave-page
Image and Video Processing
Deep Learning for Restoring MPI System Matrices Using Simulated Training Data
Magnetic particle imaging reconstructs tracer distributions using a system matrix obtained through time-consuming, noise-prone calibration measurements. Methods for addressing imperfections in measured system matrices increasingly rely on deep neural networks, yet curated training data remain scarce. This study evaluates whether physics-based simulated system matrices can be used to train deep learning models for different system matrix restoration tasks, i.e., denoising, accelerated calibration, upsampling, and inpainting, that generalize to measured data. A large dataset of system matrices was generated using an equilibrium magnetization model extended with uniaxial anisotropy. The dataset spans particle, scanner, and calibration parameters for 2D and 3D trajectories, and includes background noise injected from empty-frame measurements. For each restoration task, deep learning models were compared with classical non-learning baseline methods. The models trained solely on simulated system matrices generalized to measured data across all tasks: for denoising, DnCNN/RDN/SwinIR outperformed the DCT-F baseline by >10 dB PSNR and up to 0.1 SSIM on simulations and led to perceptually better reconstructions of real data; for 2D upsampling, SMRnet exceeded bicubic by 20 dB PSNR and 0.08 SSIM at $\times 2$-$\times 4$, although this gain did not transfer qualitatively to real measurements. For 3D accelerated calibration, SMRnet matched tricubic in noiseless cases and was more robust under noise, and for 3D inpainting, biharmonic inpainting was superior when noise-free but degraded with noise, while a PConvUNet maintained quality and yielded less blurry reconstructions. The demonstrated transferability of deep learning models trained on simulations to real measurements mitigates the data-scarcity problem and enables the development of new methods beyond current measurement capabilities.
MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT Images
Orthopantomograms (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution for this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, which includes 2,380 OPG images and 330 CBCT scans, all featuring detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge confirms the substantial benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundational models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants' submitted code have been made publicly available on GitHub (https://github.com/ricoleehduu/STS-Challenge-2024), ensuring transparency and reproducibility.
Two-Dimensional Tomographic Reconstruction From Projections With Unknown Angles and Unknown Spatial Shifts ICASSP
In parallel beam computed tomography (CT), an object is reconstructed from a series of projections taken at different angles. However, in some industrial and biomedical imaging applications, the projection geometry is unknown, completely or partially. In this paper, we present a technique for two-dimensional (2D) tomography in which both viewing angles and spatial shifts associated with the projections are unknown. There exists literature on 2D unknown view tomography (UVT), but most existing 2D UVT algorithms assume that the projections are centered; that is, there are no spatial shifts in the projections. To tackle these geometric ambiguities, we first modify an existing graph Laplacian-based algorithm for 2D UVT to incorporate spatial shifts, and then use it as the initialization for the proposed three-way alternating minimization algorithm that jointly estimates the 2D structure, its projection angles, and the corresponding shifts. We evaluate our method on noisy projections of ribosome images and demonstrate that it achieves superior reconstruction compared to the baseline that neglects shifts.
comment: 5 pages, 2 figures, 1 table, submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
TokCom-UEP: Semantic Importance-Matched Unequal Error Protection for Resilient Image Transmission
Authors: Kaizheng Zhang, Zuolin Jin, Zhihang Cheng, Ming Zeng, Li Qiao, Zesong Fei. Token communication (TokCom), an emerging semantic communication framework powered by large multimodal models (LMMs), has become a key paradigm for resilient data transmission in 6G networks. A key limitation of existing TokCom designs lies in the assumption of uniform token importance, which leads to the adoption of equal error protection (EEP). However, compressed one-dimensional (1D) token sequences inherently exhibit heterogeneous semantic importance hierarchies, rendering EEP schemes suboptimal. To address this, this paper proposes TokCom-UEP, a novel semantic importance-matched unequal error protection (UEP) framework designed for resilient image transmission. TokCom-UEP integrates rateless UEP coding with the non-uniform semantic importance of tokens by partitioning source tokens into nested expanding windows, assigning higher selection probabilities to windows containing critical tokens to ensure their prioritized recovery. Simulation results demonstrate that TokCom-UEP outperforms EEP schemes in terms of three core semantic restoration metrics and spectral efficiency under low-overhead conditions.
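To make the nested expanding-window idea in the TokCom-UEP abstract concrete, the sketch below computes the per-draw probability that each source token is picked when an encoder first chooses a window and then a token uniformly inside it. The two-stage sampling rule, the function name, and the example window sizes and probabilities are assumptions in the style of expanding-window rateless codes; the paper's actual degree distributions and parameters are not given in the abstract.

```python
from typing import List


def token_selection_probs(window_sizes: List[int],
                          window_probs: List[float]) -> List[float]:
    """Per-draw selection probability of each token under nested windows.

    Assumptions (illustrative only):
      * tokens are sorted by decreasing semantic importance;
      * window j covers the first window_sizes[j] tokens (nested, expanding);
      * a window is drawn with probability window_probs[j], then one token
        is drawn uniformly from that window.
    """
    if any(a >= b for a, b in zip(window_sizes, window_sizes[1:])):
        raise ValueError("window sizes must be strictly increasing")
    if abs(sum(window_probs) - 1.0) > 1e-9:
        raise ValueError("window probabilities must sum to 1")

    probs = [0.0] * window_sizes[-1]
    for size, g in zip(window_sizes, window_probs):
        for i in range(size):
            probs[i] += g / size  # token i lies inside this window
    return probs


if __name__ == "__main__":
    # Two windows over four tokens: the first two tokens are "critical".
    print(token_selection_probs([2, 4], [0.5, 0.5]))
```

Because the critical tokens belong to every window, they accumulate probability from each one, which is what yields unequal protection even when the window probabilities themselves are equal.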
TIE: A Training-Inversion-Exclusion Framework for Visually Interpretable and Uncertainty-Guided Out-of-Distribution Detection
Deep neural networks often struggle to recognize when an input lies outside their training experience, leading to unreliable and overconfident predictions. Building dependable machine learning systems therefore requires methods that can both estimate predictive uncertainty and detect out-of-distribution (OOD) samples in a unified manner. In this paper, we propose TIE: a Training-Inversion-Exclusion framework for visually interpretable and uncertainty-guided anomaly detection that jointly addresses these challenges through iterative refinement. TIE extends a standard $n$-class classifier to an $(n+1)$-class model by introducing a garbage class initialized with Gaussian noise to represent outlier inputs. Within each epoch, TIE performs a closed-loop process of training, inversion, and exclusion, where highly uncertain inverted samples reconstructed from the just-trained classifier are excluded into the garbage class. Over successive iterations, the inverted samples transition from noisy artifacts into visually coherent class prototypes, providing transparent insight into how the model organizes its learned manifolds. During inference, TIE rejects OOD inputs by either directly mapping them to the garbage class or producing low-confidence, uncertain misclassifications within the in-distribution classes that are easily separable, all without relying on external OOD datasets. A comprehensive threshold-based evaluation using multiple OOD metrics and performance measures such as AUROC, AUPR, and FPR@95%TPR demonstrates that TIE offers a unified and interpretable framework for robust anomaly detection and calibrated uncertainty estimation (UE), achieving near-perfect OOD detection with $\approx$ 0 FPR@95%TPR when trained on MNIST or FashionMNIST and tested against diverse unseen datasets.
Beyond Expected Goals: A Probabilistic Framework for Shot Occurrences in Soccer
Expected goals (xG) models estimate the probability that a shot results in a goal from its context (e.g., location, pressure), but they operate only on observed shots. We propose xG+, a possession-level framework that first estimates the probability that a shot occurs within the next second and its corresponding xG if it were to occur. We also introduce ways to aggregate this joint probability estimate over the course of a possession. By jointly modeling shot-taking behavior and shot quality, xG+ remedies the conditioning-on-shots limitation of standard xG. We show that this improves predictive accuracy at the team level and produces a more persistent player skill signal than standard xG models.
comment: 18pp main + 3pp appendix; 8 figures, 12 tables. Submitted to the Journal of Quantitative Analysis in Sports (JQAS). Data proprietary to Gradient Sports; we share derived features & scripts (code under MIT/Apache-2.0). Preprint licensed CC BY 4.0
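The xG+ abstract mentions aggregating the joint estimate (probability of a shot in the next second times the xG of that shot) over a possession, without saying how. The sketch below shows two simple candidates. Both aggregation rules, the independence assumption across time steps, and all function names are assumptions for illustration, not the authors' definitions.

```python
from typing import List


def joint_values(shot_probs: List[float], xg_if_shot: List[float]) -> List[float]:
    """Per-step joint estimate: P(shot in next second) * xG if that shot occurred."""
    return [p * xg for p, xg in zip(shot_probs, xg_if_shot)]


def sum_aggregate(shot_probs: List[float], xg_if_shot: List[float]) -> float:
    """Assumed aggregation 1: add up the per-step joint estimates."""
    return sum(joint_values(shot_probs, xg_if_shot))


def any_goal_aggregate(shot_probs: List[float], xg_if_shot: List[float]) -> float:
    """Assumed aggregation 2: probability of at least one goal, treating
    time steps as independent (a simplification, since a shot usually
    changes or ends the possession)."""
    no_goal = 1.0
    for v in joint_values(shot_probs, xg_if_shot):
        no_goal *= 1.0 - v
    return 1.0 - no_goal


if __name__ == "__main__":
    p_shot = [0.5, 0.5]   # hypothetical per-second shot probabilities
    xg = [0.2, 0.4]       # hypothetical xG if a shot were taken
    print(sum_aggregate(p_shot, xg), any_goal_aggregate(p_shot, xg))
```

The two rules agree closely when per-step values are small and diverge as they grow, which is one reason the choice of aggregation matters at the possession level.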
AutocleanEEG ICVision: Automated ICA Artifact Classification Using Vision-Language AI
We introduce EEG Autoclean Vision Language AI (ICVision), a first-of-its-kind system that emulates expert-level EEG ICA component classification through AI-agent vision and natural language reasoning. Unlike conventional classifiers such as ICLabel, which rely on handcrafted features, ICVision directly interprets ICA dashboard visualizations (topography, time series, power spectra, and ERP plots) using a multimodal large language model (GPT-4 Vision). This allows the AI to see and explain EEG components the way trained neurologists do, making it the first scientific implementation of AI-agent visual cognition in neurophysiology. ICVision classifies each component into one of six canonical categories (brain, eye, heart, muscle, channel noise, and other noise), returning both a confidence score and a human-like explanation. Evaluated on 3,168 ICA components from 124 EEG datasets, ICVision achieved k = 0.677 agreement with expert consensus, surpassing MNE ICLabel, while also preserving clinically relevant brain signals in ambiguous cases. Over 97% of its outputs were rated as interpretable and actionable by expert reviewers. As a core module of the open-source EEG Autoclean platform, ICVision signals a paradigm shift in scientific AI, where models do not just classify, but see, reason, and communicate. It opens the door to globally scalable, explainable, and reproducible EEG workflows, marking the emergence of AI agents capable of expert-level visual decision-making in brain science and beyond.
comment: 6 pages, 8 figures
Hybrid Context-Fusion Attention (CFA) U-Net and Clustering for Robust Seismic Horizon Interpretation
Interpreting seismic horizons is a critical task for characterizing subsurface structures in hydrocarbon exploration. Recent advances in deep learning, particularly U-Net-based architectures, have significantly improved automated horizon tracking. However, challenges remain in accurately segmenting complex geological features and interpolating horizons from sparse annotations. To address these issues, a hybrid framework is presented that integrates advanced U-Net variants with spatial clustering to enhance horizon continuity and geometric fidelity. The core contribution is the Context Fusion Attention (CFA) U-Net, a novel architecture that fuses spatial and Sobel-derived geometric features within attention gates to improve both precision and surface completeness. The performance of five architectures (standard and compressed U-Net, U-Net++, Attention U-Net, and CFA U-Net) was systematically evaluated across various data sparsity regimes (10-, 20-, and 40-line spacing). This approach outperformed existing baselines, achieving state-of-the-art results on the Mexilhao field (Santos Basin, Brazil) dataset with a validation IoU of 0.881 and MAE of 2.49 ms, and excellent surface coverage of 97.6% on the F3 Block of the North Sea dataset under sparse conditions. The framework further refines merged horizon predictions (inline and cross-line) using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to produce geologically plausible surfaces. The results demonstrate the advantages of hybrid methodologies and attention-based architectures enhanced with geometric context, providing a robust and generalizable solution for seismic interpretation in structurally complex and data-scarce environments.
Efficient Edge-Compatible CNN for Speckle-Based Material Recognition in Laser Cutting Systems
Accurate material recognition is critical for safe and effective laser cutting, as misidentification can lead to poor cut quality, machine damage, or the release of hazardous fumes. Laser speckle sensing has recently emerged as a low-cost and non-destructive modality for material classification; however, prior work has either relied on computationally expensive backbone networks or addressed only limited subsets of materials. In this study, a lightweight convolutional neural network (CNN) tailored for speckle patterns is proposed, designed to minimize parameters while maintaining high discriminative power. Using the complete SensiCut dataset of 59 material classes spanning woods, acrylics, composites, textiles, metals, and paper-based products, the proposed model achieves 95.05% test accuracy, with macro and weighted F1-scores of 0.951. The network contains only 341k trainable parameters (~1.3 MB) -- over 70x fewer than ResNet-50 -- and achieves an inference speed of 295 images per second, enabling deployment on Raspberry Pi and Jetson-class devices. Furthermore, when materials are regrouped into nine and five practical families, recall exceeds 98% and approaches 100%, directly supporting power and speed preset selection in laser cutters. These results demonstrate that compact, domain-specific CNNs can outperform large backbones for speckle-based material classification, advancing the feasibility of material-aware, edge-deployable laser cutting systems.
comment: Copyright 2025 IEEE. This is the author's version of the work that has been Accepted for publication in the Proceedings of the 2025 IEEE The 35th International Conference on Computer Theory and Applications (ICCTA 2025). Final published version will be available on IEEE Xplore
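The size figures quoted in the speckle-CNN abstract can be sanity-checked with a few lines of arithmetic. The sketch assumes 32-bit floating-point weights and uses the parameter count of the standard torchvision ResNet-50 (25,557,032) as the reference; the abstract does not state which storage format or ResNet-50 variant the authors used, and 341k is treated as exactly 341,000.

```python
PARAMS = 341_000               # "341k trainable parameters" (rounded in the abstract)
BYTES_PER_WEIGHT = 4           # assumption: float32 storage
RESNET50_PARAMS = 25_557_032   # assumption: torchvision ResNet-50 with 1000-class head


def model_size_mib(n_params: int, bytes_per_weight: int = BYTES_PER_WEIGHT) -> float:
    """Raw weight storage in MiB (2**20 bytes), ignoring file-format overhead."""
    return n_params * bytes_per_weight / 2**20


def reduction_factor(reference_params: int, n_params: int) -> float:
    """How many times fewer parameters the small model has."""
    return reference_params / n_params


if __name__ == "__main__":
    print(f"size:  {model_size_mib(PARAMS):.2f} MiB")
    print(f"ratio: {reduction_factor(RESNET50_PARAMS, PARAMS):.1f}x fewer parameters")
```

Under these assumptions the weights occupy about 1.3 MiB and the reduction factor is a little under 75, both consistent with the "~1.3 MB" and "over 70x" statements.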
Ternary-Input Binary-Weight CNN Accelerator Design for Miniature Object Classification System with Query-Driven Spatial DVS
Miniature imaging systems are essential for space-constrained applications but are limited by memory and power constraints. While machine learning can reduce data size by extracting key features, its high energy demands often exceed the capacity of small batteries. This paper presents a CNN hardware accelerator optimized for object classification in miniature imaging systems. It processes data from a spatial Dynamic Vision Sensor (DVS), reconfigurable to a temporal DVS via pixel sharing, minimizing sensor area. By using ternary DVS outputs and a ternary-input, binary-weight neural network, the design reduces computation and memory needs. Fabricated in 28 nm CMOS, the accelerator cuts data size by 81% and MAC operations by 27%. It achieves 440 ms inference time at just 1.6 mW power consumption, improving the Figure-of-Merit (FoM) by 7.3x over prior CNN accelerators for miniature systems.
comment: 6 pages, 12 figures, 2 tables
AgriPotential: A Novel Multi-Spectral and Multi-Temporal Remote Sensing Dataset for Agricultural Potentials
Remote sensing has emerged as a critical tool for large-scale Earth monitoring and land management. In this paper, we introduce AgriPotential, a novel benchmark dataset composed of Sentinel-2 satellite imagery captured over multiple months. The dataset provides pixel-level annotations of agricultural potentials for three major crop types - viticulture, market gardening, and field crops - across five ordinal classes. AgriPotential supports a broad range of machine learning tasks, including ordinal regression, multi-label classification, and spatio-temporal modeling. The data cover diverse areas in Southern France, offering rich spectral information. AgriPotential is the first public dataset designed specifically for agricultural potential prediction, aiming to improve data-driven approaches to sustainable land use planning. The dataset and the code are freely accessible at: https://zenodo.org/records/15551829
comment: Accepted at CBMI 2025
DiffFuSR: Super-Resolution of all Sentinel-2 Multispectral Bands using Diffusion Models SC
This paper presents DiffFuSR, a modular pipeline for super-resolving all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters. The pipeline comprises two stages: (i) a diffusion-based super-resolution (SR) model trained on high-resolution RGB imagery from the NAIP and WorldStrat datasets, harmonized to simulate Sentinel-2 characteristics; and (ii) a learned fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. We introduce a robust degradation model and contrastive degradation encoder to support blind SR. Extensive evaluations of the proposed SR pipeline on the OpenSR benchmark demonstrate that it outperforms current SOTA baselines in terms of reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Furthermore, the fusion network significantly outperforms classical and learned pansharpening approaches, enabling accurate enhancement of Sentinel-2's 20 m and 60 m bands. This work proposes a novel modular framework for Sentinel-2 SR that utilizes harmonized learning with diffusion models and fusion strategies. Our code and models can be found at https://github.com/NorskRegnesentral/DiffFuSR.
comment: Accepted for Publication at IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (TGRS)
Zero-shot Denoising via Neural Compression: Theoretical and algorithmic framework
Zero-shot denoising aims to denoise observations without access to training samples or clean reference images. This setting is particularly relevant in practical imaging scenarios involving specialized domains such as medical imaging or biology. In this work, we propose the Zero-Shot Neural Compression Denoiser (ZS-NCD), a novel denoising framework based on neural compression. ZS-NCD treats a neural compression network as an untrained model, optimized directly on patches extracted from a single noisy image. The final reconstruction is then obtained by aggregating the outputs of the trained model over overlapping patches. Thanks to the built-in entropy constraints of compression architectures, our method naturally avoids overfitting and does not require manual regularization or early stopping. Through extensive experiments, we show that ZS-NCD achieves state-of-the-art performance among zero-shot denoisers for both Gaussian and Poisson noise, and generalizes well to both natural and non-natural images. Additionally, we provide new finite-sample theoretical results that characterize upper bounds on the achievable reconstruction error of general maximum-likelihood compression-based denoisers. These results further establish the theoretical foundations of compression-based denoising. Our code is available at: https://github.com/Computational-Imaging-RU/ZS-NCDenoiser.
Multimedia
Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice
Bangladesh's low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. Even under a conservative upper bound, Mina operates at just 0.12-0.61% of typical legal consultation costs in Bangladesh, yielding a 99.4-99.9% cost reduction relative to human-provided services. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.
LongCat-Flash-Omni Technical Report
We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address these challenges, we propose Video Echoed in Music (VeM), a latent music diffusion model that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining temporal coherence through position and duration encoding. For rhythmic precision, the frame-level transition-beat aligner and adapter (TB-As) dynamically synchronize visual scene transitions with music beats. We further contribute a novel video-music paired dataset sourced from e-commerce advertisements and video-sharing platforms, which imposes stricter transition-beat synchronization requirements. We also introduce novel metrics tailored to the task. Experimental results demonstrate the superiority of VeM, particularly in semantic relevance and rhythmic precision.
Cap2Sum: Learning to Summarize Videos by Generating Captions
With the rapid growth of video data on the internet, video summarization is becoming a very important AI technology. However, due to the high labelling cost of video summarization, existing studies have to be conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce the use of dense video captions as a supervision signal to train video summarization models. Motivated by this, we propose Cap2Sum, a model that learns to summarize videos by generating captions, to exploit dense video caption annotations. This weakly-supervised approach allows us to train the models on large-scale dense video caption datasets to achieve better performance and generalization capacity. To further improve the generalization capacity, we introduce a CLIP (a strong vision-language model) Prior mechanism to enhance the learning of important objects that captions may ignore in the videos. In practice, Cap2Sum can perform zero-shot video summarization or be fine-tuned by the ground-truth summary or video caption of the target dataset. To examine the performance of Cap2Sum after weakly-supervised fine-tuning by the video captions, we propose two new datasets, TVSum-Caption and SumMe-Caption, which are derived from two common video summarization datasets and will be publicly released. We conduct extensive experiments and the results demonstrate that our method achieves significant improvements in performance and generalization capacity compared with previous methods.
comment: 13 pages, 4 figures
Computation and Language
ThetaEvolve: Test-time Learning on Open Problems
Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system, so the models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals. ThetaEvolve is the first evolving framework that enables a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and first auto-correlation inequality) mentioned in AlphaEvolve. In addition, across two models and four open tasks, we find that ThetaEvolve with RL at test time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both the trained target task and other unseen tasks. We release our code publicly: https://github.com/ypwang61/ThetaEvolve
comment: 30 pages, link: https://github.com/ypwang61/ThetaEvolve
MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation
Small and medium-sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real-time engagement is essential for conversion. However, developing AI-driven chatbots for this purpose requires large, high-quality question-and-answer (Q&A) datasets, which are typically expensive and resource-intensive to produce, especially for low-resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian Q&A dataset designed to evaluate intelligent sales chatbots in Telegram-based e-commerce. We propose a novel, automated multi-agent architecture that generates persona-aware Q&A pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval-augmented generation (RAG) models with our advanced agentic system, which features multi-query retrieval, reranking, and persona-aligned response synthesis. Using GPT-5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high-quality datasets without relying on expensive human annotation or complex fine-tuning. MegaChat provides SMEs with an efficient, cost-effective solution for building intelligent customer engagement systems in specialized commercial domains, enabling advancements in multilingual conversational AI for low-resource languages. Download: https://github.com/MegaChat-Tech/MegaChat-DataSet
comment: 6 pages, 11 figures, 2 tables
Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization EMNLP 2025
Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has increasingly focused on the role of token importance in improving DPO effectiveness. It is observed that identical or semantically similar content (defined as ambiguous content) frequently appears within the preference pairs. We hypothesize that the presence of ambiguous content during DPO training may introduce ambiguity, thereby limiting further improvements in alignment. Through mathematical analysis and proof-of-concept experiments, we reveal that ambiguous content can introduce ambiguities, thereby degrading performance. To address this issue, we introduce Ambiguity Awareness Optimization (AAO), a simple yet effective approach that automatically re-weights ambiguous content to reduce ambiguities by calculating semantic similarity from preference pairs. Through extensive experiments, we demonstrate that AAO consistently and significantly surpasses state-of-the-art approaches in performance, without markedly increasing response length, across multiple model scales and widely adopted benchmark datasets, including AlpacaEval 2, MT-Bench, and Arena-Hard. Specifically, AAO outperforms DPO by up to 8.9 points on AlpacaEval 2 and achieves an improvement of up to 15.0 points on Arena-Hard.
comment: Accepted at EMNLP 2025 main
Is Passive Expertise-Based Personalization Enough? A Case Study in AI-Assisted Test-Taking EMNLP 2025
Novice and expert users have different systematic preferences in task-oriented dialogues. However, whether catering to these preferences actually improves user experience and task performance remains understudied. To investigate the effects of expertise-based personalization, we first built a version of an enterprise AI assistant with passive personalization. We then conducted a user study where participants completed timed exams, aided by the two versions of the AI assistant. Preliminary results indicate that passive personalization helps reduce task load and improve assistant perception, but reveal task-specific limitations that can be addressed through providing more user agency. These findings underscore the importance of combining active and passive personalization to optimize user experience and effectiveness in enterprise task-oriented environments.
comment: Accepted into Tailoring AI: Exploring Active and Passive LLM Personalization (PALS) workshop at EMNLP 2025
Optimizing Multimodal Language Models through Attention-based Interpretability
Modern large language models are becoming multimodal, analyzing various data formats such as text and images. While fine-tuning is effective for adapting these multimodal language models (MLMs) to downstream tasks, full fine-tuning is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small portion of model weights. However, MLMs are difficult to interpret, making it challenging to identify which components are most effective to train when balancing efficiency and performance. We propose an attention-based interpretability method for MLMs that analyzes attention scores relative to image tokens. The core idea is to identify attention heads that focus on key objects in the image. We use this information to select optimal model components for PEFT in multimodal models. Our contributions include a method for identifying attention heads associated with key objects in images, its application to PEFT for image captioning, and the creation of a new dataset containing images, key object masks, and their textual descriptions. We conducted experiments on MLMs with 2-3 billion parameters to validate the method's effectiveness. By calculating Head Impact (HI) scores, we quantify an attention head's focus on key objects, indicating its significance in image understanding. Our fine-tuning experiments demonstrate that adapting layers with the highest HI scores leads to the most significant shifts in metrics compared to pre-trained, randomly selected, or lowest-HI-score layers. This indicates that fine-tuning a small percentage (around 0.01%) of parameters in these crucial layers can substantially influence image understanding capabilities.
comment: Accepted for ICAI-2025 conference
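As a rough illustration of the Head Impact idea in the abstract above, the sketch below scores one attention head by the share of its image-token attention that falls on key-object tokens. The scoring rule, the names `head_impact` and `rank_heads`, and the toy inputs are assumptions made for illustration; the abstract does not give the paper's exact definition.

```python
from typing import Dict, List, Sequence


def head_impact(attn_to_image: Sequence[float], key_mask: Sequence[bool]) -> float:
    """Share of a head's image-token attention that lands on key-object tokens.

    attn_to_image: attention weights from one query to each image token.
    key_mask: True where the image token overlaps the key-object mask.
    NOTE: hypothetical scoring rule, not the paper's definition.
    """
    if len(attn_to_image) != len(key_mask):
        raise ValueError("attention and mask must have the same length")
    total = sum(attn_to_image)
    if total == 0:
        return 0.0
    on_key = sum(a for a, m in zip(attn_to_image, key_mask) if m)
    return on_key / total


def rank_heads(scores: Dict[str, float]) -> List[str]:
    """Return head names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)
```

Under this assumed rule, heads (or the layers that contain them) with the highest scores would be the candidates for parameter-efficient fine-tuning.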
Scaling HuBERT for African Languages: From Base to Large and XL
Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see https://huggingface.co/collections/Orange/african-speech-foundation-models. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
comment: Journée d'études AFIA-ATALA 2025 : Technologies linguistiques pour les langues peu dotées
Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach
Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset. Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.
Tackling a Challenging Corpus for Early Detection of Gambling Disorder: UNSL at MentalRiskES 2025
Gambling disorder is a complex behavioral addiction that is challenging to understand and address, with severe physical, psychological, and social consequences. Early Risk Detection (ERD) on the Web has become a key task in the scientific community for identifying early signs of mental health behaviors based on social media activity. This work presents our participation in the MentalRiskES 2025 challenge, specifically in Task 1, aimed at classifying users at high or low risk of developing a gambling-related disorder. We proposed three methods based on a CPI+DMC approach, addressing predictive effectiveness and decision-making speed as independent objectives. The components were implemented using the SS3, BERT with extended vocabulary, and SBERT models, followed by decision policies based on historical user analysis. Although it was a challenging corpus, two of our proposals achieved the top two positions in the official results, performing notably in decision metrics. Further analysis revealed some difficulty in distinguishing between users at high and low risk, reinforcing the need to explore strategies to improve data interpretation and quality, and to promote more transparent and reliable ERD systems for mental disorders.
comment: In Iberian Language Evaluation Forum (IberLEF 2025), Zaragoza, Spain
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, which is an 8B-parameter MoE model trained on over 8 trillion tokens and is rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety also requires monitoring driver-facing views to detect risky events, such as mobile phone use while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.
comment: Accepted to MMLoSo 2025
Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla
The expansion of the Internet and social networks has led to an explosion of user-generated content. Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. Recognizing limitations in previous unimodal approaches, we systematically benchmark transformer-based language models (mBERT, DistilBERT, XLM-RoBERTa) and vision architectures (ViT, Swin, SwiftFormer, ResNet, DenseNet, MobileNet), utilizing the Uddessho dataset of 3,048 posts spanning six practical intent categories. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task. Experimental results show that intermediate fusion, particularly with mBERT and Swin Transformer, achieves an 84.11% macro-F1 score, establishing a new state of the art with an 8.4 percentage-point improvement over prior Bangla multimodal approaches. Our analysis demonstrates that integrating visual context substantially enhances intent classification. Cross-modal feature integration at intermediate levels provides an optimal balance between modality-specific representation and cross-modal learning. This research establishes new benchmarks and methodological standards for Bangla and other low-resource languages. We call our proposed framework BangACMM (Bangla Author Content MultiModal).
comment: Accepted at the 28th International Conference on Computer and Information Technology (ICCIT 2025). To be published in IEEE proceedings
MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)
Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as the underlying LLMs. Our evaluation shows that the RAG, MCP and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5, achieving an F1 score of 0.87 and a completion rate of 0.79. When cost is also taken into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show that the choice of the interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.
Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs
Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three-stage training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.
comment: 15 pages, 5 figures
BanglaSentNet: An Explainable Hybrid Deep Learning Framework for Multi-Aspect Sentiment Analysis with Cross-Domain Transfer Learning
Multi-aspect sentiment analysis of Bangla e-commerce reviews remains challenging due to limited annotated datasets, morphological complexity, code-mixing phenomena, and domain shift issues, affecting 300 million Bangla-speaking users. Existing approaches lack explainability and cross-domain generalization capabilities crucial for practical deployment. We present BanglaSentNet, an explainable hybrid deep learning framework integrating LSTM, BiLSTM, GRU, and BanglaBERT through dynamic weighted ensemble learning for multi-aspect sentiment classification. We introduce a dataset of 8,755 manually annotated Bangla product reviews across four aspects (Quality, Service, Price, Decoration) from major Bangladeshi e-commerce platforms. Our framework incorporates SHAP-based feature attribution and attention visualization for transparent insights. BanglaSentNet achieves 85% accuracy and 0.88 F1-score, outperforming standalone deep learning models by 3-7% and traditional approaches substantially. The explainability suite achieves 9.4/10 interpretability score with 87.6% human agreement. Cross-domain transfer learning experiments reveal robust generalization: zero-shot performance retains 67-76% effectiveness across diverse domains (BanglaBook reviews, social media, general e-commerce, news headlines); few-shot learning with 500-1000 samples achieves 90-95% of full fine-tuning performance, significantly reducing annotation costs. Real-world deployment demonstrates practical utility for Bangladeshi e-commerce platforms, enabling data-driven decision-making for pricing optimization, service improvement, and customer experience enhancement. This research establishes a new state-of-the-art benchmark for Bangla sentiment analysis, advances ensemble learning methodologies for low-resource languages, and provides actionable solutions for commercial applications.
comment: Submitted to Springer Nature Computer Science (SNCS) as an extended version of our ICDSAIA 2025 conference paper
Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models
This article presents the first comprehensive study on designing a baseline extractive question-answering (QA) system for the Hindi tourism domain, with a specialized focus on Varanasi, a cultural and spiritual hub renowned for its Bhakti-Bhaav (devotional ethos). Targeting ten tourism-centric subdomains (Ganga Aarti, Cruise, Food Court, Public Toilet, Kund, Museum, General, Ashram, Temple, and Travel), the work addresses the absence of language-specific QA resources in Hindi for culturally nuanced applications. In this paper, a dataset comprising 7,715 Hindi QA pairs pertaining to Varanasi tourism was constructed and subsequently augmented with 27,455 pairs generated via Llama zero-shot prompting. We propose a framework leveraging foundation models (BERT and RoBERTa) fine-tuned using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) to optimize parameter efficiency and task performance. Multiple variants of BERT, including language-specific pre-trained models (e.g., Hindi-BERT), are evaluated to assess their suitability for low-resource domain-specific QA. Evaluation metrics (F1, BLEU, and ROUGE-L) highlight trade-offs between answer precision and linguistic fluency. Experiments demonstrate that LoRA-based fine-tuning achieves competitive performance (85.3% F1) while reducing trainable parameters by 98% compared to SFT, striking a balance between efficiency and accuracy. Comparative analysis across models reveals that RoBERTa with SFT outperforms BERT variants in capturing contextual nuances, particularly for culturally embedded terms (e.g., Aarti, Kund). This work establishes a foundational baseline for Hindi tourism QA systems, emphasizing the role of LoRA in low-resource settings and underscoring the need for culturally contextualized NLP frameworks in the tourism domain.
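To make the scale of the reported parameter saving concrete, the sketch below counts LoRA's trainable parameters for a single weight matrix. The 768 x 768 shape and rank 8 are hypothetical values chosen for illustration; the abstract does not state the rank or target modules used, and its 98% figure refers to the whole model rather than one matrix.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of a d_out x d_in weight matrix's parameters that LoRA trains.

    LoRA freezes W and learns B (d_out x rank) and A (rank x d_in),
    so it trains rank * (d_out + d_in) parameters instead of d_out * d_in.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return lora / full


# Hypothetical BERT-base-like projection (768 x 768) with rank 8.
fraction = lora_trainable_fraction(768, 768, 8)
reduction_percent = round((1 - fraction) * 100, 1)
print(reduction_percent)
```

With these assumed values the adapter trains 12,288 parameters instead of 589,824, which is a reduction of the same order as the one reported.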
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
Listwise Preference Optimization with Element-wise Confusions for Aspect Sentiment Quad Prediction
Aspect sentiment quad prediction (ASQP) is inherently challenging because it requires predicting a structured quadruple of four core sentiment elements: aspect term (a), aspect category (c), opinion term (o), and sentiment polarity (s). Prior methods relying on marker-based prediction struggle to model the intricate relationships among elements and experience sharp performance declines when predicting higher-order elements (e.g., c and s) under standard supervised fine-tuning. To address these limitations, we employ reasoning-based generation to output both the quadruple and a natural language rationale under element prefixes within a unified template, encouraging explicit relational reasoning and interpretability. To further enhance element-wise alignment, we introduce a listwise preference optimization framework for improving structural validity and relational coherence. Specifically, we generate element-wise confusable candidates via syntactic and semantic proximity, then train the model with listwise objectives to prefer the gold candidates over closely competing alternatives. Extensive experiments on four benchmark datasets demonstrate that our framework effectively improves quadruple prediction accuracy and explanation consistency.
comment: 11 pages, 7 figures, and 6 tables
Are LLMs Good Safety Agents or a Propaganda Engine?
Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this behavior truly reflects their safety policies or instead indicates political censorship, as practiced by countries around the world, are lacking. Differentiating between safety-influenced refusals and politically motivated censorship is difficult. For this purpose, we introduce PSP, a dataset built specifically to probe the refusal behaviors of LLMs in an explicitly political context. PSP is built by formatting existing censored content from two data sources that are openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) the impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and 2) the vulnerability of models on PSP through prompt injection attacks (PIAs). Associating censorship with refusals on content with masked implicit intent, we find that most LLMs perform some form of censorship. We conclude by summarizing the major attributes that can cause a shift in refusal distributions across models and contexts of different countries.
comment: 15 pages, 7 tables, 4 figures
Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models
The complex reasoning ability of Large Language Models (LLMs) poses a critical bottleneck for their practical applications. Test-time expansion methods such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enhance reasoning by introducing intermediate reasoning structures, tree search, or graph-based exploration mechanisms. However, their reasoning strategies suffer from limited diversity, redundant search branches, and inadequate integration and error correction across heterogeneous reasoning paths. To address these limitations, we propose a novel reasoning framework called Multi-chain Graph Refinement & Selection (MGRS), which first generates multiple diverse reasoning trajectories for a given problem, refines candidate responses using a composite self- and cross-verification strategy, then constructs a reasoning relation graph and estimates the success rate of intermediate nodes, and finally computes cumulative success rates to select the most reliable answer and corresponding reasoning trajectory. Experimental results demonstrate that MGRS significantly advances both the reasoning capability and computational efficiency of reasoning enhancement methods. Across six benchmark datasets spanning four distinct tasks, MGRS achieves an average accuracy of 82.9%, outperforming state-of-the-art baselines by a clear margin of 2.1%. Remarkably, on the 24-point game, MGRS attains 100% accuracy for the first time, while delivering a 13.6x speed-up compared to the leading Forest of Thoughts framework.
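The final selection step described in the abstract above can be sketched as follows. This is a minimal illustration that assumes the cumulative success rate of a chain is the product of its per-node estimates; the abstract does not specify the aggregation rule, and `cumulative_success` and `select_answer` are hypothetical names.

```python
from math import prod


def cumulative_success(node_rates):
    """Product of per-step success estimates along one reasoning chain.

    ASSUMPTION: multiplying node rates is one plausible reading of
    "cumulative success rate"; the paper may aggregate differently.
    """
    return prod(node_rates)


def select_answer(chains):
    """Pick the (answer, score) pair whose chain has the highest cumulative score.

    chains: list of (answer, [per-node success rates]) tuples.
    """
    scored = [(answer, cumulative_success(rates)) for answer, rates in chains]
    return max(scored, key=lambda pair: pair[1])
```

Under this reading, one weak intermediate step lowers the score of the whole chain, so an answer reached through uniformly reliable steps is preferred.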
Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
Accurately and efficiently extracting main content from general web pages is of great significance for obtaining training data for large models. Well-pre-trained decoder-only generative language models offer excellent document comprehension capabilities and can thereby effectively enhance parsing quality. However, this approach remains constrained by issues such as context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces the input token count to 22% of that of raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels. Experimental results demonstrate that using only a 0.6B parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining a ROUGE-N F1 score of 81.58% (83.13% with fall-back strategy) on our proposed WebMainBench dataset.
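The controlled decoding idea in point (3) above can be illustrated with a small, library-free sketch: logits for every token outside an allowed label set are masked to negative infinity before the argmax, so the model can only emit a valid block label. The label names, token ids, and function names here are hypothetical and are not taken from Dripper's implementation.

```python
import math


def constrain_logits(logits, allowed_ids):
    """Set every logit outside allowed_ids to -inf so it can never be chosen."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def classify_block(logits, label_to_id):
    """Return the label whose token has the highest logit after masking."""
    masked = constrain_logits(logits, label_to_id.values())
    best_id = max(range(len(masked)), key=masked.__getitem__)
    id_to_label = {i: label for label, i in label_to_id.items()}
    return id_to_label[best_id]
```

For example, with logits `[5.0, 1.0, 2.0, 0.5]` and labels mapped to token ids 1 and 2, token 0 has the highest raw score but is masked out, so the label at id 2 is returned.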
Mind Reading or Misreading? LLMs on the Big Five Personality Test SC
We evaluate large language models (LLMs) for automatic personality prediction from text (APPT) under the binary Five Factor Model (BIG5). Five models -- including GPT-4 and lightweight open-source alternatives -- are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
comment: Funding: SoBigDatait (IR0000013), FAIR (PE00000013), ICSC (CN00000013)
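The point in the abstract above about aggregate metrics masking asymmetries can be seen in a toy example. The data below are invented for illustration and do not come from the paper: a classifier biased toward predicting trait presence scores well on accuracy while never recovering the absent class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class present in y_true."""
    recalls = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls[cls] = sum(y_pred[i] == cls for i in idx) / len(idx)
    return recalls


# Toy data: a classifier biased toward predicting trait presence (label 1).
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
```

Here accuracy is 0.75, yet recall is 1.0 for class 1 and 0.0 for class 0, which is the kind of asymmetry that only the per-class view exposes.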
Accent Placement Models for Rigvedic Sanskrit Text AACL
The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system (udātta, anudātta, svarita) whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.
comment: Submitted to AACL-IJCNLP 2025
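One way to read the Diacritic Error Rate mentioned in the abstract above is as the fraction of base characters whose accent marks differ between reference and prediction. The sketch below implements that reading; it is an assumption, since the abstract does not define DER, and it covers only the two Devanagari Vedic stress signs U+0951 and U+0952.

```python
UDATTA = "\u0951"    # Unicode name: DEVANAGARI STRESS SIGN UDATTA
ANUDATTA = "\u0952"  # Unicode name: DEVANAGARI STRESS SIGN ANUDATTA
ACCENTS = {UDATTA, ANUDATTA}


def split_accents(text):
    """Return (base characters, accent set attached to each base character)."""
    bases, marks = [], []
    for ch in text:
        if ch in ACCENTS and bases:
            marks[-1].add(ch)      # attach the accent to the preceding character
        elif ch not in ACCENTS:
            bases.append(ch)
            marks.append(set())
    return bases, marks


def diacritic_error_rate(reference, hypothesis):
    """Fraction of base characters whose accent marks differ.

    ASSUMPTION: both strings share the same unaccented text, so accent
    errors are isolated from grapheme errors.
    """
    ref_bases, ref_marks = split_accents(reference)
    hyp_bases, hyp_marks = split_accents(hypothesis)
    if ref_bases != hyp_bases:
        raise ValueError("unaccented text differs; this sketch needs aligned text")
    if not ref_bases:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_marks, hyp_marks))
    return errors / len(ref_bases)
```

Because accents are compared per base character rather than per code point, a missing or misplaced mark counts as an accent error without inflating the character-level error.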
Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.
comment: Under Peer Review
Conveying Imagistic Thinking in TCM Translation: A Prompt Engineering and LLM-Based Evaluation Framework
Traditional Chinese Medicine theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis. Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences. This study provides a cognitive, efficient and replicable HITL methodological pathway for translation of ancient, concept-dense texts like TCM.
comment: 3 figures
Standard Occupation Classifier -- A Natural Language Processing Approach
Standard Occupational Classifiers (SOC) are systems used to categorize and classify different types of jobs and occupations based on their similarities in terms of job duties, skills, and qualifications. Integrating these facets with Big Data from job advertisements offers the prospect of investigating labour demand that is specific to various occupations. This project investigates the use of recent developments in natural language processing to construct a classifier capable of assigning an occupation code to a given job advertisement. We develop various classifiers for both UK ONS SOC and US O*NET SOC, using different Language Models. We find that an ensemble model, which combines Google BERT and a Neural Network classifier while considering job title, description, and skills, achieved the highest prediction accuracy. Specifically, the ensemble model exhibited a classification accuracy of up to 61% for the lower (or fourth) tier of SOC, and 72% for the third tier of SOC. This model could provide up-to-date, accurate information on the evolution of the labour market using job advertisements.
Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts
Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model exhibits strong ranking capabilities with AUCROC up to 94.8% and AUPRC up to 83.3%, and maintains controlled errors with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as most informative, with compression-based features providing complementary signals. SHAP explainability reveals systematic linguistic evolution patterns, with the 19th century emerging as a pivot point across feature domains. Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points, yet the computational efficiency and interpretability of tree-based models still offer a scalable, explainable alternative to neural architectures.
Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses
Spelling variation (e.g. funnnn vs. fun) can influence the social perception of texts and their writers: we often have various associations with different forms of writing (is the text informal? does the writer seem young?). In this study, we focus on the social perception of spelling variation in online writing in English and study to what extent this perception is aligned between humans and large language models (LLMs). Building on sociolinguistic methodology, we compare LLM and human ratings on three key social attributes of spelling variation (formality, carefulness, age). We find generally strong correlations in the ratings between humans and LLMs. However, notable differences emerge when we analyze the distribution of ratings and when comparing between different types of spelling variation.
ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?
We present ShoppingComp, a challenging real-world benchmark for rigorously evaluating LLM-powered shopping agents on three core capabilities: precise product retrieval, expert-level report generation, and safety critical decision making. Unlike prior e-commerce benchmarks, ShoppingComp introduces highly complex tasks under the principle of guaranteeing real products and ensuring easy verifiability, adding a novel evaluation dimension for identifying product safety hazards alongside recommendation accuracy and report quality. The benchmark comprises 120 tasks and 1,026 scenarios, curated by 35 experts to reflect authentic shopping needs. Results reveal stark limitations of current LLMs: even state-of-the-art models achieve low performance (e.g., 11.22% for GPT-5, 3.92% for Gemini-2.5-Flash). These findings highlight a substantial gap between research benchmarks and real-world deployment, where LLMs make critical errors such as failure to identify unsafe product usage or falling for promotional misinformation, leading to harmful recommendations. ShoppingComp fills the gap and thus establishes a new standard for advancing reliable and practical agents in e-commerce.
Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification
This paper investigates fake news detection as a downstream evaluation of Transformer representations, benchmarking encoder-only and decoder-only pre-trained models (BERT, GPT-2, Transformer-XL) as frozen embedders paired with lightweight classifiers. Through controlled preprocessing comparing pooling versus padding and neural versus linear heads, results demonstrate that contextual self-attention encodings consistently transfer effectively. BERT embeddings combined with logistic regression outperform neural baselines on LIAR dataset splits, while analyses of sequence length and aggregation reveal robustness to truncation and advantages from simple max or average pooling. This work positions attention-based token encoders as robust, architecture-centric foundations for veracity tasks, isolating Transformer contributions from classifier complexity.
comment: Accepted at the IEEE 7th Computing, Communications and IoT Applications Conference (ComComAp 2025), Madrid, Spain, December 2025. 6 pages
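The max and average pooling mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the frozen embedder has already produced one vector per token; the function names and the plain-list representation are hypothetical and are not taken from the paper.

```python
def mean_pool(token_vectors):
    """Average the per-token vectors into one fixed-size vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]


def max_pool(token_vectors):
    """Take the element-wise maximum over the per-token vectors."""
    dim = len(token_vectors[0])
    return [max(v[d] for v in token_vectors) for d in range(dim)]


# Two tokens, two embedding dimensions (toy values, not real embeddings).
tokens = [[1.0, 2.0], [3.0, 6.0]]
pooled_mean = mean_pool(tokens)
pooled_max = max_pool(tokens)
```

Either pooled vector has the same size regardless of sequence length, which is what lets a lightweight classifier such as logistic regression consume it directly.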
Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model's self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.
comment: Under review
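As a rough illustration of the entropy-level gate described in the abstract above, the sketch below computes the Shannon entropy of a next-token distribution and compares it with a threshold. The threshold value and the function names are assumptions made for this example; the abstract does not specify how FLy defines or tunes its gate.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allows_alternatives(probs, threshold=1.0):
    """Return True when the distribution is spread out enough that several
    tokens are plausible. The threshold is a hypothetical value."""
    return token_entropy(probs) > threshold


flat = [0.25, 0.25, 0.25, 0.25]    # many plausible continuations
peaked = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic
```

Under this sketch a mismatch at a `flat` position could be considered for loose acceptance, while a mismatch at a `peaked` position would be treated as a genuine error.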
Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework ICASSP 2026
We study idiom-based visual puns--images that align an idiom's literal and figurative meanings--and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.
comment: Submitted to ICASSP 2026 (under review)
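The four-step loop in the abstract above can be sketched as follows. The three models are passed in as plain callables, so `write_prompt`, `render`, `guess_idiom`, and `refine` are hypothetical stand-ins, and the case-insensitive exact match used as the success test is an assumption rather than the paper's criterion.

```python
def generate_visual_pun(idiom, write_prompt, render, guess_idiom, refine, max_steps=5):
    """Iterate prompt -> image -> guess -> refine until the idiom is recognized.

    Returns (prompt, image, step), where step is None if the limit was reached.
    """
    prompt = write_prompt(idiom)                 # (i) LLM writes a visual prompt
    image = None
    for step in range(1, max_steps + 1):
        image = render(prompt)                   # (ii) T2IM synthesizes an image
        guess = guess_idiom(image)               # (iii) MLLM infers the idiom
        if guess.strip().lower() == idiom.strip().lower():
            return prompt, image, step
        prompt = refine(prompt, guess)           # (iv) LLM refines the prompt
    return prompt, image, None


# Toy stand-ins: the "image" is just the prompt text, and recognition
# succeeds only after one refinement.
result = generate_visual_pun(
    "Piece of cake",
    write_prompt=lambda idiom: "draw " + idiom,
    render=lambda prompt: prompt,
    guess_idiom=lambda image: "piece of cake" if "v2" in image else "unknown",
    refine=lambda prompt, guess: prompt + " v2",
)
```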
Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols AACL
Emotions are a fundamental aspect of artistic expression. Due to their abstract nature, there is a broad spectrum of emotion realization in artworks. These are subject to historical change and their analysis requires expertise in art history. In this article, we investigate which aspects of emotional expression can be detected by current (2025) vision language models (VLMs). We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask these models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well and often also which emotions they depict and how they are expressed. The models perform best for concrete images but fail for highly abstract or highly symbolic images. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of providing inconsistent answers to related questions.
comment: Accepted for publication at the IJCNLP-AACL workshop on Multimodal Models for Low-Resource Contexts and Social Impact
Language-conditioned world model improves policy generalization by reading environmental descriptions
To interact effectively with humans in the real world, it is important for agents to understand language that describes the dynamics of the environment--that is, how the environment behaves--rather than just task instructions specifying "what to do". Understanding this dynamics-descriptive language is important for human-agent interaction and agent behavior. Recent work addresses this problem using a model-based approach: language is incorporated into a world model, which is then used to learn a behavior policy. However, these existing methods either do not demonstrate policy generalization to unseen games or rely on limiting assumptions, for instance, that the latency induced by inference-time planning is tolerable for the target task or that expert demonstrations are available. Expanding on this line of research, we focus on improving policy generalization from a language-conditioned world model while dropping these assumptions. We propose a model-based reinforcement learning approach, where a language-conditioned world model is trained through interaction with the environment, and a policy is learned from this model--without planning or expert demonstrations. We introduce the Language-aware Encoder for Dreamer World Model (LED-WM), built on top of DreamerV3. LED-WM features an observation encoder that uses an attention mechanism to explicitly ground language descriptions to entities in the observation. We show that policies trained with LED-WM generalize more effectively to unseen games described by novel dynamics and language compared to other baselines in several settings in two environments: MESSENGER and MESSENGER-WM. To highlight how the policy can leverage the trained world model before real-world deployment, we demonstrate that the policy can be improved through fine-tuning on synthetic test trajectories generated by the world model.
comment: NeurIPS 2025. Workshop: LAW 2025: Bridging Language, Agent, and World Models
ORION: Teaching Language Models to Reason Efficiently in the Language of Thought
Large Reasoning Models (LRMs) achieve strong performance in mathematics, code generation, and task planning, but their reliance on long chains of verbose "thinking" tokens leads to high latency, redundancy, and incoherent reasoning paths. Inspired by the Language of Thought Hypothesis, which posits that human reasoning operates over a symbolic, compositional mental language called Mentalese, we introduce a framework that trains models to reason in a similarly compact style. Mentalese encodes abstract reasoning as ultra-compressed, structured tokens, enabling models to solve complex problems with far fewer steps. To improve both efficiency and accuracy, we propose SHORTER LENGTH PREFERENCE OPTIMIZATION (SLPO), a reinforcement learning method that rewards concise solutions that stay correct, while still allowing longer reasoning when needed. Applied to Mentalese-aligned models, SLPO yields significantly higher compression rates by enabling concise reasoning that preserves the benefits of detailed thinking without the computational overhead. Across benchmarks including AIME 2024 and 2025, MinervaMath, OlympiadBench, Math500, and AMC, our ORION models produce reasoning traces with 4-16x fewer tokens, achieve up to 5x lower inference latency, and reduce training costs by 7-9x relative to the DeepSeek R1 Distilled model, while maintaining 90-98% of its accuracy. ORION also surpasses Claude and ChatGPT-4o by up to 5% in accuracy while maintaining 2x compression. These results show that Mentalese-style compressed reasoning offers a step toward human-like cognitive efficiency, enabling real-time, cost-effective reasoning without sacrificing accuracy.
FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing
Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback for K-12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the problem of Fine-grained Error Analysis for English Learners and present the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, and a well-developed English writing error taxonomy. Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to explore their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs' ability to perform fine-grained error analysis, highlighting the need for advancements in particular methods for educational applications.
comment: 19 pages, 7 figures, and 4 tables. The dataset is available at https://huggingface.co/datasets/Feanel/FEANEL
JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge
We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models' legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.
comment: Three tables and one figure
RAG System for Supporting Japanese Litigation Procedures: Faithful Response Generation Complying with Legal Norms SIGIR
This study discusses the essential components that a Retrieval-Augmented Generation (RAG)-based LLM system should possess in order to support Japanese medical litigation procedures complying with legal norms. In litigation, expert commissioners, such as physicians, architects, accountants, and engineers, provide specialized knowledge to help judges clarify points of dispute. When considering the substitution of these expert roles with a RAG-based LLM system, the constraint of strict adherence to legal norms is imposed. Specifically, three requirements arise: (1) the retrieval module must retrieve appropriate external knowledge relevant to the disputed issues in accordance with the principle prohibiting the use of private knowledge, (2) the responses generated must originate from the context provided by the RAG and remain faithful to that context, and (3) the retrieval module must reference external knowledge with appropriate timestamps corresponding to the issues at hand. This paper discusses the design of a RAG-based LLM system that satisfies these requirements.
comment: This is a preprint version of a paper reviewed and accepted at BREV-RAG 2025: Beyond Relevance-based EValuation of RAG Systems, a SIGIR-AP 2025 workshop
Mitigating Semantic Drift: Evaluating LLMs' Efficacy in Psychotherapy through MI Dialogue Summarization
Recent advancements in large language models (LLMs) have shown their potential across both general and domain-specific tasks. However, there is a growing concern regarding their lack of sensitivity, factual incorrectness in responses, inconsistent expressions of empathy, bias, hallucinations, and overall inability to capture the depth and complexity of human understanding, especially in low-resource and sensitive domains such as psychology. To address these challenges, our study employs a mixed-methods approach to evaluate the efficacy of LLMs in psychotherapy. We use LLMs to generate precise summaries of motivational interviewing (MI) dialogues and design a two-stage annotation scheme based on key components of the Motivational Interviewing Treatment Integrity (MITI) framework, namely evocation, collaboration, autonomy, direction, empathy, and a non-judgmental attitude. Using expert-annotated MI dialogues as ground truth, we formulate multi-class classification tasks to assess model performance under progressive prompting techniques, incorporating one-shot and few-shot prompting. Our results offer insights into LLMs' capacity for understanding complex psychological constructs and highlight best practices to mitigate "semantic drift" in therapeutic settings. Our work contributes to the MI community a high-quality annotated dataset that addresses data scarcity in low-resource domains, as well as critical insights for using LLMs for precise contextual interpretation in complex behavioral therapy.
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
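The role of the baseline $V$ in the advantage $A=r-V$ can be illustrated with a minimal sketch for a softmax policy over bandit arms. It assumes the batch was collected by some other behavior policy and applies no importance weighting; the function names, the learning rate, and the batch format are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def off_policy_reinforce_step(logits, batch, baseline, lr=1.0):
    """One gradient-ascent step on the mean of (r - V) * log pi(a).

    `batch` is a list of (action, reward) pairs drawn from a behavior policy
    that may differ from the current policy (no importance weights applied).
    """
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for action, reward in batch:
        advantage = reward - baseline  # A = r - V
        for k in range(len(logits)):
            indicator = 1.0 if k == action else 0.0
            # d/d logit_k of log pi(action) for a softmax policy
            grad[k] += advantage * (indicator - pi[k]) / len(batch)
    return [x + lr * g for x, g in zip(logits, grad)]


# A rewarded sample with V = 0 pushes probability toward the sampled arm.
pushed_up = off_policy_reinforce_step([0.0, 0.0], [(0, 1.0)], baseline=0.0)
# An unrewarded sample with V = 0.5 pushes probability away from it.
pushed_down = off_policy_reinforce_step([0.0, 0.0], [(0, 0.0)], baseline=0.5)
```

With $V=0$ and rewards in $[0,1]$, zero-reward samples contribute nothing and the update only reinforces rewarded actions; raising $V$ makes the same zero-reward samples actively lower the probability of their actions.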
Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updates to weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
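The sign-agreement rule behind Disruption Masking can be sketched in a few lines. The example assumes gradients are flat lists of floats and treats a zero in either gradient as disagreement; both choices, and the function name, are illustrative assumptions rather than details from the paper.

```python
def disruption_mask(unlearn_grad, retain_grad):
    """Keep an unlearning-gradient entry only where its sign matches the
    sign of the retaining gradient; zero out every other entry."""
    return [
        u if u * r > 0.0 else 0.0  # same sign <=> positive product
        for u, r in zip(unlearn_grad, retain_grad)
    ]


unlearn = [1.0, -2.0, 3.0, 0.5]
retain = [0.5, 1.0, 2.0, -0.1]
masked = disruption_mask(unlearn, retain)  # entries 1 and 3 are zeroed
```

Only the first and third weights would be updated here, since those are the positions where moving along the unlearning gradient does not conflict with the retaining gradient.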
Toward Honest Language Models for Deductive Reasoning
Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose ACNCHOR, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
Continual Learning of Domain Knowledge from Human Feedback in Text-to-SQL
Large Language Models (LLMs) can generate SQL queries from natural language questions but struggle with database-specific schemas and tacit domain knowledge. We introduce a framework for continual learning from human feedback in text-to-SQL, where a learning agent receives natural language feedback to refine queries and distills the revealed knowledge for reuse on future tasks. This distilled knowledge is stored in a structured memory, enabling the agent to improve execution accuracy over time. We design and evaluate multiple variations of a learning agent architecture that vary in how they capture and retrieve past experiences. Experiments on the BIRD benchmark Dev set show that memory-augmented agents, particularly the Procedural Agent, achieve significant accuracy gains and error reduction by leveraging human-in-the-loop feedback. Our results highlight the importance of transforming tacit human expertise into reusable knowledge, paving the way for more adaptive, domain-aware text-to-SQL systems that continually learn from a human-in-the-loop.
comment: 34 pages, 6 figures, 4 tables
OmniRouter: Budget and Performance Controllable Multi-LLM Routing
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually, which overlooks global budget constraints, resulting in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning models that minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs. After obtaining the predicted cost and performance, we utilize a constrained optimizer for cost-optimal assignments that employs Lagrangian dual decomposition with adaptive multipliers. It iteratively converges toward the globally optimal query-model allocation, dynamically balancing latency minimization against quality thresholds while adhering to heterogeneous capacity constraints. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https://github.com/dongyuanjushi/OmniRouter.
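The constrained formulation in the abstract above (minimize total cost subject to a required performance level) can be illustrated with a toy Lagrangian relaxation. This is a sketch under stated assumptions, not OmniRouter's optimizer: it uses a single multiplier with a fixed step size, enforces only an average-performance constraint, ignores capacity constraints, and takes the predicted `cost` and `perf` tables as given. All names and parameters are hypothetical.

```python
def route_with_budget(cost, perf, target, step=1.0, iters=50):
    """Assign one model per query via dual (sub)gradient ascent.

    cost[q][m], perf[q][m]: predicted cost and quality of model m on query q.
    target: required average predicted quality across all queries.
    Returns the cheapest feasible assignment seen, or the last assignment
    if no feasible one was found.
    """
    lam = 0.0                      # Lagrange multiplier for the quality constraint
    best, best_cost = None, float("inf")
    n = len(cost)
    assignment = []
    for _ in range(iters):
        # For fixed lam the problem decomposes: each query minimizes cost - lam * perf.
        assignment = [
            min(range(len(cost[q])), key=lambda m: cost[q][m] - lam * perf[q][m])
            for q in range(n)
        ]
        total_cost = sum(cost[q][m] for q, m in enumerate(assignment))
        avg_perf = sum(perf[q][m] for q, m in enumerate(assignment)) / n
        if avg_perf >= target and total_cost < best_cost:
            best, best_cost = assignment, total_cost
        # Raise lam when quality falls short, lower it (down to 0) otherwise.
        lam = max(0.0, lam + step * (target - avg_perf))
    return best if best is not None else assignment


# One query, two models: a cheap weak model and a costlier strong one.
strict = route_with_budget([[1.0, 2.0]], [[0.2, 0.9]], target=0.8)
lenient = route_with_budget([[1.0, 2.0]], [[0.2, 0.9]], target=0.1)
```

With a strict quality target the multiplier grows until the stronger model becomes worth its cost; with a lenient target the multiplier stays at zero and the cheapest model is chosen.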
Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice
Bangladesh's low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. Even under a conservative upper bound, Mina operates at just 0.12-0.61% of typical legal consultation costs in Bangladesh, yielding a 99.4-99.9% cost reduction relative to human-provided services. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
comment: Homepage: https://internlm.github.io/StarBench/
Local Hybrid Retrieval-Augmented Document QA ACL
Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
comment: 10 pages, 5 figures, 3 tables; conference-style (ACL format); fully local RAG system
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language and quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are maintained at https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.
comment: 2025.11.28 Updated Version
Adversarial Confusion Attack: Disrupting Multimodal Large Language Models
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
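The objective described in the Adversarial Confusion Attack abstract (maximizing next-token entropy over an ensemble, optimized with PGD) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names are hypothetical, the ensemble objective is taken to be a plain mean, the perturbation budget is assumed to be an L-infinity ball, and the gradient is assumed to come from some autodiff framework. It is not the authors' implementation.

```python
import math

def next_token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one next-token distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ensemble_entropy(logits_per_model):
    """Mean next-token entropy over an ensemble (assumed aggregation rule)."""
    return sum(next_token_entropy(l) for l in logits_per_model) / len(logits_per_model)

def pgd_ascent_step(x, grad, x_clean, step_size, epsilon):
    """One signed-gradient ascent step, projected back into an assumed
    L-infinity ball of radius epsilon around the clean input and into [0, 1]."""
    out = []
    for xi, gi, ci in zip(x, grad, x_clean):
        sign = (gi > 0) - (gi < 0)
        moved = xi + step_size * sign
        moved = min(max(moved, ci - epsilon), ci + epsilon)  # stay within the budget
        out.append(min(max(moved, 0.0), 1.0))                # stay a valid pixel value
    return out
```

In a real attack, `grad` would be the gradient of `ensemble_entropy` with respect to the image pixels; here it is simply passed in so the sketch stays self-contained.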
REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance
The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, which is crucial for real-time use. To address these challenges, we propose the REason-guided Fact-checking with Latent EXplanations (REFLEX) paradigm, a plug-and-play, self-refining paradigm that leverages the internal knowledge in the backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, REFLEX achieves state-of-the-art performance. Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.
Efficient Reasoning via Thought-Training and Thought-Free Inference
Recent advances in large language models (LLMs) have leveraged explicit Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most existing methods primarily focus on compressing verbose reasoning outputs. These Long-to-Short transformations aim to improve efficiency, but require a large amount of short CoT data. In this work, we introduce 3TF (Thought-Training and Thought-Free inference), a framework for efficient reasoning that takes a Short-to-Long perspective. We first train a hybrid model that can operate in both reasoning and non-reasoning modes, and then further train it on CoT-annotated data to internalize structured reasoning, while enforcing concise, thought-free outputs at inference time using the no-reasoning mode. Unlike compression-based approaches, 3TF improves the reasoning quality of non-reasoning outputs, enabling models to perform rich internal reasoning implicitly while keeping external outputs short. Empirically, 3TF-trained models obtain large improvements on reasoning benchmarks under thought-free inference, demonstrating that high-quality reasoning can be learned and executed implicitly without explicit step-by-step generation.
comment: 11 pages, 4 figures
RvLLM: LLM Runtime Verification with Domain Knowledge
Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.
comment: 24 pages, 11 tables, 13 figures
LongCat-Flash-Omni Technical Report
We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
Normal forms in Virus Machines
In the present work, we further study the computational power of virus machines (VMs in short). VMs provide a computing paradigm inspired by the transmission and replication networks of viruses. VMs consist of process units (called hosts) structured by a directed graph whose arcs are called channels and an instruction graph that controls the transmissions of virus objects among hosts. The present work complements our understanding of the computing power of VMs by introducing normal forms; these expressions restrict the features in a given computing model. Some of the features that we restrict in our normal forms include (a) the number of hosts, (b) the number of instructions, and (c) the number of virus objects in each host. After we recall some known results on the computing power of VMs, we give our series of normal forms, such as the size of the loops in the network, proving new characterisations of families of sets, such as finite sets, semilinear sets, or recursively enumerable sets (NRE).
comment: 24 pages, 14 figures
InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Reinforcement learning has powered many of the recent breakthroughs in large language models, especially for tasks where rewards can be computed automatically, such as code generation. However, these methods deteriorate in open-ended domains like medical consultation, where feedback is inherently ambiguous, highly context-dependent, and cannot be reduced to a reliable scalar signal. In such settings, RL must either rely on supervision-intensive reward models that often fail to generalize, or fall into pathological behaviors such as reward hacking, an especially troubling risk for high-stakes medical dialogue. To address these limitations, we introduce ORBIT, an open-ended rubric-based incremental training framework for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with dynamically constructed rubrics that serve as adaptive guides for incremental RL. Instead of relying on external medical knowledge bases or handcrafted rule sets, ORBIT uses rubric-driven feedback to steer the learning process. Its judge component can be instantiated with general-purpose instruction-following LLMs, removing the need for any task-specific fine-tuning. Applied to the Qwen3-4B-Instruct model, ORBIT raises the HealthBench-Hard score from 7.0 to 27.5 using only 2k training samples, achieving SOTA performance for models at this scale. With larger rubric datasets, ORBIT-trained models further compete with the strongest open-source baselines on HealthBench-Hard. Our analysis shows that rubric-guided RL consistently improves consultation quality across diverse medical scenarios. We also apply this rubric generation and training pipeline to InfoBench, where ORBIT enhances instruction-following performance, highlighting the generality of rubric-based feedback.
Atom of Thoughts for Markov LLM Test-Time Scaling NeurIPS 2025
Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning can be achieved by solving a series of independent and self-contained subquestions. These subquestions are essentially atomic questions, exhibiting the memoryless property similar to Markov processes. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a simplified question that maintains answer equivalence with the original problem. This answer preservation enables the iterative decomposition-contraction process to naturally form a meaningful Markov reasoning process. Furthermore, these atomic states can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code is available at https://github.com/qixucen/atom.
comment: Accepted to NeurIPS 2025
Toward Equitable Access: Leveraging Crowdsourced Reviews to Investigate Public Perceptions of Health Resource Accessibility
Monitoring health resource disparities during public health crises is critical, yet traditional methods, like surveys, lack the requisite speed and spatial granularity. This study introduces a novel framework that leverages: 1) crowdsourced Google Maps reviews (2018-2021) and 2) advanced NLP (DeBERTa) to create a high-resolution, spatial-temporal index of public perception of health resource accessibility in the United States. We then employ Partial Least Squares (PLS) regression to link this perception index to a range of socioeconomic and demographic drivers. Our results quantify significant spatial-temporal shifts in perceived access, confirming that disparities peaked during the COVID-19 crisis and only partially recovered post-peak. We identify political affiliation, racial composition, and educational attainment as primary determinants of these perceptions. This study validates a scalable method for real-time health equity monitoring and provides actionable evidence for interventions to build a more resilient healthcare infrastructure.
Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.
Self Iterative Label Refinement via Robust Unlabeled Learning NeurIPS 2025
Recent advances in large language models (LLMs) have yielded impressive performance on various tasks, yet they often depend on high-quality feedback that can be costly. Self-refinement methods attempt to leverage LLMs' internal evaluation mechanisms with minimal human supervision; however, these approaches frequently suffer from inherent biases and overconfidence, especially in domains where the models lack sufficient internal knowledge, resulting in performance degradation. As an initial step toward enhancing self-refinement for broader applications, we introduce an iterative refinement pipeline that employs the Unlabeled-Unlabeled learning framework to improve LLM-generated pseudo-labels for classification tasks. By exploiting two unlabeled datasets with differing positive class ratios, our approach iteratively denoises and refines the initial pseudo-labels, thereby mitigating the adverse effects of internal biases with minimal human supervision. Evaluations on diverse datasets, including low-resource language corpora, patent classifications, and protein structure categorizations, demonstrate that our method consistently outperforms both the initial LLM's classification performance and the self-refinement approaches by cutting-edge models (e.g., GPT-4o and DeepSeek-R1). Moreover, we experimentally confirm that our refined classifier facilitates effective post-training alignment for safety in LLMs and demonstrate successful self-refinement in generative tasks as well. Our code is available at https://github.com/HikaruAsano/self-iterative-label-refinement.
comment: To appear in the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
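The idea of learning from two unlabeled datasets with differing positive class ratios, as mentioned in the Self Iterative Label Refinement abstract, can be illustrated with the basic unbiased Unlabeled-Unlabeled risk estimator. The sketch below is an assumption-laden illustration: it shows the textbook unbiased estimator rather than the robust variant the paper refers to, and the function name, argument names, and the requirement that both class priors are known are all hypothetical choices made here.

```python
def uu_risk(losses_a, losses_b, theta_a, theta_b, test_prior):
    """Unbiased classification-risk estimate from two unlabeled sets.

    losses_a / losses_b: one (loss_if_positive, loss_if_negative) pair per
    example in each unlabeled set, i.e. the classifier's loss if that example
    were labeled +1 or -1.
    theta_a / theta_b: assumed-known positive class ratios, with theta_a > theta_b.
    test_prior: positive class ratio of the target distribution.
    """
    if theta_a <= theta_b:
        raise ValueError("theta_a must be strictly larger than theta_b")

    def mean(values):
        return sum(values) / len(values)

    a_pos = mean([lp for lp, _ in losses_a])
    a_neg = mean([ln for _, ln in losses_a])
    b_pos = mean([lp for lp, _ in losses_b])
    b_neg = mean([ln for _, ln in losses_b])
    gap = theta_a - theta_b

    # Each unlabeled density is a mixture of the class-conditional densities,
    # so the class-conditional expectations follow by solving a 2x2 system.
    expected_pos_loss = ((1 - theta_b) * a_pos - (1 - theta_a) * b_pos) / gap
    expected_neg_loss = (theta_a * b_neg - theta_b * a_neg) / gap
    return test_prior * expected_pos_loss + (1 - test_prior) * expected_neg_loss
```

With finite samples this estimate can become negative, which is one reason robust or corrected variants exist; the sketch omits any such correction.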
Structured Prompting Enables More Robust Evaluation of Language Models
As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we approximate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks ($+$2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing chain-of-thought reduces LM sensitivity to prompt design (smaller $Δ$ across prompts). To our knowledge, this is the first benchmarking study to systematically integrate structured prompting into an established evaluation framework, demonstrating how scalable performance-ceiling approximation yields more robust, decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs NeurIPS 2025
Large Language Models (LLMs) exhibit impressive performance on complex reasoning tasks, yet they frequently fail on basic numerical problems, producing incorrect outputs. Inspired by Benford's Law, a statistical pattern in which lower digits occur more frequently as leading digits, we hypothesize that the skewed digit distributions in web-collected corpora may be learned by LLMs during pretraining, leading to biased numerical generation. To investigate the hypothesis, we first examine whether digit frequencies in the pretraining corpus (OLMo2) follow Benford's law. We then construct an evaluation benchmark in which the ground-truth digits are uniformly distributed within each of the seven numerical reasoning tasks. Our evaluation results demonstrate that leading open-source LLMs show a consistent pattern of digit bias that resembles Benford's law. Through logit-lens tracing and neuron-level dissection, we identify that this bias arises predominantly from a small subset of highly digit-selective feed-forward network (FFN) neurons in the deeper layers. Finally, we demonstrate that pruning these neurons mitigates imbalanced overgeneration and partially corrects erroneous outputs, providing causal evidence that fine-grained pretraining digit bias can propagate into model behavior. Our findings reveal a fundamental connection between corpus-level statistics and symbolic failure modes in LLMs, offering a new lens for diagnosing and mitigating hallucinations in numerical tasks.
comment: NeurIPS 2025
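For readers unfamiliar with Benford's law, referenced in the Benford's Curse abstract above, the expected frequency of leading digit d is log10(1 + 1/d). The short sketch below compares that expectation with empirical leading-digit frequencies; the helper names are hypothetical, and the powers-of-two sequence is only a convenient stand-in for corpus data, not anything used in the paper.

```python
import math
from collections import Counter

def benford_probability(digit):
    """Expected frequency of a leading digit (1-9) under Benford's law."""
    return math.log10(1 + 1 / digit)

def leading_digit_frequencies(numbers):
    """Empirical leading-digit frequencies for positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Powers of two are a classic sequence that closely follows Benford's law.
observed = leading_digit_frequencies([2 ** k for k in range(100)])
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_probability(d), 3))
```

The same comparison could be run on digits extracted from a text corpus, or on the digits a model generates, to check for the skew the abstract describes.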
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training NeurIPS 2025
Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. To this end, we perform a controlled study that scales the vocabulary of the language model from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text -- formalized via Kolmogorov complexity -- and show that larger vocabularies reduce this complexity. Above 24K, every common word is already tokenized as a single token, so enlarging vocabulary only deepens the relative token-frequency imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. The same frequent words cover roughly 75% of tokens in downstream benchmarks, so this training advantage transfers intact. We further show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results recast "bigger vocabularies help" as "lowering complexity of tokenized text helps," offering a simple, principled knob for tokenizer-model co-design and clarifying the loss dynamics that govern language model scaling in pre-training.
comment: NeurIPS 2025
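The coverage figure in the vocabulary-imbalance abstract above (the share of a text accounted for by the most frequent words) is a simple statistic that can be sketched as follows. The function name and the toy word list are hypothetical; the paper's own tokenization and word-level accounting are not reproduced here.

```python
from collections import Counter

def top_k_coverage(words, k):
    """Fraction of all word occurrences accounted for by the k most frequent words."""
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(words)

# Toy example: one word dominates, mirroring a heavily skewed frequency distribution.
toy = ["the"] * 6 + ["cat"] * 3 + ["zyx"]
print(top_k_coverage(toy, 1), top_k_coverage(toy, 2))
```

Applied to a benchmark corpus with k set to the number of frequent words under study, this yields the kind of coverage ratio the abstract reports.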
Deep Improvement Supervision
Recently, it was shown that small, looped architectures, such as Tiny Recursive Models (TRMs), can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and implicit policy improvement algorithm. Building on these insights, we propose a novel training scheme that provides a target for each loop during training. We demonstrate that our approach significantly enhances training efficiency. Our method reduces the total number of forward passes by 18x and eliminates halting mechanisms, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%. Code and models will be released publicly.
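The scoring mechanism outlined in the ReGATE abstract (teacher per-token losses fused with an exponential moving average of student difficulty, then used to keep only informative tokens) might look roughly like the sketch below. The fusion rule (a convex combination), the `mix`, `decay`, and `keep_ratio` parameters, and both function names are assumptions made for illustration; the abstract does not specify them.

```python
def update_ema(ema, student_losses, decay=0.9):
    """Exponential moving average of per-token student losses (difficulty estimate)."""
    return [decay * e + (1 - decay) * s for e, s in zip(ema, student_losses)]

def select_tokens(teacher_losses, ema_difficulty, keep_ratio=0.5, mix=0.5):
    """Return the positions of the tokens to keep, in their original order.

    Scores are an assumed convex combination of the teacher's per-token loss
    and the student's EMA difficulty; the highest-scoring tokens are kept.
    """
    scores = [mix * t + (1 - mix) * d for t, d in zip(teacher_losses, ema_difficulty)]
    keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Tokens whose positions are not returned would be skipped in the student's forward pass, which is where the training-time savings would come from under these assumptions.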
Human-Computer Interaction
Improving motor imagery decoding methods for an EEG-based mobile brain-computer interface in the context of the 2024 Cybathlon
Motivated by the Cybathlon 2024 competition, we developed a modular, online EEG-based brain-computer interface to increase accessibility for individuals with severe mobility impairments. Our system uses three mental and motor imagery classes to control up to five control signals. The pipeline consists of four modules: data acquisition, preprocessing, classification, and the transfer function to map classification output to control dimensions. We use three diagonalized structured state-space sequence layers as a deep learning classifier. We developed a training game for our pilot where the mental tasks control the game during quick-time events. We implemented a mobile web application for live user feedback. The components were designed with a human-centred approach in collaboration with the tetraplegic user. We achieve up to 84% classification accuracy in offline analysis using an S4D-layer-based model. In a competition setting, our pilot successfully completed one task; we attribute the reduced performance in this context primarily to factors such as stress and the challenging competition environment. Following the Cybathlon, we further validated our pipeline with the original pilot and an additional participant, achieving a success rate of 73% in real-time gameplay. We also compare our model to the EEGEncoder, which is slower in training but has a higher performance. The S4D model outperforms the reference machine learning models. We provide insights into developing a framework for portable BCIs, bridging the gap between the laboratory and daily life. Specifically, our framework integrates modular design, real-time data processing, user-centred feedback, and low-cost hardware to deliver an accessible and adaptable BCI solution, addressing critical gaps in current BCI applications.
AugGen: Augmenting Task-Based Learning in Professional Creative Software with LLM-Generated Scaffolded UIs
Professional creative software often presents steep learning curves due to complex interfaces, lack of structured task-aware guidance, and unfamiliar domain terminology. To address these challenges and augment user learning experience, we introduce AugGen, a method for generating scaffolded user interfaces that simplify interface complexity and support task-based learning. With the user's task, our method surfaces task-relevant tools to reduce distracting features, organizes the tools around task workflow stages to offer execution guidance, connects tools with domain concepts to foster learning engagement, and progressively discloses advanced features to manage learning progress. To evaluate the method, we used our LLM-assisted pipeline to generate two task-specific scaffolded UIs and deployed them in Blender, our professional 3D modeling testbed. We invited both beginner (N=32) and expert (N=8) users to evaluate our implemented interfaces. Results show that the scaffolded interfaces significantly reduced user-perceived task load, enhanced task performance via embedded guidance, and augmented concept learning during task execution.
comment: arXiv admin note: substantial text overlap with arXiv:2505.12101
Is Passive Expertise-Based Personalization Enough? A Case Study in AI-Assisted Test-Taking EMNLP 2025
Novice and expert users have different systematic preferences in task-oriented dialogues. However, whether catering to these preferences actually improves user experience and task performance remains understudied. To investigate the effects of expertise-based personalization, we first built a version of an enterprise AI assistant with passive personalization. We then conducted a user study where participants completed timed exams, aided by the two versions of the AI assistant. Preliminary results indicate that passive personalization helps reduce task load and improve assistant perception, but reveal task-specific limitations that can be addressed through providing more user agency. These findings underscore the importance of combining active and passive personalization to optimize user experience and effectiveness in enterprise task-oriented environments.
comment: Accepted into Tailoring AI: Exploring Active and Passive LLM Personalization (PALS) workshop at EMNLP 2025
Quantifying the Privacy-Utility Trade-off in GPS-based Daily Stress Recognition using Semantic Features
Psychological stress is a widespread issue that significantly impacts student well-being and academic performance. Effective remote stress recognition is crucial, yet existing methods often rely on wearable devices or GPS-based clustering techniques that pose privacy risks. In this study, we introduce a novel, end-to-end privacy-enhanced framework for semantic location encoding using a self-hosted OSM engine and an LLM-bootstrapped static map. We rigorously quantify the privacy-utility trade-off and demonstrate (via LOSO validation) that our Privacy-Aware (PA) model achieves performance statistically indistinguishable from a non-private model, proving that utility does not require sacrificing privacy. Feature importance analysis highlights that recreational activity time, working time, and travel time play a significant role in stress recognition.
Can Intelligent User Interfaces Engage in Philosophical Discussions? A Longitudinal Study of Philosophers' Evolving Perceptions
This study investigates the evolving attitudes of philosophy scholars towards the participation of generative-AI-based Intelligent User Interfaces (IUIs) in philosophical discourse. We conducted a three-year (2023--2025) mixed-methods longitudinal study with 16 philosophy scholars and students. Qualitative data from annual interviews reveal a three-stage evolution in attitude: from initial resistance and unfamiliarity, to instrumental acceptance of the IUI as a tool, and finally to a deep, principled questioning of the IUI's fundamental capacity for genuine philosophical thought. Quantitative data from blind assessments, where participants rated anonymized philosophical answers from both humans and an IUI, complement these findings. While participants acknowledged the IUI's proficiency in tasks requiring formal logic and knowledge reproduction, they consistently identified significant shortcomings in areas demanding dialectical reasoning, originality and embodied understanding. The study concludes that participants do not see the IUI as a peer but rather as a sophisticated mirror whose capabilities and limitations provoke a deeper reflection on the unique and irreplaceable human dimensions of philosophical inquiry, such as intuition, value-laden commitment and the courage to question fundamental premises.
Robust In-the-Wild Exercise Recognition from a Single Wearable: Data-Side Fusion, Sensor Rotation, and Feature Engineering
Monitoring physical exercises is vital for health promotion, with automated systems becoming standard in personal health surveillance. However, sensor placement variability and unconstrained movements limit their effectiveness. This study presents team "3KA"'s single-sensor workout activity recognition method, based on feature extraction and data augmentation, for the 2nd WEAR Dataset Challenge. From raw acceleration, angle and signal magnitude vector features were derived, followed by extraction of statistical, fractal/spectral, and higher-order differential features. A fused dataset combining left/right limb data was created and augmented via sensor rotation and axis inversion. We utilized a soft voting model combining Hist Gradient Boosting with balanced weights and Extreme Gradient Boosting without balanced weights. Under group 5-fold evaluation, the model achieved 58.83% macro F1 overall (61.72% arm, 55.95% leg). ANOVA F-scores showed that fractal/spectral features were most important for arm-based recognition but least important for leg-based recognition. The code to reproduce the experiments is publicly available at https://github.com/Khanghcmut/WEAR_3K
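Two ingredients named in the exercise-recognition abstract above, the signal magnitude vector and augmentation by sensor rotation and axis inversion, can be sketched on a single accelerometer sample. The helper names are hypothetical, and rotating about the z-axis by a fixed angle is only one assumed form of "sensor rotation"; the team's actual augmentation parameters are in their repository, not reproduced here.

```python
import math

def signal_magnitude(sample):
    """Euclidean norm of one (x, y, z) acceleration sample."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)

def rotate_about_z(sample, angle_rad):
    """Simulate a sensor mounted at a different angle by rotating about the z-axis."""
    x, y, z = sample
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)

def invert_axis(sample, axis):
    """Flip the sign of one axis (0 = x, 1 = y, 2 = z), e.g. to mimic a mirrored sensor."""
    flipped = list(sample)
    flipped[axis] = -flipped[axis]
    return tuple(flipped)

def augment_window(window, angle_rad, axis=None):
    """Apply the same rotation (and optional axis inversion) to every sample in a window."""
    rotated = [rotate_about_z(s, angle_rad) for s in window]
    return rotated if axis is None else [invert_axis(s, axis) for s in rotated]
```

Because the magnitude is unchanged by rotation and sign flips, magnitude-based features stay stable under these augmentations while per-axis features vary, which is the property such augmentation relies on.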
Amplifiers or Equalizers? A Longitudinal Study of LLM Evolution in Software Engineering Project-Based Learning ICSE
As LLMs reshape software development, integrating LLM-augmented practices into SE education has become imperative. While existing studies explore LLMs' educational use in introductory programming or isolated SE tasks, their impact in more open-ended Project-Based Learning (PBL) remains unexplored. This paper introduces a two-year longitudinal study comparing a 2024 (using early free LLMs, $n$=48) and 2025 (using the latest paid LLMs, $n$=46) cohort. Our findings suggest the latest powerful LLMs' dual role: they act as "equalizers," boosting average performance even for programming-weak students, providing opportunities for more authentic SE practices; yet also as "amplifiers," dramatically widening absolute performance gaps, creating new pedagogical challenges for addressing educational inequities.
comment: Accepted by ICSE-SEET (ACM/IEEE 48th International Conference on Software Engineering: Software Engineering Education and Training)
Body Management Information Practices on a Female-dominant Platform
With growing awareness of long-term health and wellness, everyday body management has become a widespread practice. Social media platforms and health-related applications offer abundant information for those pursuing healthier lifestyles and more positive body images. While prior Human-Computer Interaction research has focused extensively on technology-mediated health interventions, the user-initiated practices of browsing and evaluating body management information remain underexplored. In this paper, we study a female-dominant social media platform in China to examine how users seek such information and how it shapes their lifestyle choices. Through semi-structured interviews with 18 users, we identify factors including consumerism, poster popularity, and perceived authenticity that influence decision-making, alongside challenges such as discerning reliable methods and managing body anxiety triggered by social media. We contribute insights into how content and media formats interact to shape users' information evaluation, and we outline design implications for supporting more reliable and healthy engagements with body management information.
comment: Preprint accepted at ICHEC 2025
Safe Autonomous Lane Changing: Planning with Dynamic Risk Fields and Time-Varying Convex Space Generation
This paper presents a novel trajectory planning pipeline for complex driving scenarios like autonomous lane changing, by integrating risk-aware planning with guaranteed collision avoidance into a unified optimization framework. We first construct a dynamic risk field (DRF) that captures both the static and dynamic collision risks from surrounding vehicles. Then, we develop a rigorous strategy for generating time-varying convex feasible spaces that ensure kinematic feasibility and safety requirements. The trajectory planning problem is formulated as a finite-horizon optimal control problem and solved using a constrained iterative Linear Quadratic Regulator (iLQR) algorithm that jointly optimizes trajectory smoothness, control effort, and risk exposure while maintaining strict feasibility. Extensive simulations demonstrate that our method outperforms traditional approaches in terms of safety and efficiency, achieving collision-free trajectories with shorter lane-changing distances (28.59 m) and times (2.84 s) while maintaining smooth and comfortable acceleration patterns. In dense roundabout environments the planner further demonstrates robust adaptability, producing larger safety margins, lower jerk, and superior curvature smoothness compared with APF-, MPC-, and RRT-based baselines. These results confirm that the integrated DRF with convex feasible space and constrained iLQR solver provides a balanced solution for safe, efficient, and comfortable trajectory generation in dynamic and interactive traffic scenarios.
Responsible LLM Deployment for High-Stake Decisions by Decentralized Technologies and Human-AI Interactions
High-stakes decision domains are increasingly exploring the potential of Large Language Models (LLMs) for complex decision-making tasks. However, LLM deployment in real-world settings presents challenges in data security, evaluation of its capabilities outside controlled environments, and accountability attribution in the event of adversarial decisions. This paper proposes a framework for responsible deployment of LLM-based decision-support systems through active human involvement. It integrates interactive collaboration between human experts and developers through multiple iterations at the pre-deployment stage to assess the uncertain samples and judge the stability of the explanation provided by post-hoc XAI techniques. Local LLM deployment within organizations and decentralized technologies, such as Blockchain and IPFS, are proposed to create immutable records of LLM activities for automated auditing to enhance security and trace back accountability. It was tested on Bert-large-uncased, Mistral, and LLaMA 2 and 3 models to assess the capability to support responsible financial decisions on business lending.
comment: IEEE International Conference on Human-Machine Systems, 2025
Rethinking AI Evaluation in Education: The TEACH-AI Framework and Benchmark for Generative AI Assistants NeurIPS 2025
As generative artificial intelligence (AI) continues to transform education, most existing AI evaluations rely primarily on technical performance metrics such as accuracy or task efficiency while overlooking human identity, learner agency, contextual learning processes, and ethical considerations. In this paper, we present TEACH-AI (Trustworthy and Effective AI Classroom Heuristics), a domain-independent, pedagogically grounded, and stakeholder-aligned framework with measurable indicators and a practical toolkit for guiding the design, development, and evaluation of generative AI systems in educational contexts. Built on an extensive literature review and synthesis, the ten-component assessment framework and toolkit checklist provide a foundation for scalable, value-aligned AI evaluation in education. TEACH-AI rethinks "evaluation" through sociotechnical, educational, theoretical, and applied lenses, engaging designers, developers, researchers, and policymakers across AI and education. Our work invites the community to reconsider what constructs "effective" AI in education and to design model evaluation approaches that promote co-creation, inclusivity, and long-term human, social, and educational impact.
comment: 6 pages, NeurIPS 2025 Responsible Foundation Models Workshop
LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents
Access to justice remains a global challenge, with many citizens still finding it difficult to seek help from the justice system when facing legal issues. Although the internet provides abundant legal information and services, navigating complex websites, understanding legal terminology, and filling out procedural forms continue to pose barriers to accessing justice. This paper introduces the LegalWebAgent framework that employs a web agent powered by multimodal large language models to bridge the gap in access to justice for ordinary citizens. The framework combines the natural language understanding capabilities of large language models with multimodal perception, enabling a complete process from user query to concrete action. It operates in three stages: the Ask Module understands user needs through natural language processing; the Browse Module autonomously navigates webpages, interacts with page elements (including forms and calendars), and extracts information from HTML structures and webpage screenshots; the Act Module synthesizes information for users or performs direct actions like form completion and schedule booking. To evaluate its effectiveness, we designed a benchmark test covering 15 real-world tasks, simulating typical legal service processes relevant to Québec civil law users, from problem identification to procedural operations. Evaluation results show LegalWebAgent achieved a peak success rate of 86.7%, with an average of 84.4% across all tested models, demonstrating high autonomy in complex real-world scenarios.
Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice
Bangladesh's low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. Even under a conservative upper bound, Mina operates at just 0.12-0.61% of typical legal consultation costs in Bangladesh, yielding a 99.4-99.9% cost reduction relative to human-provided services. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.
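The reported cost reduction follows directly from the relative cost figures. The short check below uses only the two percentages stated in the abstract; the function name is illustrative.

```python
def cost_reduction_percent(relative_cost_percent):
    """Reduction implied by operating at a given percentage of the baseline cost."""
    return 100.0 - relative_cost_percent

# 0.61% of baseline cost is the least favorable case, 0.12% the most favorable.
low = cost_reduction_percent(0.61)
high = cost_reduction_percent(0.12)
reported_range = (round(low, 1), round(high, 1))
```

Rounded to one decimal place, the two bounds match the 99.4-99.9% range quoted above.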
Colin: A Multimodal Human-AI Co-Creation Storytelling System To Support Children's Multi-Level Narrative Skills
Children develop narrative skills by understanding and actively building connections between elements, image-text matching, and consequences. However, it is challenging for children to clearly grasp these multi-level links only through explanations of text or the facilitator's speech. To address this, we developed Colin, an interactive storytelling tool that supports children's multi-level narrative skills through both voice and visual modalities. In the generation stage, Colin supports the facilitator in freely defining and reviewing the generated text and image content. In the understanding stage, a question-feedback model helps children understand multi-level connections while co-creating stories with Colin. In the building phase, Colin actively encourages children to create connections between elements through drawing and speaking. A user study with 20 participants evaluated Colin by measuring children's engagement, understanding of cause-and-effect relationships, and the quality of their new story creations. Our results demonstrate that Colin significantly enhances the development of children's narrative skills across multiple levels.
Biased by Design: Leveraging AI Biases to Enhance Critical Thinking of News Readers
This paper explores the design of a propaganda detection tool using Large Language Models (LLMs). Acknowledging the inherent biases in AI models, especially in political contexts, we investigate how these biases might be leveraged to enhance critical thinking in news consumption. Countering the typical view of AI biases as detrimental, our research proposes strategies of user choice and personalization in response to a user's political stance, applying psychological concepts of confirmation bias and cognitive dissonance. We present findings from a qualitative user study, offering insights and design recommendations (bias awareness, personalization and choice, and gradual introduction of diverse perspectives) for AI tools in propaganda detection.
comment: European Conference on Information Systems (ECIS)
Reranking partisan animosity in algorithmic social media feeds alters affective polarization
Today, social media platforms hold sole power to study the effects of feed ranking algorithms. We developed a platform-independent method that reranks participants' feeds in real-time and used this method to conduct a preregistered 10-day field experiment with 1,256 participants on X during the 2024 U.S. presidential campaign. Our experiment used a large language model to rerank posts that expressed antidemocratic attitudes and partisan animosity (AAPA). Decreasing or increasing AAPA exposure shifted out-party partisan animosity by two points on a 100-point feeling thermometer, with no detectable differences across party lines, providing causal evidence that exposure to AAPA content alters affective polarization. This work establishes a method to study feed algorithms without requiring platform cooperation, enabling independent evaluation of ranking interventions in naturalistic settings.
Compression of executable QR codes or sQRy for Industry: an example for Wi-Fi access points
Executable QR codes, or sQRy, are a technology dating from 2022 that allows a runnable program to be embedded inside a QR code, enabling interaction with the user even in the absence of an Internet connection. sQRy are enablers for different practical applications, including network equipment configuration, diagnostics, and enhanced smart manuals in industrial contexts. Many other non-industry-related fields can also benefit from this technology. Regardless of where sQRy are used, text strings are among the most commonly embedded data. However, due to strict limitations on the available payload, the space occupied by strings limits the length of the programs that can be embedded. In this work, we propose a simple yet effective strategy that can reduce the space taken by strings, hence broadening sQRy applicability.
comment: preprint accepted, 4 pages, 2025
A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond
Decoding neural activity into human-interpretable representations is a key research direction in brain-computer interfaces (BCIs) and computational neuroscience. Recent progress in machine learning and generative AI has driven growing interest in transforming non-invasive Electroencephalography (EEG) signals into images, text, and audio. This survey consolidates and analyzes developments across EEG-to-image synthesis, EEG-to-text generation, and EEG-to-audio reconstruction. We conducted a structured literature search across major databases (2017-2025), extracting key information on datasets, generative architectures (GANs, VAEs, transformers, diffusion models), EEG feature-encoding techniques, evaluation metrics, and the major challenges shaping current work in this area. Our review finds that EEG-to-image models predominantly employ encoder-decoder architectures built on GANs, VAEs, or diffusion models; EEG-to-text approaches increasingly leverage transformer-based language models for open-vocabulary decoding; and EEG-to-audio methods commonly map EEG signals to mel-spectrograms that are subsequently rendered into audio using neural vocoders. Despite promising advances, the field remains constrained by small and heterogeneous datasets, limited cross-subject generalization, and the absence of standardized benchmarks. By consolidating methodological trends and available datasets, this survey provides a foundational reference for advancing EEG-based generative AI and supporting reproducible research. We further highlight open-source datasets and baseline implementations to facilitate systematic benchmarking and accelerate progress in EEG-driven neural decoding.
Image and Video Processing
A lasso-alternative to Dijkstra's algorithm for identifying short paths in networks
We revisit the problem of finding the shortest path between two selected vertices of a graph and formulate this as an $\ell_1$-regularized regression -- Least Absolute Shrinkage and Selection Operator (lasso). We draw connections between a numerical implementation of this lasso-formulation, using the so-called LARS algorithm, and a more established algorithm known as the bi-directional Dijkstra. Appealing features of our formulation include the applicability of the Alternating Direction Method of Multipliers (ADMM) to the problem of identifying short paths, and a relatively efficient update in response to topological changes.
comment: 25 pages, 7 figures
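For reference, the sketch below implements the classical unidirectional Dijkstra algorithm, which is the baseline that both the bi-directional variant and the lasso formulation are measured against. It is not the paper's method. The adjacency-list layout and the toy graph are assumptions made for illustration.

```python
import heapq

def dijkstra_shortest_path(graph, source, target):
    """Return (distance, path) from source to target.

    graph maps each node to a list of (neighbor, non-negative weight) pairs.
    Returns (inf, []) when the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale heap entry
        visited.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

# Toy graph (hypothetical): the direct edges a->c and b->d are more expensive
# than the detour through intermediate nodes.
toy_graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 5.0)],
    "c": [("d", 1.0)],
    "d": [],
}
distance, path = dijkstra_shortest_path(toy_graph, "a", "d")
```

A bi-directional variant runs two such searches, one from each endpoint, and stops when the frontiers meet.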
Hard Spatial Gating for Precision-Driven Brain Metastasis Segmentation: Addressing the Over-Segmentation Paradox in Deep Attention Networks
Brain metastasis segmentation in MRI remains a formidable challenge due to diminutive lesion sizes (5-15 mm) and extreme class imbalance (less than 2% tumor volume). While soft-attention CNNs are widely used, we identify a critical failure mode termed the "over-segmentation paradox," where models achieve high sensitivity (recall > 0.88) but suffer from catastrophic precision collapse (precision < 0.23) and boundary errors exceeding 150 mm. This imprecision poses significant risks for stereotactic radiosurgery planning. To address this, we introduce the Spatial Gating Network (SG-Net), a precision-first architecture employing hard spatial gating mechanisms. Unlike traditional soft attention, SG-Net enforces strict feature selection to aggressively suppress background artifacts while preserving tumor features. Validated on the Brain-Mets-Lung-MRI dataset (n=92), SG-Net achieves a Dice Similarity Coefficient of 0.5578 +/- 0.0243 (95% CI: 0.45-0.67), statistically outperforming Attention U-Net (p < 0.001) and ResU-Net (p < 0.001). Most critically, SG-Net demonstrates a threefold improvement in boundary precision, achieving a 95% Hausdorff Distance of 56.13 mm compared to 157.52 mm for Attention U-Net, while maintaining robust recall (0.79) and superior precision (0.52 vs. 0.20). Furthermore, SG-Net requires only 0.67M parameters (8.8x fewer than Attention U-Net), facilitating deployment in resource-constrained environments. These findings establish hard spatial gating as a robust solution for precision-driven lesion detection, directly enhancing radiosurgery accuracy.
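The over-segmentation pattern described above (high recall, low precision) can be made concrete with a small sketch of the overlap metrics. The flattened binary masks are hypothetical toy data, not values from the paper.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives for binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def dice(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical flattened masks: 2 tumor voxels out of 10.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# An over-segmenting prediction covers the tumor plus 6 background voxels.
over_segmented = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

tp, fp, fn = confusion_counts(over_segmented, truth)
p, r, d = precision(tp, fp), recall(tp, fn), dice(tp, fp, fn)
```

Here recall is perfect while precision is only 0.25, and the Dice score of 0.4 is pulled down by the false positives.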
Content Adaptive Encoding For Interactive Game Streaming
Video-on-demand streaming has benefitted from content-adaptive encoding (CAE), i.e., adaptation of resolution and/or quantization parameters for each scene based on convex hull optimization. However, CAE is very challenging to develop and deploy for interactive game streaming (IGS). Commercial IGS services impose ultra-low latency encoding with no lookahead or buffering, and have extremely tight compute constraints for any CAE algorithm execution. We propose the first CAE approach for resolution adaptation in IGS based on compact encoding metadata from past frames. Specifically, we train a convolutional neural network (CNN) to infer the best resolution from the options available for the upcoming scene based on a running window of aggregated coding block statistics from the current scene. By deploying the trained CNN within a practical IGS setup based on HEVC encoding, our proposal: (i) improves over the default fixed-resolution ladder of HEVC by 2.3 Bjøntegaard Delta-VMAF points; (ii) infers using 1 ms of a single CPU core per scene, thereby having no latency overhead.
comment: 5 pages
ColonAdapter: Geometry Estimation Through Foundation Model Adaptation for Colonoscopy
Estimating 3D geometry from monocular colonoscopy images is challenging due to non-Lambertian surfaces, moving light sources, and large textureless regions. While recent 3D geometric foundation models eliminate the need for multi-stage pipelines, their performance deteriorates in clinical scenes. These models are primarily trained on natural scene datasets and struggle with specularity and homogeneous textures typical in colonoscopy, leading to inaccurate geometry estimation. In this paper, we present ColonAdapter, a self-supervised fine-tuning framework that adapts geometric foundation models for colonoscopy geometry estimation. Our method leverages pretrained geometric priors while tailoring them to clinical data. To improve performance in low-texture regions and ensure scale consistency, we introduce a Detail Restoration Module (DRM) and a geometry consistency loss. Furthermore, a confidence-weighted photometric loss enhances training stability in clinical environments. Experiments on both synthetic and real datasets demonstrate that our approach achieves state-of-the-art performance in camera pose estimation, monocular depth prediction, and dense 3D point map reconstruction, without requiring ground-truth intrinsic parameters.
GACELLE: GPU-accelerated tools for model parameter estimation and image reconstruction
Quantitative MRI (qMRI) offers tissue-specific biomarkers that can be tracked over time or compared across populations; however, its adoption in clinical research is hindered by significant computational demands of parameter estimation. Images acquired at high spatial resolution or requiring fitting multiple parameters often require lengthy processing time, constraining their use in routine pipelines and slowing methodological innovation and clinical translation. We present GACELLE, an open source, GPU-accelerated framework for high-throughput qMRI analysis. GACELLE provides a stochastic gradient descent optimiser and a stochastic sampler in MATLAB, enabling fast parameter mapping, improved estimation robustness via spatial regularisation, and uncertainty quantification. GACELLE prioritises accessibility: users only need to provide a forward signal model, while GACELLE's backend manages computational parallelisation, automatic parameter updates, and memory-batching. The stochastic solver performs fully vectorised Markov chain Monte Carlo with identical likelihood on CPU and GPU, ensuring reproducibility across hardware. Benchmarking demonstrates up to 451-fold acceleration for the stochastic gradient descent solver and 14,380-fold acceleration for stochastic sampling compared to CPU-based estimation, without compromising accuracy. We demonstrated GACELLE's versatility on three representative qMRI models and on an image reconstruction task. Across these applications, GACELLE improves parameter precision, enhances test-retest reproducibility, and reduces noise in quantitative maps. By combining speed, usability and flexibility, GACELLE provides a generalisable optimisation framework for medical image analysis. It lowers the computational barrier for qMRI, paving the way for reproducible biomarker development, large-scale imaging studies, and clinical translation.
AutoRec: Accelerating Loss Recovery for Live Streaming in a Multi-Supplier Market
Due to the limited permissions for upgrading dual-side (i.e., server-side and client-side) loss tolerance schemes from the perspective of CDN vendors in a multi-supplier market, modern large-scale live streaming services are still using the automatic-repeat-request (ARQ) based paradigm for loss recovery, which only requires server-side modifications. In this paper, we first conduct a large-scale measurement study with up to 50 million live streams. We find that loss is dynamic and that live streaming exhibits frequent on-off mode switching in the wild. We further find that the recovery latency, enlarged by the ubiquitous retransmission loss, is a critical factor affecting live streaming's client-side QoE (e.g., video freezing). We then propose an enhanced recovery mechanism called AutoRec, which can transform the disadvantages of on-off mode switching into an advantage for reducing loss recovery latency without any modifications on the client side. AutoRec allows users to customize overhead tolerance and recovery latency tolerance and adaptively adjusts strategies as the network environment changes to ensure that recovery latency meets user demands whenever possible while keeping overhead under control. We implement AutoRec upon QUIC and evaluate it via testbed and real-world commercial services deployments. The experimental results demonstrate the practicability and profitability of AutoRec.
When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks
Large vision foundation models have been widely adopted for retinal disease classification without systematic evidence justifying their parameter requirements. In the present work we address two critical questions: First, are large domain-specific foundation models essential, or do compact general-purpose architectures suffice? Second, does specialized retinal pretraining justify its computational cost? To answer this, we benchmark initialization strategies across four retinal imaging classification tasks spanning Optical Coherence Tomography (OCT) and Color Fundus Photography (CFP) modalities: 8-class OCT classification, 3-class diabetic macular edema (DME), 5-class diabetic retinopathy (DR), and 3-class glaucoma (GL) detection. We evaluate 12-13 model configurations per task, including vision transformers (22.8M-86.6M parameters), Swin Transformers (27.6M-28.3M), ConvNeXt (28.6M), and the domain-specific RETFound models (303M), under identical training conditions. Our results challenge prevailing assumptions: First, we demonstrate that pretraining provides universal benefits (5.18-18.41% improvement), scaling with task difficulty. Second, compact architectures (27-29M) dominate Pareto frontiers; SwinV2-tiny achieves top-1 performance on three datasets. Third, RETFound (303M) justifies its computational cost only for challenging DR grading (accuracy of 71.15%), while ImageNet pretraining proves to be sufficient for all other tasks (DME accuracy: 99.24%, OCT accuracy: 97.96%). CFP tasks show larger pretraining accuracy gains (9.13-18.41%) than OCT (5.18%). Thus, the evidence suggests that compact general-purpose models deliver near-optimal performance for most retinal classification tasks; specialized foundation models are warranted only for fine-grained discrimination under extreme class imbalance.
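The Pareto-frontier comparison mentioned above can be sketched as follows: a model stays on the frontier if no other model is at least as small and at least as accurate, and strictly better in one of the two. The model names and accuracy values below are hypothetical placeholders, not results from the paper.

```python
def pareto_front(models):
    """Return names of models not dominated in (fewer parameters, higher accuracy).

    models is a list of (name, params_millions, accuracy) tuples.
    """
    front = []
    for name, params, acc in models:
        dominated = any(
            other_params <= params
            and other_acc >= acc
            and (other_params < params or other_acc > acc)
            for _, other_params, other_acc in models
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical candidates: (name, parameters in millions, accuracy).
candidates = [
    ("compact-a", 28.0, 0.95),
    ("compact-b", 28.6, 0.93),
    ("mid-size", 86.6, 0.94),
    ("large-domain", 303.0, 0.96),
]
front = pareto_front(candidates)
```

In this toy example the large model remains on the frontier only because it is the most accurate; whether the small gain justifies roughly ten times the parameters is the cost question the abstract raises.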
TinyViT: Field Deployable Transformer Pipeline for Solar Panel Surface Fault and Severity Screening
Sustained operation of solar photovoltaic assets hinges on accurate detection and prioritization of surface faults across vast, geographically distributed modules. While multimodal imaging strategies are popular, they introduce logistical and economic barriers for routine farm-level deployment. This work demonstrates that deep learning and classical machine learning may be judiciously combined to achieve robust surface anomaly categorization and severity estimation from planar visible-band imagery alone. We introduce TinyViT, a compact pipeline integrating Transformer-based segmentation, spectral-spatial feature engineering, and ensemble regression. The system ingests consumer-grade color camera mosaics of PV panels, classifies seven nuanced surface faults, and generates actionable severity grades for maintenance triage. By eliminating reliance on electroluminescence or IR sensors, our method enables affordable, scalable upkeep for resource-limited installations, and advances the state of solar health monitoring toward universal field accessibility. Experiments on real-world public datasets validate both classification and regression sub-modules, achieving accuracy and interpretability competitive with specialized approaches.
comment: 3 pages, 2 figures, ICGVIP 2025
Linearly Constrained Diffusion Implicit Models NeurIPS 2025
We introduce Linearly Constrained Diffusion Implicit Models (CDIM), a fast and accurate approach to solving noisy linear inverse problems using diffusion models. Traditional diffusion-based inverse methods rely on numerous projection steps to enforce measurement consistency in addition to unconditional denoising steps. CDIM achieves a 10-50x reduction in projection steps by dynamically adjusting the number and size of projection steps to align a residual measurement energy with its theoretical distribution under the forward diffusion process. This adaptive alignment preserves measurement consistency while substantially accelerating constrained inference. For noise-free linear inverse problems, CDIM exactly satisfies the measurement constraints with few projection steps, even when existing methods fail. We demonstrate CDIM's effectiveness across a range of applications, including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reprojection. Code and an interactive demo can be found on our project website.
comment: NeurIPS 2025
Total Least Square Optimal Analytic Signal by Structure Tensor for N-D images
We produce the analytic signal by using the Structure Tensor, which provides Total Least Squares optimal vectors for estimating orientation and scale locally. Together, these vectors represent N-D frequency components that determine adaptive, complex probing filters. The N-D analytic signal is obtained through scalar products of adaptive filters with image neighborhoods. It comprises orientation, scale, phase, and amplitude information of the neighborhood. The ST analytic signal $f_A$ is continuous and isotropic, and its extension to N-D is straightforward. The phase gradient can be represented as a vector (instantaneous frequency) or as a tensor. Both are continuous and isotropic, while the tensor additionally preserves continuity of orientation and retains the same information as the vector representation. The tensor representation can also be used to detect singularities. Detection with known phase portraits has been demonstrated in 2-D with relevance to fringe pattern processing in wave physics, including optics and fingerprint measurements. To construct adaptive filters we have used Gabor filter family members as probing functions, but other function families can also be used to sample the spectrum, e.g., quadrature filters. A comparison to three baseline alternatives, in representation (Monogenic signal), enhancement (Monogenic signal combined with a spline-wavelet pyramid), and singularity detection (mindtct, a fingerprint minutia detector widely used in numerous studies), is also reported using images with precisely known ground truths for location, orientation, singularity type (where applicable), and wave period.
comment: Changed title, included new experimental results. Reorganized so that details are in Supplementary Material part, which is new
Fast Gradient Methods for Data-Consistent Local Super-Resolution of Medical Images
In this work, we propose a new paradigm of iterative model-based reconstruction algorithms that provide real-time solutions for zooming in on and refining a region of interest in medical and clinical tomographic images. This algorithmic framework is tailored to a clinical need in medical imaging practice: after reconstructing the full tomographic image, the clinician may find that some critical parts of the image are not clear enough and may wish to see these regions of interest more clearly. A naive approach (which is strongly discouraged) would be to perform a global reconstruction of a higher-resolution image, which has two major limitations: first, it is computationally inefficient, and second, the image regularization is still applied globally, which may over-smooth some local regions. Furthermore, fine-tuning the regularization parameter for local parts would be computationally infeasible in practice with global reconstruction. Our new iterative approaches for such tasks are based on jointly utilizing the measurement information, efficient up-sampling/down-sampling across image spaces, and a locally adjusted image prior for efficient and high-quality post-processing. The numerical results on local zoom-in for low-dose X-ray CT images demonstrate the effectiveness of our approach.
Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers
In this work, we propose Fast Equivariant Imaging (FEI), a novel unsupervised learning framework to rapidly and efficiently train deep imaging networks without ground-truth data. From the perspective of reformulating the Equivariant Imaging based optimization problem via the method of Lagrange multipliers and utilizing plug-and-play denoisers, this novel unsupervised scheme shows superior efficiency and performance compared to the vanilla Equivariant Imaging paradigm. In particular, our FEI schemes achieve an order-of-magnitude (10x) acceleration over standard EI on training U-Net for X-ray CT reconstruction and image inpainting, with improved generalization performance.
comment: 17 pages
Entropy Coding for Non-Rectangular Transform Blocks using Partitioned DCT Dictionaries for AV1
Recent video codecs such as VVC and AV1 apply a Non-rectangular (NR) partitioning to combine prediction signals using a smooth blending around the boundary, followed by a rectangular transform on the whole block. The NR signal transformation is not yet supported. A transformation technique that applies the same partitioning to the 2D Discrete Cosine Transform (DCT) bases and finds a sparse representation of the NR signal in such a dictionary showed promising gains in an experimental setup outside the reference software. This method uses the regular inverse transformation at the decoder to reconstruct a rectangular signal and discards the signal outside the region of interest. This design is appealing due to the minimal changes required at the decoder. However, current entropy coding schemes are not well-suited for optimally encoding these coefficients because they are primarily designed for DCT coefficients. This work introduces an entropy coding method that efficiently codes these transform coefficients by effectively modeling their properties. The design offers significant theoretical rate savings, estimated using conditional entropy, particularly for scenarios that are more dissimilar to DCT in an experimental setup.
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.
comment: Preprint
Training-Free Adaptive Quantization for Variable Rate Image Coding for Machines
Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision technology into real-world applications. However, most neural network-based ICM frameworks operate at a fixed rate, thus requiring individual training for each target bitrate. This limitation may restrict their practical usage. Existing variable rate image compression approaches mitigate this issue but often rely on additional training, which increases computational costs and complicates deployment. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free quantization strength control scheme that enables flexible bitrate adjustment. By exploiting the scale parameter predicted by the hyperprior network, the proposed method adaptively modulates quantization step sizes across both channel and spatial dimensions. This allows the model to preserve semantically important regions while coarsely quantizing less critical areas. Our architectural design further enables continuous bitrate control through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate baseline.
Multimedia
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.
comment: Project page with codes/datasets/models: https://follen-cry.github.io/MLLM-Cognition-project-page/
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
A Progressive Evaluation Framework for Multicultural Analysis of Story Visualization
Recent advancements in text-to-image generative models have improved narrative consistency in story visualization. However, current story visualization models often overlook cultural dimensions, resulting in visuals that lack authenticity and cultural fidelity. In this study, we conduct a comprehensive multicultural analysis of story visualization using current text-to-image models across multilingual settings on two datasets: FlintstonesSV and VIST. To assess cultural dimensions rigorously, we propose a Progressive Multicultural Evaluation Framework and introduce five story visualization metrics, Cultural Appropriateness, Visual Aesthetics, Cohesion, Semantic Consistency, and Object Presence, that are not addressed by existing metrics. We further automate assessment through an MLLM-as-Jury framework that approximates human judgment. Human evaluations show that models generate more coherent, visually appealing, and culturally appropriate stories for real-world datasets than for animated ones. The generated stories exhibit a stronger alignment with English-speaking cultures across all metrics except Cohesion, where Chinese performs better. In contrast, Hindi ranks lowest on all metrics except Visual Aesthetics, reflecting real-world cultural biases embedded in current models. This multicultural analysis provides a foundation for future research aimed at generating culturally appropriate and inclusive visual stories across diverse linguistic and cultural settings.
Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation
Multimodal Emotion Recognition in Conversation (MERC) significantly enhances emotion recognition performance by integrating complementary emotional cues from text, audio, and visual modalities. While existing methods commonly utilize techniques such as contrastive learning and cross-attention mechanisms to align cross-modal emotional semantics, they typically overlook modality-specific emotional nuances like micro-expressions, tone variations, and sarcastic language. To overcome these limitations, we propose Orthogonal Disentanglement with Projected Feature Alignment (OD-PFA), a novel framework designed explicitly to capture both shared semantics and modality-specific emotional cues. Our approach first decouples unimodal features into shared and modality-specific components. An orthogonal disentanglement strategy (OD) enforces effective separation between these components, aided by a reconstruction loss to maintain critical emotional information from each modality. Additionally, a projected feature alignment strategy (PFA) maps shared features across modalities into a common latent space and applies a cross-modal consistency alignment loss to enhance semantic coherence. Extensive evaluations on the widely used benchmark datasets IEMOCAP and MELD demonstrate the effectiveness of the proposed OD-PFA on multimodal emotion recognition tasks compared with state-of-the-art approaches.
comment: 10 pages, 1 figure
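The abstract above does not spell out the orthogonal disentanglement loss. As an illustration only, one common way to penalize overlap between shared and modality-specific features is the mean squared cosine similarity between the two components of each sample. The function name, the NumPy implementation, and the choice of cosine similarity are assumptions made for this sketch, not the authors' formulation.

```python
import numpy as np


def orthogonality_penalty(shared, specific, eps=1e-8):
    """Mean squared cosine similarity between paired feature rows.

    Hypothetical helper: `shared` and `specific` are (batch, dim) arrays
    holding the shared and modality-specific parts of the same samples.
    The value is close to 0 when each pair is orthogonal and close to 1
    when each pair is parallel.
    """
    shared = np.asarray(shared, dtype=float)
    specific = np.asarray(specific, dtype=float)
    if shared.shape != specific.shape:
        raise ValueError("shared and specific must have the same shape")
    # Normalize each row so the dot product is a cosine similarity.
    s = shared / (np.linalg.norm(shared, axis=1, keepdims=True) + eps)
    p = specific / (np.linalg.norm(specific, axis=1, keepdims=True) + eps)
    cos = np.sum(s * p, axis=1)
    return float(np.mean(cos ** 2))
```

In a training loop, a term like this would typically be added to the task loss with a small weight, next to the reconstruction loss the abstract mentions.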
Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation
Multimodal Emotion Recognition in Conversation (MERC) aims to enhance emotion understanding by integrating complementary cues from text, audio, and visual modalities. Existing MERC approaches predominantly focus on cross-modal shared features, often overlooking modality-specific features that capture subtle yet critical emotional cues such as micro-expressions, prosodic variations, and sarcasm. Although related work in multimodal emotion recognition (MER) has explored disentangling shared and modality-specific features, these methods typically employ rigid orthogonal constraints to achieve full disentanglement, which neglects the inherent complementarity between feature types and may limit recognition performance. To address these challenges, we propose Angle-Optimized Feature Learning (AO-FL), a framework tailored for MERC that achieves partial disentanglement of shared and specific features within each modality through adaptive angular optimization. Specifically, AO-FL aligns shared features across modalities to ensure semantic consistency, and within each modality it adaptively models the angular relationship between its shared and modality-specific features to preserve both distinctiveness and complementarity. An orthogonal projection refinement further removes redundancy in specific features and enriches shared features with contextual information, yielding more discriminative multimodal representations. Extensive experiments confirm the effectiveness of AO-FL for MERC, demonstrating superior performance over state-of-the-art approaches. Moreover, AO-FL can be seamlessly integrated with various unimodal feature extractors and extended to other multimodal fusion tasks, such as MER, thereby highlighting its strong generalization beyond MERC.
comment: 10 pages, 7 figures
VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
The task of Visual Text-to-Speech (VisualTTS), also known as video dubbing, aims to generate speech synchronized with the lip movements in an input video, in addition to being consistent with the content of the input text and cloning the timbre of a reference speech. Existing VisualTTS models typically adopt lightweight architectures and design specialized modules to achieve each of these goals, yet the speech quality remains unsatisfactory due to limited model capacity and the limited data available for VisualTTS. Recently, speech large language models (SpeechLLMs) have shown a robust ability to generate high-quality speech, but little work has been done to leverage temporal cues from video input for generating lip-synchronized speech. To generate both high-quality and lip-synchronized speech in VisualTTS tasks, we propose a novel Visual Speech Language Model called VSpeechLM, built upon a SpeechLLM. To capture the synchronization relationship between text and video, we propose a text-video aligner. It first learns fine-grained alignment between phonemes and lip movements, and then outputs an expanded phoneme sequence containing lip-synchronization cues. Next, our proposed SpeechLLM-based decoders take the expanded phoneme sequence as input and learn to generate lip-synchronized speech. Extensive experiments demonstrate that our VSpeechLM significantly outperforms previous VisualTTS methods in terms of overall quality, speaker similarity, and synchronization metrics.
comment: MM Asia 2025
OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists' decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model's capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmarks, and models will be made publicly available.
comment: 47 pages, 42 figures, 13 tables
AutoRec: Accelerating Loss Recovery for Live Streaming in a Multi-Supplier Market
Because CDN vendors in a multi-supplier market have limited permission to upgrade dual-side (i.e., server-side and client-side) loss tolerance schemes, modern large-scale live streaming services still use the automatic-repeat-request (ARQ) based paradigm for loss recovery, which only requires server-side modifications. In this paper, we first conduct a large-scale measurement study with up to 50 million live streams. We find that loss is dynamic and that live streaming frequently switches between on and off modes in the wild. We further find that the recovery latency, enlarged by the ubiquitous retransmission loss, is a critical factor affecting live streaming's client-side QoE (e.g., video freezing). We then propose an enhanced recovery mechanism called AutoRec, which can transform the disadvantages of on-off mode switching into an advantage for reducing loss recovery latency without any modifications on the client side. AutoRec allows users to customize overhead tolerance and recovery latency tolerance and adaptively adjusts strategies as the network environment changes to ensure that recovery latency meets user demands whenever possible while keeping overhead under control. We implement AutoRec upon QUIC and evaluate it via testbed and real-world commercial services deployments. The experimental results demonstrate the practicability and profitability of AutoRec.
Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Frechet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
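The abstract above mentions a frequency-weighted L1 loss on Mel-spectrograms but gives no weighting scheme. A minimal sketch follows, assuming a linear ramp that weights the highest Mel bin `max_weight` times as much as the lowest. The function name, the ramp, and the default `max_weight` are illustrative assumptions, not details from the paper.

```python
import numpy as np


def frequency_weighted_l1(pred, target, max_weight=2.0):
    """L1 loss over (n_mels, n_frames) spectrograms with per-bin weights.

    Assumed weighting: weights rise linearly from 1.0 at the lowest Mel
    bin to `max_weight` at the highest, so high-frequency errors cost more.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape or pred.ndim != 2:
        raise ValueError("expected two (n_mels, n_frames) arrays of equal shape")
    n_mels = pred.shape[0]
    # Column vector of weights, broadcast across all time frames.
    weights = np.linspace(1.0, max_weight, n_mels)[:, None]
    return float(np.mean(weights * np.abs(pred - target)))
```

With `max_weight=1.0` this reduces to the plain mean absolute error, which makes it easy to compare against an unweighted baseline.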
MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at every transformer layer with a parallel, lightweight scheme that extracts and fuses layer-wise tokens only from the late layers. We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in a layer-wise manner. A token fusion module then dynamically fuses these layer-wise tokens by taking into account their relative significance. To prevent the redundancy of latent tokens, we apply an orthogonality regularization between latent tokens during training. Through the systematic analysis of the position of adaptation in the pre-trained transformers, we extract latent tokens only from the late layers of the transformers. This strategic adaptation approach avoids error propagation from the volatile early-layer features, thereby maximizing the adaptation performance while maintaining parameter and memory efficiency. Through extensive experiments, we demonstrate that MoLT outperforms existing methods on diverse audio-visual benchmarks, including Audio-Visual Question Answering, Audio-Visual Segmentation, and Audio-Visual Event Localization.
comment: 10 pages, 5 figures
Scaling Spatial Intelligence with Multimodal Foundation Models
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
comment: Codebase: https://github.com/OpenSenseNova/SenseNova-SI; Models: https://huggingface.co/collections/sensenova/sensenova-si
Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase that provides a one-stop and reproducible solution with standardized interfaces, integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.
comment: Codebase: https://github.com/EvolvingLMMs-Lab/EASI/; Leaderboard: https://huggingface.co/spaces/lmms-lab-si/EASI-Leaderboard
RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis AAAI 2026
We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision
comment: Accepted to AAAI 2026 (Oral)
Computation and Language
Intelligent Neural Networks: From Layered Architectures to Graph-Organized Intelligence
Biological neurons exhibit remarkable intelligence: they maintain internal states, communicate selectively with other neurons, and self-organize into complex graphs rather than rigid hierarchical layers. What if artificial intelligence could emerge from similarly intelligent computational units? We introduce Intelligent Neural Networks (INN), a paradigm shift where neurons are first-class entities with internal memory and learned communication patterns, organized in complete graphs rather than sequential layers. Each Intelligent Neuron combines selective state-space dynamics (knowing when to activate) with attention-based routing (knowing to whom to send signals), enabling emergent computation through graph-structured interactions. On the standard Text8 character modeling benchmark, INN achieves 1.705 Bit-Per-Character (BPC), significantly outperforming a comparable Transformer (2.055 BPC) and matching a highly optimized LSTM baseline. Crucially, a parameter-matched baseline of stacked Mamba blocks fails to converge (>3.4 BPC) under the same training protocol, demonstrating that INN's graph topology provides essential training stability. Ablation studies confirm this: removing inter-neuron communication degrades performance or leads to instability, proving the value of learned neural routing. This work demonstrates that neuron-centric design with graph organization is not merely bio-inspired -- it is computationally effective, opening new directions for modular, interpretable, and scalable neural architectures.
comment: Code available at https://github.com/AntoineSal/IntelligentNeuralNetwork
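For readers comparing the BPC figures above, the sketch below shows the standard conversion from a mean per-character cross-entropy in nats (what most training frameworks report) to bits per character, plus a helper for the relative gap between two scores. The function names are illustrative and are not taken from the linked repository.

```python
import math


def nats_to_bpc(loss_nats):
    """Convert mean per-character cross-entropy from nats to bits."""
    return loss_nats / math.log(2)


def relative_reduction(baseline_bpc, model_bpc):
    """Fraction of bits saved by the model relative to the baseline."""
    if baseline_bpc <= 0:
        raise ValueError("baseline_bpc must be positive")
    return (baseline_bpc - model_bpc) / baseline_bpc
```

Applied to the numbers quoted in the abstract, `relative_reduction(2.055, 1.705)` comes to roughly 0.17, i.e. about 17% fewer bits per character than the Transformer baseline.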
PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration AAAI 2026
Large Language Models (LLMs) demonstrate impressive capabilities in natural language understanding and generation, but incur high communication overhead and privacy risks in cloud deployments, while facing compute and memory constraints when confined to edge devices. Cloud-edge inference has emerged as a promising paradigm for improving privacy in LLM services by retaining sensitive computations on local devices. However, existing cloud-edge inference approaches apply uniform privacy protection without considering input sensitivity, resulting in unnecessary perturbation and degraded utility even for non-sensitive tokens. To address this limitation, we propose Privacy-aware Routing for Inference with Semantic Modulation (PRISM), a context-aware framework that dynamically balances privacy and inference quality. PRISM executes in four stages: (1) the edge device profiles entity-level sensitivity; (2) a soft gating module on the edge selects an execution mode: cloud, edge, or collaboration; (3) for collaborative paths, the edge applies adaptive two-layer local differential privacy based on entity risks; and (4) the cloud LLM generates a semantic sketch from the perturbed prompt, which is then refined by the edge-side small language model (SLM) using local context. Our results show that PRISM consistently achieves superior privacy-utility trade-offs across various scenarios, reducing energy consumption and latency to 40-50% of baseline methods such as Uniform and Selective LDP, while maintaining high output quality under strong privacy constraints. These findings are validated through comprehensive evaluations involving realistic prompts, actual energy measurements, and heterogeneous cloud-edge model deployments.
comment: Accepted to AAAI 2026. This is the arXiv preprint version
Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration
Robust transliteration techniques for converting Romanized text into native scripts are crucial for Natural Language Processing tasks, including sentiment analysis, speech recognition, information retrieval, and intelligent personal assistants. Despite significant advancements, state-of-the-art multilingual models still face challenges in handling Romanized script, where the Roman alphabet is adopted to represent the phonetic structure of diverse languages. Within the South Asian context, where the use of Romanized script for Indo-Aryan languages is widespread across social media and digital communication platforms, such usage continues to pose significant challenges for cutting-edge multilingual models. While a limited number of transliteration datasets and models are available for Indo-Aryan languages, they generally lack sufficient diversity in pronunciation and spelling variations, adequate code-mixed data for large language model (LLM) training, and low-resource adaptation. To address this research gap, we introduce a novel transliteration dataset for two popular Indo-Aryan languages, Hindi and Bengali, which are ranked as the 3rd and 7th most spoken languages worldwide. Our dataset comprises nearly 1.8 million Hindi and 1 million Bengali transliteration pairs. In addition, we pre-train a custom multilingual seq2seq LLM based on the Marian architecture using the developed dataset. Experimental results demonstrate significant improvements compared to existing relevant models in terms of BLEU and CER metrics.
comment: Proceedings of the 8th Workshop on Big Data for Cybersecurity (BigCyber)
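CER (character error rate), one of the two metrics named in the abstract above, is conventionally the Levenshtein edit distance between a hypothesis and a reference, divided by the reference length. The sketch below is a generic implementation, not the authors' evaluation script, and it assumes characters are Unicode code points. For scripts such as Devanagari or Bengali, combining vowel signs count as separate characters under this assumption, and a grapheme-based variant would give different numbers.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of `hyp` against a non-empty `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Because insertions are counted, CER can exceed 1.0 when the hypothesis is much longer than the reference.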
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations
Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require finetuning to contend with varying physical factors like robot embodiment, environment characteristics, and spatial relationships of each task. Existing finetuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics. Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks. Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability for adapting VLAs to diverse robotic tasks.
Improving LLM-based Ontology Matching with fine-tuning on synthetic data
Large Language Models (LLMs) are increasingly being integrated into various components of Ontology Matching pipelines. This paper investigates the capability of LLMs to perform ontology matching directly on ontology modules and generate the corresponding alignments. It further explores how a dedicated fine-tuning strategy can enhance the model's matching performance in a zero-shot setting. The proposed method incorporates a search space reduction technique to select relevant subsets from both source and target ontologies, which are then used to automatically construct prompts. Recognizing the scarcity of reference alignments for training, a novel LLM-based approach is introduced for generating a synthetic dataset. This process creates a corpus of ontology submodule pairs and their corresponding reference alignments, specifically designed to fine-tune an LLM for the ontology matching task. The proposed approach was evaluated on the Conference, Geolink, Enslaved, Taxon, and Hydrography datasets from the OAEI complex track. The results demonstrate that the LLM fine-tuned on the synthetically generated data outperforms the non-fine-tuned base model. The key contribution is a strategy that combines automatic dataset generation with fine-tuning to effectively adapt LLMs for ontology matching tasks.
Smarter, not Bigger: Fine-Tuned RAG-Enhanced LLMs for Automotive HIL Testing
Hardware-in-the-Loop (HIL) testing is essential for automotive validation but suffers from fragmented and underutilized test artifacts. This paper presents HIL-GPT, a retrieval-augmented generation (RAG) system integrating domain-adapted large language models (LLMs) with semantic retrieval. HIL-GPT leverages embedding fine-tuning using a domain-specific dataset constructed via heuristic mining and LLM-assisted synthesis, combined with vector indexing for scalable, traceable test case and requirement retrieval. Experiments show that fine-tuned compact models, such as bge-base-en-v1.5, achieve a superior trade-off between accuracy, latency, and cost compared to larger models, challenging the notion that bigger is always better. An A/B user study further confirms that RAG-enhanced assistants improve perceived helpfulness, truthfulness, and satisfaction over general-purpose LLMs. These findings provide insights for deploying efficient, domain-aligned LLM-based assistants in industrial HIL environments.
Extension Condition "violations" and Merge optimality constraints
We analyze, using the mathematical formulation of Merge within the Strong Minimalist Thesis framework, a set of linguistic phenomena, including head-to-head movement, phrasal affixes and syntactic cliticization, verb-particle alternation, and operator-variable phenomena. These are often regarded as problematic, as violations of the Extension Condition (EC). We show that, in fact, all of these phenomena can be explained without involving any EC violation. We first show that derivations using Sideward Merge (SM) are possible for all of these cases: these respect EC, though they involve some amount of optimality violations, with respect to Resource Restrictions cost functions, and the amount of violation differs among these cases. We show that all the cases that involve large optimality violations can be derived in alternative ways involving neither EC nor the use of SM. The main remaining case (head-to-head movement) only involves SM with minimal violations of optimality (near equilibrium fluctuations). We analyze explicitly also the cases of multiple wh-fronting, clusters of clitics in Romance languages and possessor agreement construction in Korean, and how an explanation of these phenomena based on SM can be made compatible with the colored operad generators for phases and theta roles. We also show that the EC condition has a clear algebraic meaning in the mathematical formulation of Merge and is therefore an intrinsic structural algebraic constraint of the model, rather than an additional assumption. We also show that the minimal optimality violating SM plays a structural role in the Markovian properties of Merge, and we compare different optimality conditions coming from Minimal Search and from Resource Restriction in terms of their effect on the dynamics of the Hopf algebra Markov chain, in a simple explicit example.
comment: 85 pages
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking ICASSP 2026
End-to-end spoken dialogue state tracking (DST) is made difficult by the combination of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the efficacy of our proposed method for getting good cross-domain DST performance without relying on spoken training data from the target domains.
comment: submitted to ICASSP 2026
What Shape Is Optimal for Masks in Text Removal?
The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, removing specific text from document images and reconstructing the original images is extremely important for industrial applications. However, most existing text-removal methods focus on deleting simple scene text that appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images that include a large amount of text. From the data, we found that text-removal performance is vulnerable to mask profile perturbation. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.
comment: 12 pages, 17 figures
Exploring Performance Variations in Finetuned Translators of Ultra-Low Resource Languages: Do Linguistic Differences Matter?
Finetuning pre-trained language models with small amounts of data is a commonly used method to create translators for ultra-low resource languages such as endangered Indigenous languages. However, previous works have reported substantially different performances with translators created using similar methodology and data. In this work we systematically explored possible causes of the performance difference, aiming to determine whether it was a product of different cleaning procedures, limitations of the pre-trained models, the size of the base model, or the size of the training dataset, studying both directions of translation. Our studies, using two Brazilian Indigenous languages, related but with significant structural linguistic characteristics, indicated no or very limited influence from those training factors, suggesting differences between languages may play a significant role in the ability to produce translators by fine-tuning pre-trained models.
Mapping Clinical Doubt: Locating Linguistic Uncertainty in LLMs AAAI'26
Large Language Models (LLMs) are increasingly used in clinical settings, where sensitivity to linguistic uncertainty can influence diagnostic interpretation and decision-making. Yet little is known about where such epistemic cues are internally represented within these models. Distinct from uncertainty quantification, which measures output confidence, this work examines input-side representational sensitivity to linguistic uncertainty in medical text. We curate a contrastive dataset of clinical statements varying in epistemic modality (e.g., 'is consistent with' vs. 'may be consistent with') and propose Model Sensitivity to Uncertainty (MSU), a layerwise probing metric that quantifies activation-level shifts induced by uncertainty cues. Our results show that LLMs exhibit structured, depth-dependent sensitivity to clinical uncertainty, suggesting that epistemic information is progressively encoded in deeper layers. These findings reveal how linguistic uncertainty is internally represented in LLMs, offering insight into their interpretability and epistemic reliability.
comment: Accepted to AAAI'26 SECURE-AI4H Workshop
SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning
Continual learning, one's ability to adapt to a sequence of tasks without forgetting previously acquired knowledge, remains a major challenge in machine learning and a key gap between artificial and human intelligence. While regularisation and replay perform well in vision, they lag behind multi-task learning for large language models (LLMs), especially at scale with many tasks. We revisit replay and argue that two failure modes drive this gap: selection (what to rehearse) and integration (how to consolidate new knowledge). To address selection, we propose Surprise-prioritised Replay (SuRe), a simple, architecture-agnostic rule that ranks and stores the most surprising (high Negative Log-Likelihood) sequences. SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting and delivers the best overall average across both Standard CL and LNT benchmarks. To address integration, we add a dual-learner design with fast and slow LoRA adapters merged via an exponential moving average (EMA), enabling rapid adaptation while stabilising long-term knowledge. Combining SuRe with the dual learner yields further gains, including improvements of up to +5 accuracy points on LNT over prior SOTA. Ablation studies confirm that our proposed method remains robust under reduced replay frequency and small buffer size, demonstrating both effectiveness and sample efficiency. Taken together, our results establish replay as a strong baseline for continual LLM fine-tuning and demonstrate that surprise-based selection and slow-weight consolidation are complementary components for mitigating catastrophic forgetting.
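The two ingredients named in the abstract, keeping the highest-NLL sequences for replay and consolidating a fast adapter into a slow one via EMA, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the class name `SurpriseBuffer`, the use of mean per-token NLL as the surprise score, the min-heap eviction policy, and the representation of adapter weights as a flat dictionary of floats.

```python
import heapq
from typing import Dict, List, Tuple


class SurpriseBuffer:
    """Fixed-capacity replay buffer that keeps the most surprising sequences.

    Hypothetical helper: surprise is taken to be the mean negative
    log-likelihood of a sequence's tokens (an assumption).
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Min-heap keyed on NLL, so the least surprising item sits at index 0.
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker so equal NLLs never compare the text

    def add(self, sequence: str, token_log_probs: List[float]) -> None:
        nll = -sum(token_log_probs) / len(token_log_probs)
        item = (nll, self._counter, sequence)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif nll > self._heap[0][0]:
            # Evict the least surprising stored sequence.
            heapq.heapreplace(self._heap, item)

    def contents(self) -> List[str]:
        """Stored sequences, most surprising first."""
        return [seq for _, _, seq in sorted(self._heap, reverse=True)]


def ema_merge(slow: Dict[str, float], fast: Dict[str, float], decay: float) -> Dict[str, float]:
    """Move the slow weights a fraction (1 - decay) toward the fast weights."""
    return {k: decay * slow[k] + (1.0 - decay) * fast[k] for k in slow}
```

With a capacity of 2 and three candidate sequences, only the two with the highest mean NLL remain in the buffer; `ema_merge` with a decay close to 1 changes the slow weights only slightly per step, which is the usual way an EMA trades adaptation speed for stability.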
PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel ASPLOS'26
LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: *one-query-per-CTA* execution repeatedly loads shared prefix KV cache, while *one-size-fits-all* tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention. This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses and runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 67.4% on average and TPOT by 13.6-83.4% under the same configurations against state-of-the-art attention kernels.
comment: Accepted by ASPLOS'26
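To make the merge step concrete, the sketch below shows the standard online-softmax (log-sum-exp) merge of attention results computed separately over disjoint KV chunks, for a single query. It is plain Python for readability and is not PAT's kernel; the function names `partial_attention` and `merge_partials` and the tuple layout are assumptions introduced here.

```python
import math
from typing import List, Sequence, Tuple

# (max score, sum of exp(score - max), softmax-weighted output vector)
Partial = Tuple[float, float, List[float]]


def partial_attention(scores: Sequence[float], values: Sequence[Sequence[float]]) -> Partial:
    """Attention for one query over one KV chunk, keeping softmax statistics."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return m, total, out


def merge_partials(parts: Sequence[Partial]) -> List[float]:
    """Combine chunk results so they equal attention over the full KV range."""
    m = max(p[0] for p in parts)
    # Rescale each chunk's exp-sum to the shared maximum before combining.
    scales = [p[1] * math.exp(p[0] - m) for p in parts]
    total = sum(scales)
    dim = len(parts[0][2])
    return [sum(s * p[2][d] for s, p in zip(scales, parts)) / total for d in range(dim)]
```

Because each chunk carries only its maximum, its exp-sum, and its normalized output, a shared-prefix chunk can be processed once and its result merged with each request's private suffix without revisiting the prefix KV entries.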
Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis
This work contributes towards balancing the inclusivity and global applicability of natural language processing techniques by proposing the first named entity recognition dataset for Kurdish Sorani, a low-resource and under-represented language, consisting of 64,563 annotated tokens. It also provides a tool for facilitating this task in this and many other languages and performs a thorough comparative analysis, including classic machine learning models and neural systems. The results obtained challenge established assumptions about the advantage of neural approaches within the context of NLP. Conventional methods, in particular CRF, obtain an F1-score of 0.825, significantly outperforming BiLSTM-based models (0.706). These findings indicate that simpler and more computationally efficient classical frameworks can outperform neural architectures in low-resource settings.
Sentiment Analysis Of Shopee Product Reviews Using Distilbert
The rapid growth of digital commerce has led to the accumulation of a massive number of consumer reviews on online platforms. Shopee, as one of the largest e-commerce platforms in Southeast Asia, receives millions of product reviews every day containing valuable information regarding customer satisfaction and preferences. Manual analysis of these reviews is inefficient, thus requiring a computational approach such as sentiment analysis. This study examines the use of DistilBERT, a lightweight transformer-based deep learning model, for sentiment classification on Shopee product reviews. The dataset used consists of approximately one million English-language reviews that have been preprocessed and trained using the distilbert-base-uncased model. Evaluation was conducted using accuracy, precision, recall, and F1-score metrics, and compared against benchmark models such as BERT and SVM. The results show that DistilBERT achieved an accuracy of 94.8%, slightly below BERT (95.3%) but significantly higher than SVM (90.2%), with computation time reduced by more than 55%. These findings demonstrate that DistilBERT provides an optimal balance between accuracy and efficiency, making it suitable for large scale sentiment analysis on e-commerce platforms. Keywords: Sentiment Analysis, DistilBERT, Shopee Reviews, Natural Language Processing, Deep Learning, Transformer Models.
comment: 6 pages, 11 figures
Token-Level Marginalization for Multi-Label LLM Classifiers
This paper addresses the critical challenge of deriving interpretable confidence scores from generative language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.
Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation
Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequence space and the complex coupling between sequence, structure, and function. Current state-of-the-art generative methods, such as protein language models (PLMs) and diffusion-based architectures, often require extensive fine-tuning, task-specific data, or model reconfiguration to support objective-directed design, thereby limiting their flexibility and scalability. To overcome these limitations, we present a decentralized, agent-based framework inspired by swarm intelligence for de novo protein design. In this approach, multiple large language model (LLM) agents operate in parallel, each assigned to a specific residue position. These agents iteratively propose context-aware mutations by integrating design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position-wise, decentralized coordination enables emergent design of diverse, well-defined sequences without reliance on motif scaffolds or multiple sequence alignments, validated with experiments on proteins with alpha helix and coil structures. Through analyses of residue conservation, structure-based metrics, and sequence convergence and embeddings, we demonstrate that the framework exhibits emergent behaviors and effective navigation of the protein fitness landscape. Our method achieves efficient, objective-directed designs within a few GPU-hours and operates entirely without fine-tuning or specialized training, offering a generalizable and adaptable solution for protein design. Beyond proteins, the approach lays the groundwork for collective LLM-driven design across biomolecular systems and other scientific discovery tasks.
Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques
Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, existing evaluation and reward mechanisms used to train and assess the text-to-SQL models remain a critical bottleneck. Current approaches heavily rely on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL leverage only the final binary execution outcome as the reward signal, a coarse-grained supervision that overlooks detailed structural and semantic errors from the perspective of rubrics. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a "progressive exploration" strategy during the RL training process, which dynamically adjusts the rewards to enhance the model's performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.
From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation
Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.
Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information
Recent large language models achieve strong reasoning performance by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency approaches typically focus on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. In contrast, we propose a training-free, input-centric approach. Inspired by cognitive psychology, we introduce Focused Chain-of-Thought (F-CoT), which separates information extraction from the reasoning process. F-CoT first organizes the essential information from a query into a concise, structured context and then guides the model to reason exclusively over this context. By preventing attention to irrelevant details, F-CoT naturally produces shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT. These results highlight structured input as a simple yet effective lever for more efficient LLM reasoning.
RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini-2.5-Pro gains only +1.8%, while DeepSeek-R1 declines by -0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.
comment: Project website: https://passing2961.github.io/refinebench-page/
Lips-Jaw and Tongue-Jaw Articulatory Tradeoff in DYNARTmo
This paper investigates how the dynamic articulatory model DYNARTmo accounts for articulatory tradeoffs between primary and secondary articulators, with a focus on lips-jaw and tongue-jaw coordination. While DYNARTmo does not implement full task-dynamic second-order biomechanics, it adopts first-order task-space gesture specifications comparable to those used in articulatory phonology and integrates a simplified mechanism for distributing articulatory effort across multiple articulators. We first outline the conceptual relationship between task dynamics and DYNARTmo, emphasizing the distinction between high-level task-space trajectories and their low-level articulatory execution. We then present simulation results for a set of CV syllables that illustrate how jaw displacement varies as a function of both place of articulation (labial, apical, dorsal) and vowel context (/a/, /i/, /u/). The model reproduces empirically attested patterns of articulatory synergy, including jaw-supported apical closures, lower-lip elevation in bilabial stops, tongue-jaw co-movement, and saturation effects in labial constrictions. These results demonstrate that even with computationally simplified assumptions, DYNARTmo can generate realistic spatio-temporal movement patterns that capture key aspects of articulatory tradeoff and synergy across a range of consonant-vowel combinations.
comment: 12 pages, 3 figures, supplementary material: python code
A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text
The rapid proliferation of Large Language Models (LLMs) has blurred the line between human and machine authorship, creating practical risks for academic integrity and information reliability. Existing text detectors typically rely on a single methodological paradigm and suffer from poor generalization and high false positive rates (FPR), especially on high-stakes academic text. We propose a theoretically grounded hybrid ensemble that systematically fuses three complementary detection paradigms: (i) a RoBERTa-based transformer classifier for deep semantic feature extraction, (ii) a GPT-2-based probabilistic detector using perturbation-induced likelihood curvature, and (iii) a statistical linguistic feature analyzer capturing stylometric patterns. The core novelty lies in an optimized weighted voting framework, where ensemble weights are learned on the probability simplex to maximize F1-score rather than set heuristically. We provide a bias-variance analysis and empirically demonstrate low inter-model correlation (rho ~ 0.35-0.42), a key condition for variance reduction. Evaluated on a large-scale, multigenerator corpus of 30,000 documents, our system achieves 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text. This yields a more reliable and ethically responsible detector for real-world deployment in education and other high-stakes domains.
comment: 24 pages
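As an illustration of learning ensemble weights on the probability simplex to maximize F1, the following sketch runs a coarse grid search over non-negative weights that sum to one for three detectors. The grid search, the 0.5 decision threshold, and the function names are assumptions chosen for brevity; the abstract does not state which optimizer the authors use.

```python
from itertools import product
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 with label 1 as the positive (machine-generated) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_simplex_weights(
    probs: Sequence[Sequence[float]],
    y_true: Sequence[int],
    steps: int = 10,
    threshold: float = 0.5,
) -> Tuple[Tuple[float, float, float], float]:
    """Grid-search weights (w0, w1, w2), each >= 0 and summing to 1.

    probs[i] holds the three detectors' probabilities for document i.
    """
    best_w = (1 / 3, 1 / 3, 1 / 3)
    best_f1 = -1.0
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue  # outside the simplex
        w = (i / steps, j / steps, (steps - i - j) / steps)
        preds: List[int] = [
            1 if sum(wk * pk for wk, pk in zip(w, row)) >= threshold else 0
            for row in probs
        ]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_w, best_f1 = w, score
    return best_w, best_f1
```

In practice the weights would be fitted on a held-out validation split. The low inter-model correlation matters because, for n equal-variance estimators with pairwise correlation rho, the variance of their average is sigma^2 * (rho + (1 - rho) / n), so lower rho leaves more room for variance reduction.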
From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures
Studying how embeddings are organized in space not only enhances model interpretability but also uncovers factors that drive downstream task performance. In this paper, we present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We find a high degree of redundancy among these measures and observe that individual metrics often fail to sufficiently differentiate embedding spaces. Building on these insights, we introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces. We show that UTS can predict model-specific properties and reveal similarities driven by model architecture. Further, we demonstrate the utility of our method by linking topological structure to ranking effectiveness and accurately predicting document retrievability. We find that a holistic, multi-attribute perspective is essential to understanding and leveraging the geometry of text embeddings.
C$^2$DLM: Causal Concept-Guided Diffusion Large Language Models
Autoregressive (AR) language models and Diffusion Language Models (DLMs) constitute the two principal paradigms of large language models. However, both paradigms suffer from insufficient reasoning capabilities. Human reasoning inherently relies on causal knowledge and thought, which are reflected in natural language. But in the AR paradigm, language is modeled as next token prediction (a strictly left-to-right, token-by-token order), whereas natural language itself exhibits more flexible causal structures. In the DLM paradigm, the attention mechanism is fully connected, which entirely disregards causal order. To fill this gap, we propose a Causal Concept-Guided Diffusion Language Model (C$^2$DLM). Starting from DLM's fully connected attention, C$^2$DLM first obtains a concept-level causal graph from the teacher model, and then explicitly guides attention to learn causal relationships between concepts. By focusing on causal relationships and avoiding interference from difficult subgoals involving causal inversion, C$^2$DLM improves by 12% with about a 3.2x training speedup on the COT-OrderPerturb task, and achieves an average gain of 1.31% across six downstream reasoning tasks. More details are available in the repository: https://github.com/Kairong-Han/C-2-DLM.
Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples ACL
Advances in vision-language models (VLMs) have enabled effective cross-modality retrieval. However, when both text and images exist in the database, similarity scores would differ in scale by modality. This phenomenon, known as the modality gap, hinders accurate retrieval. Most existing studies address this issue with manually labeled data, e.g., by fine-tuning VLMs on them. In this work, we propose a similarity standardization approach with pseudo data construction. We first compute the mean and variance of the similarity scores between each query and its paired data in text or image modality. Using these modality-specific statistics, we standardize all similarity scores to compare on a common scale across modalities. These statistics are calculated from pseudo pairs, which are constructed by retrieving the text and image candidates with the highest cosine similarity to each query. We evaluate our method across seven VLMs using two multi-modal QA benchmarks (MMQA and WebQA), where each question requires retrieving either text or image data. Our experimental results show that our method significantly improves retrieval performance, achieving average Recall@20 gains of 64% on MMQA and 28% on WebQA when the query and the target data belong to different modalities. Compared to E5-V, which addresses the modality gap through image captioning, we confirm that our method more effectively bridges the modality gap.
comment: Accepted to PACLIC2025
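The standardization step described in the abstract above can be sketched as follows. This is a minimal illustration, assuming plain per-modality z-scoring; the function names (`modality_stats`, `standardize`), the candidate identifiers, and the toy similarity values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def modality_stats(pseudo_pair_scores):
    """Mean and standard deviation of query-to-pseudo-positive similarities."""
    mu = mean(pseudo_pair_scores)
    sigma = pstdev(pseudo_pair_scores)
    # Guard against zero variance so the division below is always defined.
    return mu, (sigma if sigma > 0 else 1.0)

def standardize(score, stats):
    """Map a raw similarity onto the modality-independent z-score scale."""
    mu, sigma = stats
    return (score - mu) / sigma

# Made-up numbers: text similarities sit on a higher raw scale than image ones.
text_stats = modality_stats([0.80, 0.85, 0.90])
image_stats = modality_stats([0.25, 0.30, 0.35])

# (modality, id, raw cosine similarity to the query)
candidates = [("text", "doc_a", 0.82), ("image", "img_b", 0.34)]
ranked = sorted(
    candidates,
    key=lambda c: standardize(c[2], text_stats if c[0] == "text" else image_stats),
    reverse=True,
)
```

On raw scores the text candidate would rank first; after standardization the image candidate, which is unusually similar for its modality, moves ahead.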
A Hybrid Theory and Data-driven Approach to Persuasion Detection with Large Language Models
Traditional psychological models of belief revision focus on face-to-face interactions, but with the rise of social media, more effective models are needed to capture belief revision at scale in rich, text-based online discourse. Here, we use a hybrid approach, utilizing large language models (LLMs) to develop a model that predicts successful persuasion using features derived from psychological experiments. Our approach leverages LLM-generated ratings of features previously examined in the literature to build a random forest classification model that predicts whether a message will result in belief change. Of the eight features tested, \textit{epistemic emotion} and \textit{willingness to share} were the top-ranking predictors of belief change in the model. Our findings provide insights into the characteristics of persuasive messages and demonstrate how LLMs can enhance models of successful persuasion based on psychological theory. Given these insights, this work has broader applications in fields such as online influence detection and misinformation mitigation, as well as measuring the effectiveness of online narratives.
Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing
Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight, test-time framework that distills the reasoning of large language models into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.
ResearchArcade: Graph Interface for Academic Tasks
Academic research generates diverse data sources, and as researchers increasingly use machine learning to assist research tasks, a crucial question arises: Can we build a unified data interface to support the development of machine learning models for various academic tasks? Models trained on such a unified interface can better support human researchers throughout the research process, eventually accelerating knowledge discovery. In this work, we introduce ResearchArcade, a graph-based interface that connects multiple academic data sources, unifies task definitions, and supports a wide range of base models to address key academic challenges. ResearchArcade utilizes a coherent multi-table format with graph structures to organize data from different sources, including academic corpora from ArXiv and peer reviews from OpenReview, while capturing information with multiple modalities, such as text, figures, and tables. ResearchArcade also preserves temporal evolution at both the manuscript and community levels, supporting the study of paper revisions as well as broader research trends over time. Additionally, ResearchArcade unifies diverse academic task definitions and supports various models with distinct input requirements. Our experiments across six academic tasks demonstrate that combining cross-source and multi-modal information enables a broader range of tasks, while incorporating graph structures consistently improves performance over baseline methods. This highlights the effectiveness of ResearchArcade and its potential to advance research progress.
AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models
Existing AI bias evaluation benchmarks largely reflect Western perspectives, leaving African contexts underrepresented and enabling harmful stereotypes in applications across various domains. To address this gap, we introduce AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community-engaged efforts across Senegal, Kenya, and Nigeria, we collected 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. Using few-shot prompting with human-in-the-loop validation, we augmented the dataset to over 5,000 stereotype-antistereotype pairs. Entries were validated through semantic clustering and manual annotation by culturally informed reviewers. Preliminary evaluation of language models reveals that nine of eleven models exhibit statistically significant bias, with Bias Preference Ratios (BPR) ranging from 0.63 to 0.78 (p <= 0.05), indicating systematic preferences for stereotypes over antistereotypes, particularly across age, profession, and gender dimensions. Domain-specific models appeared to show weaker bias in our setup, suggesting task-specific training may mitigate some associations. Looking ahead, AfriStereo opens pathways for future research on culturally grounded bias evaluation and mitigation, offering key methodologies for the AI community on building more equitable, context-aware, and globally inclusive NLP technologies.
A Trio Neural Model for Dynamic Entity Relatedness Ranking CoNLL 2018
Measuring entity relatedness is a fundamental task for many natural language processing and information retrieval applications. Prior work often studies entity relatedness in static settings and in an unsupervised manner. However, entities in the real world are often involved in many different relationships, and consequently entity relations are highly dynamic over time. In this work, we propose a neural network-based approach for dynamic entity relatedness, leveraging collective attention as supervision. Our model is capable of learning rich and different entity representations in a joint framework. Through extensive experiments on large-scale datasets, we demonstrate that our method achieves better results than competitive baselines.
comment: In Proceedings of CoNLL 2018
Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection NeurIPS 2025
Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers because the decisions are not based on capability, cost, and real-time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real-time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single-agent setup, the online knapsack composer shows a success rate improvement of up to 31.6% in comparison to the retrieval baselines. In multi-agent systems, the online knapsack composer increases success rate from 37% to 87% when agents are selected from an agent inventory of 100+ agents. The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.
comment: Accepted to NeurIPS 2025 Conference
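The budget-constrained selection idea in the abstract above can be illustrated with the classic offline 0/1 knapsack dynamic program. This is only a sketch of the underlying knapsack formulation, not the paper's online composer: the component names, integer costs, and utility values are hypothetical, and the dynamic testing of candidates described in the abstract is not modeled here.

```python
def select_components(components, budget):
    """Pick a subset of components maximizing total utility within a cost budget.

    components: list of (name, cost, utility) tuples with non-negative integer cost.
    Returns (best_utility, chosen_names).
    """
    # best[b] holds the best (utility, chosen names) with total cost <= b.
    best = [(0.0, [])] * (budget + 1)
    for name, cost, utility in components:
        # Iterate budgets downward so each component is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + utility
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]

# Hypothetical inventory: (name, cost, estimated utility).
inventory = [("search_tool", 3, 0.5), ("code_agent", 4, 0.6), ("sql_agent", 5, 0.9)]
utility, chosen = select_components(inventory, budget=8)
```

With a budget of 8, the pair `search_tool` and `sql_agent` fits exactly and yields the highest combined utility, while `code_agent` plus `sql_agent` would exceed the budget.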
Linguistically-Controlled Paraphrase Generation ACL
Controlled paraphrase generation produces paraphrases that preserve meaning while allowing precise control over linguistic attributes of the output. We introduce LingConv, an encoder-decoder framework that enables fine-grained control over 40 linguistic attributes in English. To improve reliability, we introduce a novel inference-time quality control mechanism that iteratively refines attribute embeddings to generate paraphrases that closely match target attributes without sacrificing semantic fidelity. LingConv reduces attribute error by up to 34% over existing models, with the quality control mechanism contributing an additional 14% improvement.
comment: This paper was published in Findings of ACL: EMNLP 2025
One Patient, Many Contexts: Scaling Medical AI with Contextual Intelligence
Medical AI, including clinical language models, vision-language models, and multimodal health record models, already summarizes notes, answers questions, and supports decisions. Their adaptation to new populations, specialties, or care settings often relies on fine-tuning, prompting, or retrieval from external knowledge bases. These strategies can scale poorly and risk contextual errors: outputs that appear plausible but miss critical patient or situational information. We envision context switching as a solution. Context switching adjusts model reasoning at inference without retraining. Generative models can tailor outputs to patient biology, care setting, or disease. Multimodal models can reason on notes, laboratory results, imaging, and genomics, even when some data are missing or delayed. Agent models can coordinate tools and roles based on tasks and users. In each case, context switching enables medical AI to adapt across specialties, populations, and geographies. It requires advances in data design, model architectures, and evaluation frameworks, and establishes a foundation for medical AI that scales to infinitely many contexts while remaining reliable and suited to real-world care.
Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings
Understanding whether large language models (LLMs) and the human brain converge on similar computational principles remains a fundamental and important question in cognitive neuroscience and AI. Do the brain-like patterns observed in LLMs emerge simply from scaling, or do they reflect deeper alignment with the architecture of human language processing? This study focuses on the sentence-level neural mechanisms of language models, systematically investigating how hierarchical representations in LLMs align with the dynamic neural responses during human sentence comprehension. By comparing hierarchical embeddings from 14 publicly available LLMs with fMRI data collected from participants, who were exposed to a naturalistic narrative story, we constructed sentence-level neural prediction models to precisely identify the model layers most significantly correlated with brain region activations. Results show that improvements in model performance drive the evolution of representational architectures toward brain-like hierarchies, particularly achieving stronger functional and anatomical correspondence at higher semantic abstraction levels.
AutoHall: Automated Factuality Hallucination Dataset Generation for Large Language Models
Large language models (LLMs) have gained broad applications across various domains but still struggle with hallucinations. Currently, hallucinations occur frequently in the generation of factual content and pose a great challenge to trustworthy LLMs. However, hallucination detection is hindered by the laborious and expensive manual annotation of hallucinatory content. Meanwhile, as different LLMs exhibit distinct types and rates of hallucination, the collection of hallucination datasets is inherently model-specific, which also increases the cost. To address this issue, this paper proposes a method called AutoHall for Automatically constructing model-specific Hallucination datasets based on existing fact-checking datasets. The empirical results reveal variations in hallucination proportions and types among different models. Moreover, we introduce a zero-resource and black-box hallucination detection method based on self-contradiction to recognize the hallucination in our constructed dataset, achieving superior detection performance compared to baselines. Further analysis on our dataset provides insight into factors that may contribute to LLM hallucinations. Our code and datasets are publicly available at https://github.com/zouyingcao/AutoHall.
comment: Accepted by IEEE Transactions on Audio, Speech, and Language Processing (TASLP)
Event Stream-based Sign Language Translation: A High-Definition Benchmark Dataset and A Novel Baseline
Sign Language Translation (SLT) is a core task in the field of AI-assisted disability. Traditional SLT methods are typically based on visible light videos, which are easily affected by factors such as lighting variations, rapid hand movements, and privacy concerns. This paper proposes the use of bio-inspired event cameras to alleviate the aforementioned issues. Specifically, we introduce a new high-definition event-based sign language dataset, termed Event-CSL, which effectively addresses the data scarcity in this research area. The dataset comprises 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected across diverse indoor and outdoor scenes, covering multiple viewpoints, lighting conditions, and camera motions. We have also benchmarked existing mainstream SLT methods on this dataset to facilitate fair comparisons in future research. Furthermore, we propose a novel event-based sign language translation framework, termed EvSLT. The framework first segments continuous video features into clips and employs a Mamba-based memory aggregation module to compress and aggregate spatial detail features at the clip level. Subsequently, these spatial features, along with temporal representations obtained from temporal convolution, are fused by a graph-guided spatiotemporal fusion module. Extensive experiments on Event-CSL, as well as other publicly available datasets, demonstrate the superior performance of our method. The dataset and source code will be released at https://github.com/Event-AHU/OpenESL.
Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models
The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective of knowledge shortcuts, analyzing hallucinations arising from correct and defect-free data and demonstrating that knowledge-shortcut hallucinations are prevalent in generative models. To mitigate this issue, we propose a high similarity pruning algorithm at the data preprocessing level to reduce spurious correlations in the data. Additionally, we design a specific detection method for knowledge-shortcut hallucinations to evaluate the effectiveness of our mitigation strategy. Experimental results show that our approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively impacting model performance in question answering. This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.
comment: 16 pages, 34 figures
FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models
Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning on pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.
IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization AAAI 2026
Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks as humans do. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs' trait-driven behavior. The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs' behavior and the target trait, while reducing noisy redundancy in reflection without any fine-tuning, leading to evocative and compact trait reflection. Extensive experiments across three human trait systems show that a single IROTE-generated self-reflection can induce LLMs' stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines.
comment: This paper is accepted by AAAI 2026
Masked Diffusion Models as Energy Minimization
We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations--kinetic, conditional kinetic, and geodesic energy--are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
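The two-parameter schedule family mentioned in the abstract above can be sketched as follows. This is a rough illustration under a stated assumption: the schedule value at time t is taken to be the CDF of a Beta(a, b) distribution, which may differ from the paper's exact parameterization. The helper names `beta_cdf` and `schedule` are hypothetical, and the midpoint-rule integration is only meant for shape parameters a, b >= 1.

```python
import math

def beta_cdf(t, a, b, steps=2000):
    """Regularized incomplete beta function I_t(a, b) via midpoint-rule integration.

    Assumes a >= 1 and b >= 1 so the integrand stays bounded on [0, 1].
    """
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta function B(a, b)
    h = t / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += x ** (a - 1) * (1 - x) ** (b - 1)
    return total * h / norm

def schedule(num_steps, a, b):
    """Schedule values at num_steps + 1 evenly spaced times in [0, 1]."""
    return [beta_cdf(k / num_steps, a, b) for k in range(num_steps + 1)]

linear = schedule(4, 1.0, 1.0)   # Beta(1, 1) is uniform, so this schedule is linear
skewed = schedule(4, 2.0, 5.0)   # rises quickly early, then flattens
```

Under this assumption, tuning a schedule after training amounts to a search over the pair (a, b), for example a small grid scored by whatever sample-quality metric is in use, without touching the model weights.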
Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning NeurIPS 2025
Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps due to experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we constructed a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset, and trained CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.
comment: Accepted to NeurIPS 2025. Camera ready version. Code: https://github.com/ZJU-REAL/Mind-the-Gap Project: https://zju-real.github.io/CoT-Bridge/
Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling
State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
comment: Survey; 7 figures, 3 tables, 44 pages
MCTS-SQL: Light-Weight LLMs can Master the Text-to-SQL through Monte Carlo Tree Search AAAI 2026
Text-to-SQL is a fundamental yet challenging task in NLP that aims to translate natural language questions into SQL queries. While recent advances in large language models have greatly improved performance, most existing approaches depend on models with tens of billions of parameters or costly APIs, limiting their applicability in resource-constrained environments. In real-world deployments, especially on edge devices, cost-effectiveness is crucial for Text-to-SQL. Therefore, enabling light-weight models for Text-to-SQL is of great practical significance. However, smaller LLMs often struggle with complicated user instructions, redundant schema linking, or syntax correctness. To address these challenges, we propose MCTS-SQL, a novel framework that uses Monte Carlo Tree Search to guide SQL generation through multi-step refinement. Because light-weight models perform poorly in single-shot prediction, we generate better results through several trials with feedback. However, directly applying MCTS-based methods inevitably leads to significant time and computational overhead. To address this issue, we propose a token-level prefix-cache mechanism that stores prior information during iterations, effectively improving execution speed. Experimental results on the SPIDER and BIRD benchmarks demonstrate the effectiveness of our approach. Using a small open-source model, Qwen2.5-Coder-1.5B, our method outperforms ChatGPT-3.5. When leveraging a more powerful model, Gemini 2.5, to explore the performance upper bound, we achieve results competitive with the SOTA. Our findings demonstrate that even small models can be effectively deployed in practical Text-to-SQL systems with the right strategy.
comment: Accepted by AAAI 2026
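The selection step of a generic Monte Carlo Tree Search, as it might apply to choosing which candidate SQL refinement to expand next, can be sketched with the standard UCT (UCB1) rule. The abstract above does not state which selection rule MCTS-SQL uses, so this is an assumption; the node fields (`sql`, `reward`, `visits`) and the exploration constant are hypothetical.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Standard UCT score: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited candidates first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Pick the child node with the highest UCT score.

    children: list of dicts with keys 'sql', 'reward' (summed), and 'visits'.
    """
    return max(
        children,
        key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits),
    )

# Hypothetical candidate refinements of one query.
children = [
    {"sql": "SELECT name FROM users", "reward": 3.0, "visits": 4},
    {"sql": "SELECT name FROM users WHERE active = 1", "reward": 0.0, "visits": 0},
]
chosen = select_child(children, parent_visits=4)
```

The exploration term shrinks as a candidate accumulates visits, so repeated trials gradually concentrate on refinements whose feedback has been good.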
On the Superimposed Noise Accumulation Problem in Sequential Knowledge Editing of Large Language Models
Sequential knowledge editing techniques aim to continuously update knowledge in large language models at low cost, preventing models from generating outdated or incorrect information. However, existing sequential editing methods suffer from a significant decline in editing success rates after long-term editing. Through theoretical analysis and experiments, our findings reveal that as the number of edits increases, the model's output increasingly deviates from the desired target, leading to a drop in editing success rates. We refer to this issue as the superimposed noise accumulation problem. Our further analysis demonstrates that the problem is related to the erroneous activation of irrelevant knowledge and conflicts between activated knowledge. Based on this analysis, a method named DeltaEdit is proposed that reduces conflicts between knowledge through dynamic orthogonal constraint strategies. Experiments show that DeltaEdit significantly reduces superimposed noise, achieving a 16.8% improvement in editing performance over the strongest baseline.
Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning
Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.
comment: 32 pages, 2 figures
WritingBench: A Comprehensive Benchmark for Generative Writing
Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or on a limited set of writing tasks, failing to capture the diverse requirements of high-quality written content across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length. The framework's validity is further demonstrated by its data curation capability, which enables a 7B-parameter model to outperform GPT-4o in writing. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase that provides a one-stop and reproducible solution with standardized interfaces, integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.
comment: Codebase: https://github.com/EvolvingLMMs-Lab/EASI/; Leaderboard: https://huggingface.co/spaces/lmms-lab-si/EASI-Leaderboard
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
comment: code is released at https://github.com/InternRobotics/G2VLM
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to preserve performance under strict memory constraints, achieving single-step lossless compression and providing error bounds for multi-step compression. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging method, compensating for attention loss resulting from cache merging. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage while successfully retaining essential context information, achieving over 2x inference throughput improvement and maintaining superior generation quality even with only 10% KV cache budgets.
comment: 14 pages, 20 figures
Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education
Background: Simulated patient systems are important in medical education and research, providing safe, integrative training environments and supporting clinical decision making. Advances in artificial intelligence (AI), especially large language models (LLMs), can enhance simulated patients by replicating medical conditions and doctor-patient interactions with high fidelity and at low cost, but effectiveness and trustworthiness remain open challenges. Methods: We developed AIPatient, a simulated patient system powered by LLM-based AI agents. The system uses a retrieval-augmented generation (RAG) framework with six task-specific agents for complex reasoning. To improve realism, it is linked to the AIPatient knowledge graph built from de-identified real patient data in the MIMIC-III intensive care database. Results: We evaluated electronic health record (EHR)-based medical question answering (QA), readability, robustness, stability, and user experience. AIPatient reached 94.15% QA accuracy when all six agents were enabled, outperforming versions with partial or no agent integration. The knowledge base achieved an F1 score of 0.89. Readability scores showed a median Flesch Reading Ease of 68.77 and a median Flesch-Kincaid Grade of 6.4, indicating accessibility for most medical trainees and clinicians. Robustness and stability were supported by non-significant variance in repeated trials (analysis of variance F = 0.61, p > 0.1; F = 0.78, p > 0.1). A user study with medical students showed that AIPatient provides high fidelity, usability, and educational value, comparable to or better than human simulated patients for history taking. Conclusions: LLM-based simulated patient systems can deliver accurate, readable, and reliable medical encounters and show strong potential to transform medical education.
comment: 19 pages, 6 figures, 4 tables
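The two readability metrics cited in the abstract above follow standard published formulas, sketched here for reference. This is a minimal illustration only: the function names and the example counts are assumptions, and the paper's own tooling (including how it counts syllables) is not described in the listing.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease score; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for a short passage (not taken from the paper).
    words, sentences, syllables = 100, 5, 150
    print(flesch_reading_ease(words, sentences, syllables))
    print(flesch_kincaid_grade(words, sentences, syllables))
```

On the conventional interpretation scale, a Reading Ease score in the 60s is usually described as plain English, which is consistent with the abstract's reading of its median of 68.77.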
Extensible Multi-Granularity Fusion Network and Transferable Curriculum Learning for Aspect-based Sentiment Analysis
Aspect-based Sentiment Analysis (ABSA) aims to determine sentiment polarity toward specific aspects in text. Existing methods enrich semantic and syntactic representations through external knowledge or GNNs, but the growing diversity of linguistic features increases model complexity and lacks a unified, extensible framework. We propose an Extensible Multi-Granularity Fusion Network (EMGF) that integrates dependency syntax, constituent syntax, attention-based semantics, and external knowledge graphs. EMGF employs multi-anchor triplet learning and orthogonal projection to effectively fuse multi-granularity features and strengthen their interactions without additional computational overhead. Furthermore, we introduce the first task-specific curriculum learning framework for text-only ABSA, which assigns difficulty scores using five indicators and trains the model from easy to hard to mimic human learning and improve generalization. Experiments on SemEval 2014, Twitter, and MAMS datasets show that EMGF+CL consistently outperforms state-of-the-art ABSA models.
comment: 8 pages, 4 figures
Privacy-Preserving Reasoning with Knowledge-Distilled Parametric Retrieval Augmented Generation
Current RAG systems require uploading plaintext documents to the cloud, risking private data leakage. Parametric RAG (PRAG) encodes documents as LoRA parameters within LLMs, offering a possible way to reduce exposure of raw content. However, it still faces two issues: (1) PRAG demands synthesizing QA pairs and fine-tuning the LLM for each individual document to create its corresponding LoRA, leading to unacceptable inference latency. (2) The performance of PRAG relies solely on synthetic QA data while lacking internal alignment with standard RAG, resulting in poor generalization on out-of-distribution (OOD) inputs. Therefore, achieving high-efficiency parameterization while maintaining RAG-level performance remains a critical challenge for privacy-preserving reasoning. In this paper, we propose DistilledPRAG, a generalizable knowledge-distilled parametric RAG model aligned with standard RAG in document structure and parameter activation. We first synthesize QA pairs from single and multiple documents to enhance cross-document reasoning. Then, we mask the plaintext documents with a special token and translate them to LoRA via a parameter generator, maintaining the standard RAG document structure. Finally, guided by synthetic QA data, we train the parameter generator to match standard RAG's hidden states and output logits, enabling RAG-style reasoning without original documents. Experiments on four QA datasets show that DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data.
CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design
User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.
More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Retrieval-Augmented Generation (RAG) enhances the accuracy of Large Language Model (LLM) responses by leveraging relevant external documents during generation. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for most LLMs, reducing performance by up to 20%. However, Qwen2.5 maintained consistent results across increasing document counts, indicating better multi-document handling capability. Finally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen .
comment: Preprint
KurdSTS: The Kurdish Semantic Textual Similarity
Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
Beyond Introspection: Reinforcing Thinking via Externalist Behavioral Feedback
While inference-time thinking allows Large Language Models (LLMs) to address complex problems, the extended thinking process can be unreliable or inconsistent because of the model's probabilistic nature, especially near its knowledge boundaries. Existing approaches attempt to mitigate this by having the model critique its own reasoning to make corrections. However, such self-critique inherits the same biases as the original output, a problem known as the introspection illusion. Moving beyond such introspection and inspired by core methodologies in ethology, we propose an externalist three-step framework, Distillation-Reinforcement-Reasoning (DRR). Rather than relying on a model's introspection, DRR evaluates its observable behaviors to provide corrective feedback. DRR first distills the reasoner's behavioral traces, then trains a lightweight, external Discriminative Model (DM). At inference time, this DM acts as a critic, identifying and rejecting suspicious reasoning steps. This external feedback compels the LLM to discard flawed pathways and explore alternatives, thereby enhancing reasoning quality without altering the base model. Experiments on multiple reasoning benchmarks show that our framework significantly outperforms prominent self-critique methods. Benefiting from a lightweight and annotation-free design, DRR offers a scalable and adaptable solution for improving the reliability of reasoning in a wide range of LLMs.
Exploring the Human-LLM Synergy in Advancing Theory-driven Qualitative Analysis
Qualitative coding is a demanding yet crucial research method in the field of Human-Computer Interaction (HCI). While recent studies have shown the capability of large language models (LLMs) to perform qualitative coding within theoretical frameworks, their potential for collaborative human-LLM discovery and generation of new insights beyond initial theory remains underexplored. To bridge this gap, we proposed CHALET, a novel approach that harnesses the power of human-LLM partnership to advance theory-driven qualitative analysis by facilitating iterative coding, disagreement analysis, and conceptualization of qualitative data. We demonstrated CHALET's utility by applying it to the qualitative analysis of conversations related to mental-illness stigma, using the attribution model as the theoretical framework. Results highlighted the unique contribution of human-LLM collaboration in uncovering latent themes of stigma across the cognitive, emotional, and behavioral dimensions. We discuss the methodological implications of the human-LLM collaborative approach to theory-based qualitative analysis for the HCI community and beyond.
comment: 51 pages, 6 figures, accepted by ACM Trans. Comput.-Hum. Interact (TOCHI)
TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models
Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While giving various types of answers, such as examples and paraphrases, is an easy task for humans, LLMs struggle to provide correct answers to queries other than definition-type ones. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs' answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance is highest for definition-type questions and lowest for exemplification-type questions. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.
comment: under review
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
comment: 22 pages
COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs
Despite Multimodal Large Language Models (MLLMs) having shown impressive capabilities, they may suffer from hallucinations. Empirically, we find that MLLMs attend disproportionately to task-irrelevant background regions compared with text-only LLMs, implying spurious background-answer correlations. We claim and analyze that (i) outcome-based rewards can be an important factor leading to spurious correlations, and (ii) spurious correlations can be an important factor leading to hallucinations. Based on these results, we propose Causal-Oriented Policy Optimization (COPO) to mitigate these spurious correlations, thus addressing the issue of hallucinations. It imposes token-level sufficiency and necessity constraints to measure each inference token's causal contribution, thus ensuring correct and evidence-grounded output. Specifically, we first evaluate each token's causal contribution via a newly proposed causal completeness reward. This reward is then used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are causally sufficient and necessary for accurate generation. Experimental results across various benchmarks demonstrate the advantages of COPO.
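For readers unfamiliar with GRPO, the group-relative advantage that the abstract above builds on can be sketched as follows. This shows only the commonly used GRPO normalization (each sampled response's reward is standardized against the other responses to the same prompt); it is not COPO's method. How the causal completeness reward is computed and folded into the advantage is not specified in this listing, so the sketch simply takes a list of already-computed scalar rewards. The function name, the `eps` term, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation is used; implementations differ
    on this detail. `eps` guards against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled responses to a single prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so the policy update depends on relative rather than absolute reward.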
Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization
Multimodal summarization aims to generate a concise summary based on the input text and image. However, existing methods can produce non-factual output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios, i.e., a reference-based factuality evaluation framework and a reference-free factuality evaluation framework. Notably, the reference-free factuality evaluation framework does not need ground truth and hence applies to a wider range of scenarios. To evaluate the effectiveness of the proposed frameworks, we compute the correlation between our frameworks and the other metrics. The experimental results show the effectiveness of our proposed method. We will release our code and dataset via GitHub.
comment: project link: https://github.com/for4WARD/FaithfulnessEvaluation
ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER's time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data available at: https://rover-vlm.github.io
Financial Risk Relation Identification through Dual-view Adaptation
A multitude of interconnected risk events -- ranging from regulatory changes to geopolitical tensions -- can trigger ripple effects across firms. Identifying inter-firm risk relations is thus crucial for applications like portfolio management and investment strategy. Traditionally, such assessments rely on expert judgment and manual analysis, which are subjective, labor-intensive, and difficult to scale. To address this, we propose a systematic method for extracting inter-firm risk relations using Form 10-K filings -- authoritative, standardized financial documents -- as our data source. Leveraging recent advances in natural language processing, our approach captures implicit and abstract risk connections through unsupervised fine-tuning based on chronological and lexical patterns in the filings. This enables the development of a domain-specific financial encoder with a deeper contextual understanding and introduces a quantitative risk relation score for transparent, interpretable analysis. Extensive experiments demonstrate that our method outperforms strong baselines across multiple evaluation settings. Our code is available at https://github.com/cnclabs/codes.fin.relation.
comment: 11 pages, 3 figures, EMNLP 2025 Main Conference
Accelerating Training of Recursive Reasoning Models with Curriculum Guided Adaptive Recursion
Recursive reasoning models achieve remarkable performance on complex reasoning tasks through iterative refinement, enabling tiny networks to match large language models thousands of times their size. However, training remains computationally expensive, with prior work reporting approximately 36 GPU-hours per dataset, limiting broader adoption and research. We propose CGAR, a novel training methodology that applies curriculum learning to architectural depth rather than traditional data ordering. CGAR introduces two synergistic components: Progressive Depth Curriculum, which dynamically adjusts recursion depth from shallow to deep configurations during training, preventing early overfitting while reducing computational cost, and Hierarchical Supervision Weighting, which applies exponentially decaying importance to supervision steps, aligning loss weighting with observed gradient magnitude decay. On Sudoku-Extreme with 423,168 test puzzles, CGAR achieves 1.71x training speedup (10.93 to 6.38 hours, 42% cost reduction) with only 0.63% accuracy drop (86.65% to 86.02%). Systematic ablations reveal Progressive Depth Curriculum alone achieves 2.26x speedup with 85.47% accuracy, demonstrating a rare Pareto improvement where architectural curriculum simultaneously enhances training efficiency and solution quality. CGAR-trained models exhibit superior inference efficiency with 100% halting accuracy and 11% fewer reasoning steps. Our work demonstrates that principled curriculum on architectural depth enables efficient training of recursive reasoning models on modest hardware. Code and models: https://github.com/Kaleemullahqasim/CGAR and https://huggingface.co/Kaleemullah/trm-cgar-sudoku
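The headline figures in the abstract above can be reproduced from the reported hours and accuracies, and the idea of exponentially decaying supervision weights can be illustrated with a short sketch. The helper names, the decay factor of 0.7, and the normalization to a unit sum are assumptions made for illustration; the abstract does not give the actual weighting schedule.

```python
from typing import List


def speedup(baseline_hours: float, new_hours: float) -> float:
    """Training speedup as the ratio of baseline time to new time."""
    return baseline_hours / new_hours


def cost_reduction(baseline_hours: float, new_hours: float) -> float:
    """Fraction of training time saved relative to the baseline."""
    return 1.0 - new_hours / baseline_hours


def decaying_weights(num_steps: int, decay: float = 0.7) -> List[float]:
    """Exponentially decaying weights over supervision steps, normalized to sum to 1.

    Assumption: `decay` and the normalization are illustrative choices only.
    """
    raw = [decay ** t for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    print(round(speedup(10.93, 6.38), 2))            # reported as 1.71x
    print(round(100 * cost_reduction(10.93, 6.38)))  # reported as 42%
    print(round(86.65 - 86.02, 2))                   # reported as a 0.63-point drop
    print(decaying_weights(4))
```

With weights of this shape, earlier supervision steps dominate the loss and later steps contribute progressively less, which is the qualitative behavior the abstract attributes to Hierarchical Supervision Weighting.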
Human-Computer Interaction
AI summaries in online search influence users' attitudes
This study examined how AI-generated summaries, which have become visually prominent in online search results, affect how users think about different issues. In a preregistered randomized controlled experiment, participants (N = 2,004) viewed mock search result pages varying in the presence (vs. absence), placement (top vs. middle), and stance (benefit-framed vs. harm-framed) of AI-generated summaries across four publicly debated topics. Compared to a no-summary control group, participants exposed to AI-generated summaries reported issue attitudes, behavioral intentions, and policy support that aligned more closely with the AI summary stance. Summaries placed at the top of the page produced stronger shifts in users' issue attitudes (but not behavioral intentions or policy support) than those placed in the middle of the page. We also observed moderating effects of issue familiarity and general trust toward AI. In addition, users perceived the AI summaries as more useful when they emphasized health harms rather than benefits. These findings suggest that AI-generated search summaries can significantly shape public perceptions, raising important implications for the design and regulation of AI-integrated information ecosystems.
Learning Programming in Informal Spaces: Using Emotion as a Lens to Understand Novice Struggles on r/learnprogramming
Novice programmers experience emotional difficulties in informal online learning environments, where confusion and frustration can hinder motivation and learning outcomes. This study investigates novice programmers' emotional experiences in informal settings, identifies the causes of emotional struggle, and explores design opportunities for affect-aware support systems. We manually annotated 1,500 posts from r/learnprogramming using the Learning-Centered Emotions framework and conducted clustering and axial coding. Confusion, curiosity, and frustration were the most common emotions, often co-occurring and associated with early learning stages. Positive emotions were relatively rare. The primary emotional triggers included ambiguous errors, unclear learning pathways, and misaligned learning resources. We identify five key areas where novice programmers need support in informal learning spaces: stress relief and resilient motivation, topic explanation and resource recommendation, strategic decision-making and learning guidance, technical support, and acknowledgment of their challenges. Our findings highlight the need for intelligent, affect-sensitive mechanisms that provide timely support aligned with learners' emotional states.
Epistemic Fragility in Large Language Models: Prompt Framing Systematically Modulates Misinformation Correction
As large language models (LLMs) rapidly displace traditional expertise, their capacity to correct misinformation has become a core concern. We investigate the idea that prompt framing systematically modulates misinformation correction, a phenomenon we term 'epistemic fragility'. We manipulated prompts by open-mindedness, user intent, user role, and complexity. Across ten misinformation domains, we generated 320 prompts and elicited 2,560 responses from four frontier LLMs, which were coded for strength of misinformation correction and rectification strategy use. Analyses showed that creative intent, expert role, and closed framing led to a significant reduction in correction likelihood and in the effectiveness of the strategies used. We also found striking model differences: Gemini 2.5 Pro had 74% lower odds of strong correction than Claude Sonnet 4.5. These findings highlight epistemic fragility as an important structural property of LLMs, challenging current guardrails and underscoring the need for alignment strategies that prioritize epistemic integrity over conversational compliance.
Agentic AI Framework for Individuals with Disabilities and Neurodivergence: A Multi-Agent System for Healthy Eating, Daily Routines, and Inclusive Well-Being
The paper presents a detailed Agentic Artificial Intelligence (AI) model that would enable people with disabilities and neurodivergence to lead healthier lives and have more regular days. The system will use a multi-layer structure comprising an Application and Interface Layer, an Agents Layer, and a Data Source Layer to provide adaptive, transparent, and inclusive support. At its core, a hybrid reasoning engine will synchronize four special-purpose agents: a Meal Planner Agent for personalized nutrition; a Reminder Agent for adaptive scheduling; a Food Guidance Agent for interactive assistance during grocery shopping and cooking; and a Monitoring Agent for continuous intake and physiological tracking. All the agents interact through a central communication system called the Blackboard/Event Bus, which allows autonomous interaction and real-time feedback loops with multimedia user interfaces. Privacy-sensitive data sources, including electronic health records (EHRs), nutritional databases, wearable sensors, and smart-kitchen Internet of Things devices, are also included in the framework and placed into a policy-controlled layer, which ensures data safety and compliance with consent. Collaborative care and clinician dashboards allow shared supervision, and explainable artificial intelligence (XAI) modules give brief explanations of why a decision was made, making users responsible and reliant. The proposed agentic AI framework extends beyond traditional assistive systems by incorporating inclusiveness, personalization, and accessibility at all levels. It displays the intersection of multi-agent reasoning, multi-modal interfaces, and human-centered design that will enable the development of autonomy, health, and digital equity among people with disabilities and neurodivergence.
comment: Presented at International Conference on Business and Digital Technology, Bahrain, Springer Nature, 27 November 2025
A race to belief: How Evidence Accumulation shapes trust in AI and Human informants
The integration of artificial intelligence into everyday decision-making has reshaped patterns of selective trust, yet the cognitive mechanisms behind context-dependent preferences for AI versus human informants remain unclear. We applied a Bayesian Hierarchical Sequential Sampling Model (HSSM) to analyze how 102 Colombian university students made trust decisions across 30 epistemic (factual) and social (interpersonal) scenarios. Results show that context-dependent trust is primarily driven by differences in drift rate (v), the rate of evidence accumulation, rather than initial bias (z) or response caution (a). Epistemic scenarios produced strong negative drift rates (mean v = -1.26), indicating rapid evidence accumulation favoring AI, whereas social scenarios yielded positive drift rates (mean v = 0.70) favoring humans. Starting points were near neutral (z = 0.52), indicating minimal prior bias. Drift rate showed a strong within-subject association with signed confidence (Fisher-z-averaged r = 0.736; 95 percent bootstrap CI 0.699 to 0.766; 97.8 percent of individual correlations positive, N = 93), suggesting that model-derived evidence accumulation closely mirrors participants' moment-to-moment confidence. These dynamics may help explain the fragility of AI trust: in epistemic domains, rapid but low-vigilance evidence processing may promote uncalibrated reliance on AI that collapses quickly after errors. Interpreted through epistemic vigilance theory, the results indicate that domain-specific vigilance mechanisms modulate evidence accumulation. The findings inform AI governance by highlighting the need for transparency features that sustain vigilance without sacrificing efficiency, offering a mechanistic account of selective trust in human-AI collaboration.
comment: 24 pages, 5 figures
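To make the drift-rate interpretation in the abstract above more concrete, here is a minimal sketch of a two-boundary drift-diffusion simulation. It is not the authors' hierarchical Bayesian fit. The values v = -1.26, v = 0.70, and z = 0.52 come from the abstract; the boundary separation `a=2.0`, the time step, the noise scale, the trial count, and all function names are assumptions made for illustration. Following the sign convention in the abstract, the lower boundary is labeled "AI" and the upper boundary "human".

```python
import random


def simulate_ddm(v, a=2.0, z=0.52, dt=0.001, noise=1.0, max_t=10.0, rng=None):
    """Simulate one trial; return (choice, reaction_time).

    choice is "human" (upper boundary), "AI" (lower boundary),
    or None if no boundary is reached before max_t.
    """
    rng = rng or random.Random()
    x = z * a                 # starting point as a fraction of the boundary separation
    t = 0.0
    sd = noise * dt ** 0.5    # standard deviation of the per-step diffusion noise
    while t < max_t:
        x += v * dt + rng.gauss(0.0, sd)
        t += dt
        if x >= a:
            return "human", t
        if x <= 0.0:
            return "AI", t
    return None, t


def ai_choice_share(v, n_trials=500, seed=0):
    """Fraction of decided trials that end at the "AI" boundary."""
    rng = random.Random(seed)
    choices = [simulate_ddm(v, rng=rng)[0] for _ in range(n_trials)]
    decided = [c for c in choices if c is not None]
    return sum(c == "AI" for c in decided) / len(decided)


if __name__ == "__main__":
    print("epistemic (v = -1.26):", ai_choice_share(-1.26))
    print("social    (v =  0.70):", ai_choice_share(0.70))
```

With a near-neutral starting point, the sign and magnitude of the drift rate alone push most simulated choices toward AI in the epistemic setting and toward humans in the social setting, which is the pattern the abstract attributes to v rather than to z or a.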
GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing
Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.
comment: 10 pages, 7 figures
MATCH: Engineering Transparent and Controllable Conversational XAI Systems through Composable Building Blocks
While the increased integration of AI technologies into interactive systems enables them to solve an increasing number of tasks, the black-box problem of AI models continues to spread throughout the interactive system as a whole. Explainable AI (XAI) techniques can make AI models more accessible by employing post-hoc methods or transitioning to inherently interpretable models. While this makes individual AI models clearer, the overarching system architecture remains opaque. This challenge not only pertains to standard XAI techniques but also to human examination and conversational XAI approaches that need access to model internals to interpret them correctly and completely. To this end, we propose conceptually representing such interactive systems as sequences of structural building blocks. These include the AI models themselves, as well as control mechanisms grounded in literature. The structural building blocks can then be explained through complementary explanatory building blocks, such as established XAI techniques like LIME and SHAP. The flow and APIs of the structural building blocks form an unambiguous overview of the underlying system, serving as a communication basis for both human and automated agents, thus aligning human and machine interpretability of the embedded AI models. In this paper, we present our flow-based approach and a selection of building blocks as MATCH: a framework for engineering Multi-Agent Transparent and Controllable Human-centered systems. This research contributes to the field of (conversational) XAI by facilitating the integration of interpretability into existing interactive systems.
comment: Submitted Version accepted for publication in an LNCS Volume "Engineering Interactive Computer Systems - EICS 2025 - International Workshops and Doctoral Consortium"
Engineering Trustworthy Automation: Design Principles and Evaluation for AutoML Tools for Novices
AutoML systems targeting novices often prioritize algorithmic automation over usability, leaving gaps in users' understanding, trust, and end-to-end workflow support. To address these issues, we propose an abstract pipeline that covers data intake, guided configuration, training, evaluation, and inference. To examine the abstract pipeline, we report a user study where we assess trust, understandability, and UX of a prototype implementation. In a 24-participant study, all participants successfully built their own models, UEQ ratings were positive, yet experienced users reported higher trust and understanding than novices. Based on this study, we propose four design principles to improve the design of AutoML systems targeting novices: (P1) support first-model success to enhance user self-efficacy, (P2) provide explanations to help users form correct mental models and develop appropriate levels of reliance, (P3) provide abstractions and context-aware assistance to keep users in their zone of proximal development, and (P4) ensure predictability and safeguards to strengthen users' sense of control.
comment: Submitted version accepted for publication in an LNCS Volume "Engineering Interactive Computer Systems - EICS 2025 - International Workshops and Doctoral Consortium"
HandyLabel: Towards Post-Processing to Real-Time Annotation Using Skeleton Based Hand Gesture Recognition
The success of machine learning is deeply linked to the availability of high-quality training data, yet retrieving and manually labeling new data remains a time-consuming and error-prone process. Traditional annotation tools, such as Label Studio, often require post-processing, where users label data after it has been recorded. Post-processing is highly time-consuming and labor-intensive, especially with large datasets, and may lead to erroneous annotations because subjects must rely on memory when labeling cognitive activities such as emotions or comprehension levels. In this work, we introduce HandyLabel, a real-time annotation tool that leverages hand gesture recognition to map hand signs to labels. The application enables users to customize gesture mappings through a web-based interface, allowing for real-time annotations. To ensure the performance of HandyLabel, we evaluate several hand gesture recognition models on an open-source hand sign dataset (HaGRID), with and without skeleton-based preprocessing. We found that ResNet50 with preprocessed skeleton-based images achieves an F1-score of 0.923. To validate the usability of HandyLabel, a user study was conducted with 46 participants. The results suggest that 88.9% of participants preferred HandyLabel over traditional annotation tools.
Investigating AI in Peer Support via Multi-Module System-Driven Embodied Conversational Agents
Young people's mental well-being is a global concern, with peer support playing a key role in daily emotional regulation. Conversational agents are increasingly viewed as promising tools for delivering accessible, personalised peer support, particularly where professional counselling is limited. However, existing systems often suffer from rigid input formats, scripted responses, and limited emotional sensitivity. The emergence of large language models introduces new possibilities for generating flexible, context-aware, and empathetic responses. To explore how individuals with psychological training perceive such systems in peer support contexts, we developed an LLM-based multi-module system to drive embodied conversational agents informed by Cognitive Behavioral Therapy (CBT). In a user study (N=10), we qualitatively examined participants' perceptions, focusing on trust, response quality, workflow integration, and design opportunities for future mental well-being support systems.
SoftNash: Entropy-Regularized Nash Games for Non-Fighting Virtual Fixtures
Virtual fixtures (VFs) improve precision in teleoperation but often ``fight'' the user, inflating mental workload and eroding the sense of agency. We propose Soft-Nash Virtual Fixtures, a game-theoretic shared-control policy that softens the classic two-player linear-quadratic (LQ) Nash solution by inflating the fixture's effort weight with a single, interpretable scalar parameter $τ$. This yields a continuous dial on controller assertiveness: $τ=0$ recovers a hard, performance-focused Nash / virtual fixture controller, while larger values of $τ$ reduce gains and pushback, yet preserve the equilibrium structure and continuity of closed-loop stability. We derive Soft-Nash from both a KL-regularized trust-region and a maximum-entropy viewpoint, obtaining a closed-form robot best response that shrinks authority and aligns the fixture with the operator's input as $τ$ grows. We implement Soft-Nash on a 6-DoF haptic device in a 3D tracking task ($n=12$). Moderate softness ($τ\approx 1-3$, especially $τ=2$) maintains tracking error statistically indistinguishable from a tuned classic VF while sharply reducing controller-user conflict, lowering NASA-TLX workload, and increasing Sense of Agency (SoAS). A composite BalancedScore that combines normalized accuracy and non-fighting behavior peaks near $τ=2-3$. These results show that a one-parameter Soft-Nash policy can preserve accuracy while improving comfort and perceived agency, providing a practical and interpretable pathway to personalized shared control in haptics and teleoperation.
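The effect of inflating an effort weight can be illustrated with a toy scalar problem. This is not the Soft-Nash derivation from the paper: it is a one-step, single-player quadratic cost, and the inflation form `(1 + tau) * r`, the weights `q` and `r`, and the function names are all assumptions. The fixture picks a correction u to minimize q*(e + u)^2 + (1 + tau)*r*u^2 for a tracking error e, which gives u = -k(tau)*e with k(tau) = q / (q + (1 + tau)*r).

```python
def fixture_gain(tau, q=1.0, r=1.0):
    """Closed-form gain of the toy one-step problem (assumed cost, see text)."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return q / (q + (1.0 + tau) * r)


def fixture_correction(error, tau, q=1.0, r=1.0):
    """Correction the toy fixture applies for a given tracking error."""
    return -fixture_gain(tau, q, r) * error


if __name__ == "__main__":
    for tau in (0.0, 1.0, 2.0, 3.0, 5.0):
        print(f"tau={tau:.0f}  gain={fixture_gain(tau):.3f}")
```

In this toy setting, tau = 0 gives the stiffest correction and the gain decreases monotonically as tau grows, which mirrors the "continuous dial on controller assertiveness" described in the abstract.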
EAST: Environment-Aware Stylized Transition Along the Reality-Virtuality Continuum
In the Virtual Reality (VR) gaming industry, maintaining immersion during real-world interruptions remains a challenge, particularly during transitions along the reality-virtuality continuum (RVC). Existing methods tend to rely on digital replicas or simple visual transitions, neglecting to address the aesthetic discontinuities between real and virtual environments, especially in highly stylized VR games. This paper introduces the Environment-Aware Stylized Transition (EAST) framework, which employs a novel style-transferred 3D Gaussian Splatting (3DGS) technique to transfer real-world interruptions into the virtual environment with seamless aesthetic consistency. Rather than merely transforming the real world into game-like visuals, EAST minimizes the disruptive impact of interruptions by integrating real-world elements within the framework. Qualitative user studies demonstrate significant enhancements in cognitive comfort and emotional continuity during transitions, while quantitative experiments highlight EAST's ability to maintain visual coherence across diverse VR styles.
When Are Reactive Notebooks Not Reactive?
Computational notebooks are convenient for programmers, but can easily become confusing and inconsistent due to the ability to incrementally edit a program that is running. Recent reactive notebook systems, such as Ipyflow, Marimo and Observable, strive to keep notebook state in sync with the current cell code by re-executing a minimal set of cells upon modification. However, each system defines reactivity a different way. Additionally, within any definition, we find simple notebook modifications that can break each system. Overall, these inconsistencies make it difficult for users to construct a mental model of their reactive notebook's implementation. This paper proposes Rex, a fine-grained test suite to discuss and assess reactivity capabilities within reactive notebook systems. We evaluate Rex on three existing reactive notebook systems and classify their failures with the aims of (i) helping programmers understand when reactivity fails and (ii) helping notebook implementations improve.
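One common notion of reactivity, re-executing the edited cell plus every cell that transitively depends on it, can be sketched as follows. This is an illustrative model only and does not describe how Ipyflow, Marimo, or Observable are implemented; the cell representation (sets of defined and referenced names) and the function name `cells_to_rerun` are hypothetical.

```python
from collections import deque


def cells_to_rerun(cells, edited):
    """Return the cells to re-execute after `edited` changes, in notebook order.

    cells maps cell id -> (defined_names, referenced_names); dict order is
    taken to be the notebook's top-to-bottom order.
    """
    # Cell b depends on cell a when b references a name that a defines.
    dependents = {cid: set() for cid in cells}
    for a, (defs_a, _) in cells.items():
        for b, (_, refs_b) in cells.items():
            if a != b and defs_a & refs_b:
                dependents[a].add(b)

    # Breadth-first search over the dependency edges, starting at the edit.
    stale = {edited}
    queue = deque([edited])
    while queue:
        current = queue.popleft()
        for nxt in dependents[current]:
            if nxt not in stale:
                stale.add(nxt)
                queue.append(nxt)

    return [cid for cid in cells if cid in stale]


notebook = {
    "c1": ({"x"}, set()),   # x = 1
    "c2": ({"y"}, {"x"}),   # y = x + 1
    "c3": ({"z"}, {"y"}),   # z = y * 2
    "c4": ({"w"}, set()),   # w = 10
}
print(cells_to_rerun(notebook, "c1"))
```

A static name-based graph like this cannot see mutation through aliases or side effects, which is one way a definition of reactivity can diverge from actual notebook state.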
Integrating Semantic Communication and Human Decision-Making into an End-to-End Sensing-Decision Framework
As early as 1949, Weaver defined communication in a very broad sense to include all procedures by which one mind or technical system can influence another, thus establishing the idea of semantic communication. With the recent success of machine learning in expert assistance systems where sensed information is wirelessly provided to a human to assist task execution, the need to design effective and efficient communications has become increasingly apparent. In particular, semantic communication aims to convey the meaning behind the sensed information relevant for Human Decision-Making (HDM). Regarding the interplay between semantic communication and HDM, many questions remain, such as how to model the entire end-to-end sensing-decision-making process, how to design semantic communication for the HDM and which information should be provided for HDM. To address these questions, we propose to integrate semantic communication and HDM into one probabilistic end-to-end sensing-decision framework that bridges communications and psychology. In our interdisciplinary framework, we model the human through a HDM process, allowing us to explore how feature extraction from semantic communication can best support HDM both in theory and in simulations. In this sense, our study reveals the fundamental design trade-off between maximizing the relevant semantic information and matching the cognitive capabilities of the HDM model. Our initial analysis shows how semantic communication can balance the level of detail with human cognitive capabilities while demanding less bandwidth, power, and latency.
comment: Submitted for peer review. Code available at https://github.com/ant-uni-bremen/SINFONY
F.A.C.U.L.: Language-Based Interaction with AI Companions in Gaming
In cooperative video games, traditional AI companions are deployed to assist players, who control them using hotkeys or command wheels to issue predefined commands such as ``attack'', ``defend'', or ``retreat''. Despite their simplicity, these methods, which lack target specificity, limit players' ability to give complex tactical instructions and hinder immersive gameplay experiences. To address this problem, we propose the FPS AI Companion who Understands Language (F.A.C.U.L.), the first real-time AI system that enables players to communicate and collaborate with AI companions using natural language. By integrating natural language processing with a confidence-based framework, F.A.C.U.L. efficiently decomposes complex commands and interprets player intent. It also employs a dynamic entity retrieval method for environmental awareness, aligning human intentions with decision-making. Unlike traditional rule-based systems, our method supports real-time language interactions, enabling players to issue complex commands such as ``clear the second floor'', ``take cover behind that tree'', or ``retreat to the river''. The system provides real-time behavioral responses and vocal feedback, ensuring seamless tactical collaboration. Using the popular FPS game \textit{Arena Breakout: Infinite} as a case study, we present comparisons demonstrating the efficacy of our approach and discuss the advantages and limitations of AI companions based on real-world user feedback.
comment: 14 pages, 11 figures
Exploring the Human-LLM Synergy in Advancing Theory-driven Qualitative Analysis
Qualitative coding is a demanding yet crucial research method in the field of Human-Computer Interaction (HCI). While recent studies have shown the capability of large language models (LLMs) to perform qualitative coding within theoretical frameworks, their potential for collaborative human-LLM discovery and generation of new insights beyond initial theory remains underexplored. To bridge this gap, we proposed CHALET, a novel approach that harnesses the power of human-LLM partnership to advance theory-driven qualitative analysis by facilitating iterative coding, disagreement analysis, and conceptualization of qualitative data. We demonstrated CHALET's utility by applying it to the qualitative analysis of conversations related to mental-illness stigma, using the attribution model as the theoretical framework. Results highlighted the unique contribution of human-LLM collaboration in uncovering latent themes of stigma across the cognitive, emotional, and behavioral dimensions. We discuss the methodological implications of the human-LLM collaborative approach to theory-based qualitative analysis for the HCI community and beyond.
comment: 51 pages, 6 figures, accepted by ACM Trans. Comput.-Hum. Interact (TOCHI)
Human-AI Programming Role Optimization: Developing a Personality-Driven Self-Determination Framework CCS 2012
As artificial intelligence transforms software development, a critical question emerges: how can developers and AI systems collaborate most effectively? This dissertation optimizes human-AI programming roles through self-determination theory and personality psychology, introducing the Role Optimization Motivation Alignment (ROMA) framework. Through Design Science Research spanning five cycles, this work establishes empirically-validated connections between personality traits, programming role preferences, and collaborative outcomes, engaging 200 experimental participants and 46 interview respondents. Key findings demonstrate that personality-driven role optimization significantly enhances self-determination and team dynamics, yielding 23% average motivation increases among professionals and up to 65% among undergraduates. Five distinct personality archetypes emerge: The Explorer (high Openness/low Agreeableness), The Orchestrator (high Extraversion/Agreeableness), The Craftsperson (high Neuroticism/low Extraversion), The Architect (high Conscientiousness), and The Adapter (balanced profile). Each exhibits distinct preferences for programming roles (Co-Pilot, Co-Navigator, Agent), with assignment modes proving crucial for satisfaction. The dissertation contributes: (1) an empirically-validated framework linking personality traits to role preferences and self-determination outcomes; (2) a taxonomy of AI collaboration modalities mapped to personality profiles while preserving human agency; and (3) an ISO/IEC 29110 extension enabling Very Small Entities to implement personality-driven role optimization within established standards. Keywords: artificial intelligence, human-computer interaction, behavioral software engineering, self-determination theory, personality psychology, phenomenology, intrinsic motivation, pair programming, design science research, ISO/IEC 29110
comment: PhD Dissertation, Prague University of Economics and Business, 2025. 323 pages. ACM CCS 2012: Human-computer interaction, Collaborative interaction, Human-AI collaborative systems, Pair programming, AI-assisted software engineering
Interactive Groupwise Comparison for Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) has emerged as a key enabling technology for aligning AI behaviour with human preferences. The traditional way to collect data in RLHF is via pairwise comparisons: human raters are asked to indicate which one of two samples they prefer. We present an interactive visualisation that better exploits the human visual ability to compare and explore whole groups of samples. The interface is comprised of two linked views: 1) an exploration view showing a contextual overview of all sampled behaviours organised in a hierarchical clustering structure; and 2) a comparison view displaying two selected groups of behaviours for user queries. Users can efficiently explore large sets of behaviours by iterating between these two views. Additionally, we devised an active learning approach suggesting groups for comparison. As shown by our evaluation in six simulated robotics tasks, our approach increases the final rewards by 69.34%. It leads to lower error rates and better policies. We open-source the code that can be easily integrated into the RLHF training loop, supporting research on human-AI alignment.
comment: 10 pages, 8 figures in proceedings of Computer Graphics Forum
Computer Vision and Pattern Recognition
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
comment: 24 pages; webpage: https://snap-research.github.io/canvas-to-image/
TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
comment: code is released at https://github.com/InternRobotics/G2VLM
Seeing without Pixels: Perception from Camera Trajectories
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
comment: Project website: https://sites.google.com/view/seeing-without-pixels
Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models
Gliomas are brain tumors with a high mortality rate, so early and accurate diagnosis is important for therapeutic intervention. To address this difficulty, the proposed research develops a hybrid deep learning model that integrates U-Net-based segmentation with a hybrid DenseNet-VGG classification network equipped with multi-head attention and spatial-channel attention. The segmentation model demarcates tumors in a 3D volume of MRI data, guided by spatial and contextual information. The classification network, which combines a DenseNet branch and a VGG branch, takes the demarcated tumor as input and uses attention mechanisms to focus on clinically relevant features. High-dimensional 3D MRI data are made usable by the model through preprocessing steps: normalization, resampling, and data augmentation. The framework is evaluated with several measures: Dice coefficient and mean Intersection over Union (IoU) for segmentation, and accuracy, precision, recall, and F1-score for classification. In testing, the proposed hybrid framework obtained a Dice coefficient of 98% in tumor segmentation and a classification accuracy of 99%, outperforming traditional CNN models and attention-free methods. Multi-head attention prioritizes clinically significant aspects of the tumor and improves interpretability and accuracy. The results suggest that the framework has great potential to support timely and reliable diagnosis and grading of glioma by clinicians, allowing for better planning of patient treatment.
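The two segmentation metrics named in the abstract above follow standard definitions: Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. The sketch below computes both for flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions, and a real pipeline would typically operate on 3D arrays rather than Python lists.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two equal-length binary masks given as flat sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_count = sum(1 for p in pred if p)
    target_count = sum(1 for t in target if t)
    union = pred_count + target_count - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_count + target_count)
    iou = intersection / union
    return dice, iou


# Tiny flattened example: one voxel overlaps out of two predicted and two true.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For any single pair of masks, Dice is never lower than IoU, so figures reported on the two scales are not directly comparable.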
Uncertainty Quantification for Visual Object Pose Estimation SP
Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics problem, attaching statistically rigorous uncertainty is not well understood without strict distributional assumptions. We develop distribution-free pose uncertainty bounds about a given pose estimate in the monocular setting. Our pose uncertainty only requires high probability noise bounds on pixel detections of 2D semantic keypoints on a known object. This noise model induces an implicit, non-convex set of pose uncertainty constraints. Our key contribution is SLUE (S-Lemma Uncertainty Estimation), a convex program to reduce this set to a single ellipsoidal uncertainty bound that is guaranteed to contain the true object pose with high probability. SLUE solves a relaxation of the minimum volume bounding ellipsoid problem inspired by the celebrated S-lemma. It requires no initial guess of the bound's shape or size and is guaranteed to contain the true object pose with high probability. For tighter uncertainty bounds at the same confidence, we extend SLUE to a sum-of-squares relaxation hierarchy which is guaranteed to converge to the minimum volume ellipsoidal uncertainty bound for a given set of keypoint constraints. We show this pose uncertainty bound can easily be projected to independent translation and axis-angle orientation bounds. We evaluate SLUE on two pose estimation datasets and a real-world drone tracking scenario. Compared to prior work, SLUE generates substantially smaller translation bounds and competitive orientation bounds. We release code at https://github.com/MIT-SPARK/PoseUncertaintySets.
comment: 18 pages, 9 figures. Code available: https://github.com/MIT-SPARK/PoseUncertaintySets
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
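The Top-K masking under an $L_{\infty}$ budget described above can be illustrated with a small sketch. This is not the ADVLA implementation: ADVLA perturbs features projected into the textual feature space, whereas the hypothetical helpers below (`topk_patch_mask`, `apply_sparse_perturbation`) operate on a toy list of patch values and assume that per-patch attention scores are already available.

```python
def topk_patch_mask(attention, k):
    """Return a 0/1 mask selecting the k patches with the highest attention score."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(attention))]


def apply_sparse_perturbation(patches, delta, mask, eps):
    """Add delta to the masked patches only, clipping each component to [-eps, eps]."""
    out = []
    for patch, d, m in zip(patches, delta, mask):
        if m:
            out.append([x + max(-eps, min(eps, dx)) for x, dx in zip(patch, d)])
        else:
            out.append(list(patch))  # unselected patches stay untouched
    return out


# Hypothetical toy data: four patches with two values each.
attention = [0.10, 0.70, 0.05, 0.15]
patches = [[0.2, 0.3], [0.5, 0.5], [0.9, 0.1], [0.4, 0.6]]
delta = [[0.5, -0.5]] * 4   # raw perturbation, far larger than the budget
eps = 4 / 255               # the L_inf bound quoted in the abstract

mask = topk_patch_mask(attention, k=1)
adv = apply_sparse_perturbation(patches, delta, mask, eps)
```

Only the patch with the highest attention score is modified, and each of its components moves by at most `eps`.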
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow
Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow
Continual Error Correction on Low-Resource Devices ACM MM
The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
comment: ACM MMSys 2025
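The idea of correcting errors through prototype updates rather than model retraining can be sketched as a nearest-prototype classifier with running-mean prototypes. The class `PrototypeClassifier`, the Euclidean distance, the running-mean update, and the two-dimensional embeddings are all assumptions made for illustration; the abstract does not specify the paper's actual update rule.

```python
import math


class PrototypeClassifier:
    """Nearest-prototype classifier over fixed embeddings (illustrative sketch)."""

    def __init__(self):
        self.prototypes = {}  # label -> (mean embedding, number of samples seen)

    def update(self, label, embedding):
        """Fold one embedding into the running mean for `label`."""
        if label not in self.prototypes:
            self.prototypes[label] = (list(embedding), 1)
            return
        mean, n = self.prototypes[label]
        new_mean = [(m * n + e) / (n + 1) for m, e in zip(mean, embedding)]
        self.prototypes[label] = (new_mean, n + 1)

    def predict(self, embedding):
        """Return the label whose prototype is nearest in Euclidean distance."""
        return min(
            self.prototypes,
            key=lambda label: math.dist(self.prototypes[label][0], embedding),
        )


clf = PrototypeClassifier()
clf.update("pizza", [1.0, 0.0])
clf.update("salad", [0.0, 1.0])

x = [0.45, 0.55]              # hypothetical embedding of a misclassified image
before = clf.predict(x)       # nearest prototype is "salad"
clf.update("pizza", x)        # one-shot user correction: no gradient step needed
after = clf.predict(x)        # the updated "pizza" prototype is now nearest
```

The correction touches only one stored vector per class, which is why this style of update suits devices with little compute and storage.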
Mechanisms of Non-Monotonic Scaling in Vision Transformers
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
comment: 16 pages total (11 pages main text, 1 page references, 4 pages appendix), 5 figures, 11 tables. Code available at https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb
Qwen3-VL Technical Report
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
comment: 42 pages
Active Learning for GCN-based Action Recognition
Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.
Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation
Oral cancer is highly common across the globe and is mostly diagnosed at later stages because benign, precancerous, and malignant lesions in the oral cavity look very similar. Implementing computer-aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research uses deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting with advanced data augmentation and oversampling. The experimental results, with 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state-of-the-art methods now in use. They also show the effectiveness of oversampling and augmentation strategies in situations where minority-class classification performance matters. As a first step toward trustworthy computer-aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.
comment: 12 pages, 3 figures
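Stratified data splitting, as mentioned in the abstract above, keeps each class at roughly the same ratio in the training and test sets. The sketch below is a generic illustration, not the paper's pipeline; the function `stratified_split`, the seed, the 25% test fraction, and the toy labels are all hypothetical. It assumes every class has at least two samples, since each class contributes at least one test sample.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with each class split at roughly the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local generator so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Every class contributes at least one test sample, even a rare one.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced label list: 8 samples of one class, 2 of another.
labels = ["common"] * 8 + ["rare"] * 2
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
```

With these toy labels the test set receives two "common" samples and one "rare" sample, so the minority class is represented in both partitions.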
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss
Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.
comment: 9 pages, 3 figures, 1 table
Multimodal Robust Prompt Distillation for 3D Point Cloud Models
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation takes place entirely during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.
UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.
comment: 10 pages, 6 figures
Video Generation Models Are Good Latent Reward Models
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects
Bangla Sign Language Translation (BdSLT) has been severely constrained so far because the language itself is very low-resource. Creating a standard sentence-level dataset for BdSLT is of immense importance for developing AI-based assistive tools for deaf and hard-of-hearing people in the Bangla-speaking community. In this paper, we present a dataset, IsharaKhobor, and two subsets of it to enable research. We also describe the challenges of developing the dataset and suggest a way forward by benchmarking with landmark-based raw and RQE embeddings. We ablate vocabulary restriction and canonicalization within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].
comment: 14 pages, 8 tables
The Age-specific Alzheimer's Disease Prediction with Characteristic Constraints in Nonuniform Time Span
Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer's disease poses challenges, particularly in accurately representing the disease's characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.
comment: 16 pages, 9 figures
EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?
Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.
Self-Paced Learning for Images of Antinuclear Antibodies
Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.
comment: IEEE Transactions on Medical Imaging
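Two of the multi-label metrics reported above, Hamming loss and one-error, have simple standard definitions that the following sketch spells out. The function names and the toy labels and scores are hypothetical and unrelated to the paper's data.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label entries that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def one_error(y_true, scores):
    """Fraction of samples whose top-scored label is not among the true labels."""
    misses = 0
    for truth, sample_scores in zip(y_true, scores):
        top = max(range(len(sample_scores)), key=lambda i: sample_scores[i])
        if not truth[top]:
            misses += 1
    return misses / len(y_true)


# Hypothetical toy example: two samples, three candidate labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
scores = [[0.9, 0.2, 0.4], [0.1, 0.3, 0.8]]

hl = hamming_loss(y_true, y_pred)   # 2 wrong entries out of 6
oe = one_error(y_true, scores)      # the second sample's top label is wrong
```

Lower is better for both metrics, which is why the abstract reports them as reductions.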
Generalized Design Choices for Deepfake Detectors
The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.
comment: 12 pages, 9 figures, 10 tables, code available: https://github.com/MI-BioLab/AI-GenBench
CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation WACV 2026
We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD
comment: WACV 2026 Accepted
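To make the notion of each student pixel attending to all teacher pixels concrete, here is a minimal scaled dot-product cross-attention sketch. It is not the CanKD implementation: it omits learned projections and the distillation loss, and the function `cross_attention`, the list-of-vectors layout, and the toy features are assumptions for illustration only.

```python
import math


def cross_attention(student, teacher):
    """For each student pixel (query), return the softmax-weighted mix of teacher pixels."""
    dim = len(student[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in student:
        # Similarity of this student pixel to every teacher pixel.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in teacher]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of teacher pixels: the non-local teacher summary for this pixel.
        out.append([sum(w * k[d] for w, k in zip(weights, teacher)) for d in range(dim)])
    return out


# Hypothetical flattened feature maps: 2 student pixels, 3 teacher pixels, 2 channels.
student = [[0.3, 0.1], [1.0, -1.0]]
teacher = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
attended = cross_attention(student, teacher)
```

Each output row is a convex combination of teacher pixels, so it always lies within the range spanned by the teacher features.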
Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning
We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of current task by combining the model parameters within current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.
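The two ingredients named above, merging by weight averaging and a bounded update, can be sketched on flat weight vectors. The abstract does not give the exact form of the bounded update, so the L2-norm clipping below is an assumption, and the helper names `average_weights` and `bounded_update` are hypothetical.

```python
import math


def average_weights(models):
    """Element-wise mean of several flat weight vectors of equal length."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def bounded_update(base, candidate, max_norm):
    """Move from base toward candidate, rescaling the step if its L2 norm exceeds max_norm."""
    step = [c - b for b, c in zip(base, candidate)]
    norm = math.sqrt(sum(s * s for s in step))
    scale = 1.0 if norm <= max_norm else max_norm / norm
    return [b + scale * s for b, s in zip(base, step)]


# Hypothetical weights from two earlier stages, merged into one model.
merged = average_weights([[1.0, 2.0], [3.0, 4.0]])

# A candidate far from the old weights is pulled back to within distance 1.0.
bounded = bounded_update([0.0, 0.0], [3.0, 4.0], max_norm=1.0)
```

Keeping the new weights close to the old ones is the intuition the abstract gives for reduced catastrophic forgetting.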
Frequency-Aware Token Reduction for Efficient Vision Transformer
Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity with respect to token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as the rank collapsing and over-smoothing phenomena. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. High-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing. Furthermore, we analyze previous methods, shedding light on their implicit frequency characteristics and limitations.
comment: NeurIPS 2025
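The partition into high-frequency and low-frequency tokens can be illustrated with a toy reduction step. The scoring rule here is an assumption, not the paper's criterion: a token's high-frequency content is taken to be its squared deviation from the mean token (the direct current component), and the function `reduce_tokens` and the example tokens are hypothetical.

```python
def reduce_tokens(tokens, keep):
    """Keep the `keep` tokens that deviate most from the mean; merge the rest into one token."""
    n = len(tokens)
    dim = len(tokens[0])
    # The mean over tokens plays the role of the direct current (DC) component.
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    # Squared deviation from the mean, used here as a proxy for high-frequency content.
    energy = [sum((t[d] - mean[d]) ** 2 for d in range(dim)) for t in tokens]
    order = sorted(range(n), key=lambda i: energy[i], reverse=True)
    kept_idx = sorted(order[:keep])
    kept_set = set(kept_idx)
    rest_idx = [i for i in range(n) if i not in kept_set]

    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept
    # Average the low-frequency tokens into a single compact token.
    dc_token = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx) for d in range(dim)]
    return kept + [dc_token]


# Hypothetical tokens: three similar ones and one clear outlier.
tokens = [[1.0, 1.0], [1.2, 0.8], [5.0, -3.0], [0.8, 1.2]]
reduced = reduce_tokens(tokens, keep=1)
```

Four tokens shrink to two: the outlier is preserved as is, and the three similar tokens collapse into one averaged token.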
MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. Our contributions are threefold: (1) We analyze the performance of linear attention modules and softmax attention modules on mobile devices, and propose a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, each frame of 720p video is generated in less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.
comment: Our demo and code: https://github.com/hustvl/MobileI2V
EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation
Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
comment: 10 pages, 5 figures
E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework
3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been tackled with deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, these methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, existing solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds of fractured fragments, containing both point positions and colors, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experiment on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious. On the RePAIR dataset, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% compared to competing methods.
SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning
DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models
We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.
Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
comment: preprint
Monet: Reasoning in Latent Visual Space Beyond Images and Language
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes
Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.
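For readers unfamiliar with scale-invariant depth supervision, the sketch below implements the widely used log-depth formulation, loss = mean(d^2) - lam * mean(d)^2 with d = log(pred) - log(target), and restricts it to pixels whose confidence passes a threshold. The threshold-style gating, the parameter names, and the default values are assumptions for illustration; the abstract does not give the exact loss used in Endo-G$^{2}$T.

```python
from math import log


def scale_invariant_depth_loss(pred, target, confidence, lam=1.0, min_conf=0.5):
    """Scale-invariant log-depth loss over pixels with confidence >= min_conf."""
    # Log-depth residuals for confident pixels only (assumed gating rule).
    d = [log(p) - log(t) for p, t, c in zip(pred, target, confidence) if c >= min_conf]
    n = len(d)
    if n == 0:
        return 0.0
    # mean(d^2) - lam * mean(d)^2; with lam = 1 a global depth scale cancels out.
    return sum(x * x for x in d) / n - lam * sum(d) ** 2 / n ** 2


prior = [1.0, 2.0, 4.0]            # monocular depth prior (arbitrary scale)
rendered = [2.0, 4.0, 8.0]         # rendered depth, off by a global factor of 2
conf = [1.0, 1.0, 1.0]
print(scale_invariant_depth_loss(rendered, prior, conf))
```

Because the rendered depth differs from the prior only by a constant factor, the loss is zero up to floating-point error, which is the property that makes such a term usable with monocular depth of unknown scale.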
PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation
Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulty predicting normals accurately and efficiently for various point clouds. In this work, we present a new approach to feature extraction for robust normal estimation of point clouds. We fuse multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We model patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation, and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture structural characteristics over a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and less running time.
comment: Accepted by TVCG
BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla
Natural disasters remain a major challenge for Bangladesh, so real-time monitoring and quick response systems are essential. In this study, we present BanglaMM-Disaster, an end-to-end deep learning-based multimodal framework for disaster classification in Bangla, using both textual and visual data from social media. We constructed a new dataset of 5,037 Bangla social media posts, each consisting of a caption and a corresponding image, annotated into one of nine disaster-related categories. The proposed model integrates transformer-based text encoders, including BanglaBERT, mBERT, and XLM-RoBERTa, with CNN backbones such as ResNet50, DenseNet169, and MobileNetV2, to process the two modalities. Using early fusion, the best model achieves 83.76% accuracy. This surpasses the best text-only baseline by 3.84% and the image-only baseline by 16.91%. Our analysis also shows reduced misclassification across all classes, with noticeable improvements for ambiguous examples. This work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.
comment: Presented at the 2025 IEEE International Conference on Signal Processing, Information, Communication and Systems (SPICSCON), November 21-22, 2025, University of Rajshahi, Bangladesh. 6 pages, 9 disaster classes, multimodal dataset with 5,037 samples
SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding
Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
comment: 10 pages, 5 figures
Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure
This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset was recorded under various weather and lighting conditions and comprises 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (±0.8%) with a per-frame inference time of 9.5 ms. This sub-10 ms latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, a traffic flow-control infrastructure deployed in over 20 cities worldwide.
comment: 8 pages, 6 figures. This is a preprint of a paper accepted for presentation at the 2025 International Conference on Image and Vision Computing New Zealand (IVCNZ). The final version will appear in IEEE Xplore
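To make the terms "latency-driven spike conversion" and "Leaky Integrate-and-Fire" concrete, the following minimal sketch encodes normalized feature magnitudes as spike times (stronger features spike earlier) and feeds them to a single LIF neuron. All names, the linear latency mapping, and the leak, weight, and threshold values are illustrative assumptions, not the configuration of the SIFT-SNN pipeline.

```python
def latency_encode(features, t_max=11):
    """Latency coding: larger feature values (in [0, 1]) spike earlier; zeros never spike."""
    return [None if f <= 0 else round((1.0 - min(f, 1.0)) * (t_max - 1)) for f in features]


def lif_first_spike(spike_times, weights, t_max=11, leak=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron; return its first firing step or None."""
    v = 0.0
    for t in range(t_max):
        v *= leak  # membrane potential decays each step
        # Integrate the weighted input spikes that arrive at this step.
        v += sum(w for s, w in zip(spike_times, weights) if s == t)
        if v >= threshold:
            return t  # the neuron fires
    return None


times = latency_encode([1.0, 0.5, 0.0])
fired_at = lif_first_spike(times, weights=[0.7, 0.7, 0.7])
print(times, fired_at)
```

With these toy values the strongest feature spikes at step 0, the weaker one at step 5, and the neuron crosses its threshold only when the second spike arrives, so at most one spike per input is needed, which is one reason latency codes tend to be sparse.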
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
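The claim that XOR-like relationships cannot be recovered through pairwise alignment alone can be checked on a toy example. The sketch below, which is an illustration and not the ConFu objective, computes mutual information for three binary "modalities" where c = a XOR b: each single modality carries no information about c, while the pair (a, b) determines it completely.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Mutual information in bits between the components of equally likely (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Three toy binary modalities with c = a XOR b, all four (a, b) inputs equally likely.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
mi_a_c = mutual_information([(a, c) for a, b, c in samples])
mi_b_c = mutual_information([(b, c) for a, b, c in samples])
mi_ab_c = mutual_information([((a, b), c) for a, b, c in samples])
print(mi_a_c, mi_b_c, mi_ab_c)
```

The pairwise values are 0 bits and the fused value is 1 bit, which is the kind of dependency a fused-modality contrastive term is meant to capture.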
HTTM: Head-wise Temporal Token Merging for Faster VGGT
The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal token merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens at multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. As a result, HTTM achieves up to 7x acceleration with negligible performance drops in GPU-based inference.
CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.
PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery
Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba's sequential modeling with the Transformer's global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.
comment: 11 pages, 5 figures
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we create a data curation engine covering data acquisition, offline processing and integration, and online loading and weighting. This data engine effectively handles the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. These strategies are flexible and effectively mitigate the computational burden. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
comment: 14 pages, 6 figures
Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
comment: 4 pages, 2 figures
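A minimal sketch of how several reward signals might feed a GRPO-style update: each sampled utterance gets a weighted sum of its reward components, and the advantage of each sample is that reward standardized against the other samples drawn for the same prompt. The component names follow the abstract, but the weights, the reward values, and the use of the population standard deviation are assumptions for illustration only.

```python
from statistics import mean, pstdev


def combine_rewards(components, weights):
    """Weighted sum of named reward components for one sampled utterance."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward against its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights and per-sample reward values for one prompt.
weights = {"intelligibility": 1.0, "speaker_similarity": 1.0,
           "length": 0.5, "entropy": 0.5, "prosody": 1.0}
group = [
    {"intelligibility": 0.9, "speaker_similarity": 0.8, "length": -0.2, "entropy": 0.1, "prosody": 0.6},
    {"intelligibility": 0.7, "speaker_similarity": 0.9, "length": 0.0, "entropy": 0.2, "prosody": 0.2},
    {"intelligibility": 0.5, "speaker_similarity": 0.6, "length": -0.6, "entropy": 0.0, "prosody": 0.1},
]
rewards = [combine_rewards(sample, weights) for sample in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Samples that score above the group mean receive positive advantages and are reinforced, so no separate value network is needed; how the individual rewards are weighted is a design choice the abstract does not detail.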
Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting
Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS' explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.
LaGen: Towards Autoregressive LiDAR Scene Generation
Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.
AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.
Shift-Equivariant Complex-Valued Convolutional Neural Networks WACV 2026
Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.
comment: Accepted to WACV 2026
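The polyphase idea behind APS can be illustrated on a 1D signal. The sketch below is a toy under stated assumptions (circular shifts, stride 2, selection by largest energy, signal length divisible by the stride); it is not the paper's learnable or complex-valued layer, although `abs(v) ** 2` also accepts complex samples. All function names are hypothetical.

```python
def polyphase_components(x, stride=2):
    """Split a signal into its `stride` interleaved sub-signals."""
    return [x[k::stride] for k in range(stride)]

def aps_downsample(x, stride=2):
    """Keep the polyphase component with the largest energy."""
    comps = polyphase_components(x, stride)
    return max(comps, key=lambda c: sum(abs(v) ** 2 for v in c))

def circular_shift(x, s):
    """Rotate a list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)

signal = [0.3, -1.2, 2.5, 0.1, -0.7, 1.9, 0.4, -2.2]
base = aps_downsample(signal)
shifted = aps_downsample(circular_shift(signal, 1))
naive_base = signal[::2]
naive_shifted = circular_shift(signal, 1)[::2]
```

Here `shifted` is a circular shift of `base`, whereas plain strided sampling (`naive_base` versus `naive_shifted`) returns unrelated samples after a one-step shift of the input.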
FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision
Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.
T3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization
Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appear at forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.
From Diffusion to One-Step Generation: A Comparative Study of Flow-Based Models with Application to Image Inpainting
We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.
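The difference between iterative sampling and one-step generation can be seen on a toy ODE. This sketch assumes an analytic velocity field, dx/dt = -x, integrated from t = 0 to t = 1, so that the average velocity has a closed form. It stands in for learned networks and does not reproduce the study's TinyUNet setup or its time convention.

```python
import math

def velocity(x, t):
    """Toy instantaneous velocity field (assumption): dx/dt = -x."""
    return -x

def euler_sample(x0, steps):
    """Iterative sampling: integrate the instantaneous velocity with Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += dt * velocity(x, i * dt)
    return x

def average_velocity(x_r, r, t):
    """Closed-form average velocity over [r, t] for the toy field:
    (x(t) - x(r)) / (t - r), where x(t) = x(r) * exp(-(t - r))."""
    return x_r * (math.exp(-(t - r)) - 1.0) / (t - r)

def one_step_sample(x0):
    """One-step generation: a single jump along the average velocity over [0, 1]."""
    return x0 + (1.0 - 0.0) * average_velocity(x0, 0.0, 1.0)

exact = 2.0 * math.exp(-1.0)
fifty_step = euler_sample(2.0, 50)
single_jump = one_step_sample(2.0)
```

With the instantaneous field, a single Euler step lands far from the target and many steps are needed; with the average velocity, one evaluation reaches the endpoint of the trajectory.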
Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition
Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority of ART over previous state-of-the-art baselines.
BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data
Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering NeurIPS 2025
Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at .
comment: The paper is accepted by NeurIPS 2025
When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.
Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning
Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP's generalization. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.
comment: Technical Report
Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.
comment: Project page: https://grid-ar.github.io/
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
Sound effect editing (modifying audio by adding, removing, or replacing elements) remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based feature gating training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose TEAR, a TEmporal-aware Automated Red-teaming framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach (initial generator training followed by temporal-aware online preference learning) to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. A refinement model is then applied cyclically to improve prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant improvement over the prior best result of 57%.
STAR: Smartphone-analogous Typing in Augmented Reality
While text entry is an essential and frequent task in Augmented Reality (AR) applications, devising an efficient and easy-to-use text entry method for AR remains an open challenge. This research presents STAR, a smartphone-analogous AR text entry technique that leverages a user's familiarity with smartphone two-thumb typing. With STAR, a user performs thumb typing on a virtual QWERTY keyboard that is overlain on the skin of their hands. During an evaluation study of STAR, participants achieved a mean typing speed of 21.9 WPM (i.e., 56% of their smartphone typing speed), and a mean error rate of 0.3% after 30 minutes of practice. We further analyze the major factors implicated in the performance gap between STAR and smartphone typing, and discuss ways this gap could be narrowed.
comment: ACM UIST 2023
Referring Video Object Segmentation with Cross-Modality Proxy Queries
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of ProxyFormer over state-of-the-art methods.
Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that increases computational complexity during training based on measured convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.
comment: Project page: https://github.com/changlin31/Ent-Prog
SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: https://amap-eai.github.io/SocialNav/
DeepRFTv2: Kernel-level Learning for Image Deblurring
It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process at the kernel level can significantly improve image deblurring performance. However, current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from "image" to network-extracted "feature", whose rich semantic and structural information is more suitable for blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur at the kernel level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding with low training memory. Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and shows potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.
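The spatial-to-Fourier conversion mentioned in this abstract rests on the convolution theorem, which can be checked on a small 1D example. The sketch below assumes circular convolution and a naive DFT written with the standard library. It illustrates the identity only and is not the FKE module; all function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), or its inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_convolve(signal, kernel):
    """Blur in the spatial domain: direct circular convolution."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)]

def fourier_convolve(signal, kernel):
    """Same blur via element-wise multiplication of the two spectra."""
    spectrum = [s * k for s, k in zip(dft(signal), dft(kernel))]
    return [v.real for v in dft(spectrum, inverse=True)]

sharp = [0.0, 1.0, 4.0, 2.0, 0.0, 3.0, 1.0, 0.0]
# Blur kernel zero-padded to the signal length and centred on index 0.
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
blurred_spatial = circular_convolve(sharp, kernel)
blurred_fourier = fourier_convolve(sharp, kernel)
```

Both routes give the same blurred signal up to floating-point error, which is why a kernel can be handled as a per-frequency multiplier instead of a sliding window.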
CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We therefore propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
comment: 27 pages, 18 figures, 9 tables. Project page: https://tele-ai.github.io/CtrlVDiff/
Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models
Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.
comment: Project page: https://github.com/changlin31/EntPruner
Deformation-aware Temporal Generation for Early Prediction of Alzheimers Disease
Alzheimer's disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimer's disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images about disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease's progression, facilitating early prediction of Alzheimer's disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into SVM-, CNN-, and 3DCNN-based classification methods, significant improvements were achieved, ranging from 6.21% to 16% in AD vs. NC classification accuracy and from 7.34% to 21.25% in AD vs. MCI vs. NC classification accuracy. The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer's disease, enabling early disease prediction.
comment: 29 pages, 6 figures, one column
FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.
comment: 16 pages, 10 figures
EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens AAAI 2026
Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some priors introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that EM-KD trained model outperforms prior Efficient MLLMs on both accuracy and efficiency with a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
comment: accepted by AAAI 2026
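The token alignment step in the EM-KD abstract above (Manhattan distance between teacher and student vision logits, followed by a one-to-one matching) can be sketched on toy data. This is a minimal illustration under stated assumptions: the logit values and function names are hypothetical, and an exhaustive search over assignments stands in for the Hungarian algorithm, which reaches the same minimum-cost matching in polynomial time.

```python
from itertools import permutations


def manhattan(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def match_tokens(student, teacher):
    """Match each student token to a distinct teacher token at minimum total cost.

    Brute-force search; only practical for tiny inputs. The Hungarian
    algorithm yields the same optimal assignment far more efficiently.
    """
    cost = [[manhattan(s, t) for t in teacher] for s in student]
    best_assignment, best_cost = None, float("inf")
    for perm in permutations(range(len(teacher)), len(student)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best_assignment, best_cost = perm, total
    return list(best_assignment), best_cost


# Hypothetical logits: the teacher keeps 4 vision tokens, the efficient student only 2.
teacher_logits = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8], [0.3, 0.3, 0.4]]
student_logits = [[0.1, 0.1, 0.8], [0.8, 0.2, 0.0]]

assignment, total_cost = match_tokens(student_logits, teacher_logits)
```

Here `assignment[i]` is the index of the teacher token paired with student token `i`; unmatched teacher tokens are simply left out, which is one way to handle the unbalanced token counts.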
Scaling Foundation Models for Radar Scene Understanding
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.
Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction
Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.
CLRecogEye : Curriculum Learning towards exploiting convolution features for Dynamic Iris Recognition
Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in a curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication. GitHub code: https://github.com/GeetanjaliGTZ/CLRecogEye
comment: 12 Pages, 3 figures, ISVC conference 2025
MIRA: Multimodal Iterative Reasoning Agent for Image Editing
Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
comment: Project page: https://xuboshen.github.io/TimeViper; Code: https://github.com/xiaomi-research/timeviper
Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization
In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models observe continuously new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.
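The core idea in this abstract, refining the sampling noise while the model parameters stay frozen, can be shown with a toy stand-in. The sketch below is an assumption throughout: a fixed 2x2 linear map replaces the diffusion model, a squared error against a hypothetical observed frame replaces the real objective, and plain gradient descent replaces whatever optimizer the authors use.

```python
def predict(weights, noise):
    """Frozen toy 'model': a fixed linear map from noise to a predicted frame."""
    return [sum(w * z for w, z in zip(row, noise)) for row in weights]


def loss_and_grad(weights, noise, target):
    """Squared error of the prediction and its gradient with respect to the noise."""
    residual = [p - t for p, t in zip(predict(weights, noise), target)]
    loss = sum(r * r for r in residual)
    grad = [
        2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
        for j in range(len(noise))
    ]
    return loss, grad


frozen_weights = [[1.0, 0.5], [-0.5, 1.0]]  # never updated below
noise = [0.3, -0.7]                          # the only quantity being optimized
observed_frame = [1.0, 2.0]                  # hypothetical ground-truth frame from the stream

initial_loss, _ = loss_and_grad(frozen_weights, noise, observed_frame)
for _ in range(200):
    _, grad = loss_and_grad(frozen_weights, noise, observed_frame)
    noise = [z - 0.1 * g for z, g in zip(noise, grad)]
final_loss, _ = loss_and_grad(frozen_weights, noise, observed_frame)
```

Only `noise` changes across iterations, so the cost per step is a gradient with respect to the input rather than with respect to the full parameter set.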
TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks AAAI 2026
While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) computational inefficiency of processing entire chemical images with non-informative backgrounds; (ii) a narrow scope on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose TinyChemVL, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. Also, we propose ChemRxn-V, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.
comment: Accepted by AAAI 2026
Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation
Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex backgrounds, which is of great significance for maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenarios, especially those with fine structure. In this paper, we propose ELE-SAM, adapting SAM for the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter to achieve better prompt tokens via incorporating global-local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex backgrounds, we design a High-Fidelity Mask Decoder by leveraging multi-granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale and real-world dataset for PTCHS including 44,094 image-mask pairs. Experimental results on ELE-40K demonstrate that ELE-SAM outperforms the baseline model with average improvements of 16.8% mIoU and 20.6% mBIoU. Moreover, compared with the state-of-the-art method on HQSeg-44K, average absolute improvements of 2.9% mIoU and 3.8% mBIoU further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE-SAM.
Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals NeurIPS 2025
Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.
comment: Camera ready version (NeurIPS 2025). Code and interactive demos at https://force-prompting.github.io/
A Gray-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse
Recent advancements in Latent Diffusion Models (LDMs) have revolutionized image synthesis and manipulation, raising significant concerns about data misappropriation and intellectual property infringement. While adversarial attacks have been extensively explored as a protective measure against such misuse of generative AI, current approaches are severely limited by their heavy reliance on model-specific knowledge and substantial computational costs. Drawing inspiration from the posterior collapse phenomenon observed in VAE training, we propose the Posterior Collapse Attack (PCA), a novel framework for protecting images from unauthorized manipulation. Through comprehensive theoretical analysis and empirical validation, we identify two distinct collapse phenomena during VAE inference: diffusion collapse and concentration collapse. Based on this discovery, we design a unified loss function that can flexibly achieve both types of collapse through parameter adjustment, each corresponding to different protection objectives in preventing image manipulation. Our method significantly reduces dependence on model-specific knowledge by requiring access to only the VAE encoder, which constitutes less than 4% of LDM parameters. Notably, PCA achieves prompt-invariant protection by operating on the VAE encoder before text conditioning occurs, eliminating the need for empty prompt optimization required by existing methods. This minimal requirement enables PCA to maintain adequate transferability across various VAE-based LDM architectures while effectively preventing unauthorized image editing. Extensive experiments show PCA outperforms existing techniques in protection effectiveness, computational efficiency (runtime and VRAM), and generalization across VAE-based LDM variants. Our code is available at https://github.com/ZhongliangGuo/PosteriorCollapseAttack.
comment: 15 pages, 9 figures, 9 tables
LMLCC-Net: A Semi-Supervised Deep Learning Model for Lung Nodule Malignancy Prediction from CT Scans using a Novel Hounsfield Unit-Based Intensity Filtering
Lung cancer is the leading cause of patient mortality in the world. Early diagnosis of malignant pulmonary nodules in CT images can have a significant impact on reducing disease mortality and morbidity. In this work, we propose LMLCC-Net, a novel deep learning framework for classifying nodules from CT scan images using a 3D CNN, considering Hounsfield Unit (HU)-based intensity filtering. Benign and malignant nodules have significant differences in their intensity profile of HU, which was not exploited in the literature. Our method considers the intensity pattern as well as the texture for the prediction of malignancies. LMLCC-Net extracts features from multiple branches that each use a separate learnable HU-based intensity filtering stage. Various combinations of branches and learnable ranges of filters were explored to finally produce the best-performing model. In addition, we propose a semi-supervised learning scheme for labeling ambiguous cases and also developed a lightweight model to classify the nodules. The experimental evaluations are carried out on the LUNA16 dataset. Our proposed method achieves a classification accuracy of 91.96%, a sensitivity of 92.94%, and an area under the curve of 94.07%, showing improved performance compared to existing methods. The proposed method can have a significant impact in helping radiologists in the classification of pulmonary nodules and improving patient care.
comment: 12 pages, 9 figures, 7 tables
Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning AAAI 2026
Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
comment: AAAI 2026
Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings
Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experiments were done on English. Moreover, transcriptions were used to get positive and negative pairs for the contrastive loss. This paper introduces a few-shot learning scheme to mine pairs automatically without transcriptions. On English, this results in only a small drop in performance. We also - for the first time - consider VPKL on a real low-resource language, Yoruba. While scores are reasonable, here we see a bigger drop in performance compared to using ground truth pairs because the mining is less accurate in Yoruba.
comment: Accepted at SpeD 2025
Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture
Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R2 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.
MeshCone: Second-Order Cone Programming for Geometrically-Constrained Mesh Enhancement
Modern mesh generation pipelines, whether learning-based or classical, often produce outputs requiring post-processing to achieve production-quality geometry. This work introduces MeshCone, a convex optimization framework for guided mesh refinement that leverages reference geometry to correct deformed or degraded meshes. We formulate the problem as a second-order cone program where vertex positions are optimized to align with target geometry while enforcing smoothness through convex edge-length regularization. MeshCone performs geometry-aware optimization that preserves fine details while correcting structural defects. We demonstrate robust performance across 56 diverse object categories from ShapeNet and ThreeDScans, achieving superior refinement quality compared to Laplacian smoothing and unoptimized baselines while maintaining sub-second inference times. MeshCone is particularly suited for applications where reference geometry is available, such as mesh-from-template workflows, scan-to-CAD alignment, and quality assurance in asset production pipelines.
SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation NeurIPS 2025
Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions--short, clear noun phrases like "red car" or "left girl". This simplification often reduces RIS to a key word/concept matching problem, limiting the model's ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process--first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba's scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.
comment: NeurIPS 2025; Project page: https://zhenjiemao.github.io/SaFiRe/
Comparison of Generative Learning Methods for Turbulence Surrogates
Numerical simulations of turbulent flows present significant challenges in fluid dynamics due to their complexity and high computational cost. High resolution techniques such as Direct Numerical Simulation (DNS) and Large Eddy Simulation (LES) are generally not computationally affordable, particularly for technologically relevant problems. Recent advances in machine learning, specifically in generative probabilistic models, offer promising alternatives as surrogates for turbulence. This paper investigates the application of three generative models - Variational Autoencoders (VAE), Deep Convolutional Generative Adversarial Networks (DCGAN), and Denoising Diffusion Probabilistic Models (DDPM) - in simulating a von Kármán vortex street around a fixed cylinder projected into 2D, as well as a real-world experimental dataset of the wake flow of a cylinder array. Training data was obtained by means of LES in the simulated case and Particle Image Velocimetry (PIV) in the experimental case. We evaluate each model's ability to capture the statistical properties and spatial structures of the turbulent flow. Our results demonstrate that DDPM and DCGAN effectively replicate all flow distributions, highlighting their potential as efficient and accurate tools for turbulence surrogacy. We find a strong argument for DCGAN: although they are more difficult to train (due to problems such as mode collapse), they show the fastest inference and training time, require less data to train compared to VAE and DDPM, and provide the results most closely aligned with the input stream. In contrast, VAE train quickly (and can generate samples quickly) but do not produce adequate results, and DDPM, whilst effective, are significantly slower at both inference and training time.
DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures
The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret these as continuous density fields to optimize and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method improves stability while maintaining high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and are self-supporting.
CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation
In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.
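As an illustration of the Fisher-guided selection idea in the abstract above, here is a minimal sketch, not the authors' implementation. It assumes the empirical Fisher score of a module is the mean squared gradient of its parameters, averaged over mini-batches, and that the top-k modules are activated. The function names, module names, and gradient values are all hypothetical.

```python
def fisher_scores(grad_batches):
    """Average the mean squared gradient of each module over mini-batches.

    grad_batches: list of dicts mapping a module name to a flat list of
    gradient values for one mini-batch (hypothetical input format).
    """
    totals = {}
    for grads in grad_batches:
        for name, values in grads.items():
            mean_sq = sum(v * v for v in values) / len(values)
            totals[name] = totals.get(name, 0.0) + mean_sq
    return {name: total / len(grad_batches) for name, total in totals.items()}


def select_modules(scores, k):
    """Return the k module names with the largest Fisher score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up gradients for three candidate modules over two mini-batches.
example_batches = [
    {"spatial": [0.1, -0.2], "semantic": [1.0, -1.0], "frequency": [0.5, 0.5]},
    {"spatial": [0.0, 0.1], "semantic": [0.8, -1.2], "frequency": [0.4, -0.6]},
]
example_scores = fisher_scores(example_batches)
active_modules = select_modules(example_scores, k=2)
```

In this toy setting the modules with the largest squared gradients are kept and the rest would stay frozen; how the paper assigns modules to layers is not specified in the abstract and is not modeled here.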
Probabilistic Robustness for Free? Revisiting Training via a Benchmark
Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares the most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 222 trained models across 7 datasets and 10 model architectures is publicly available at https://tmpspace.github.io/PRBenchLeaderboard/.
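The definition of probabilistic robustness given above (the probability that a prediction stays correct under stochastic perturbations) can be estimated by Monte Carlo sampling. The sketch below assumes uniform noise in an L-infinity ball of radius `eps`; the noise model, the function names, and the toy classifier are illustrative assumptions, not the PRBench protocol.

```python
import random


def estimate_pr(predict, x, label, eps, n_samples=1000, seed=0):
    """Monte Carlo estimate of P[predict(x + delta) == label].

    delta is drawn uniformly from [-eps, eps] per coordinate
    (an assumed perturbation distribution).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        perturbed = [xi + rng.uniform(-eps, eps) for xi in x]
        if predict(perturbed) == label:
            correct += 1
    return correct / n_samples


def toy_predict(x):
    """Hypothetical classifier: class 1 if the feature sum is positive."""
    return 1 if sum(x) > 0 else 0


# Far from the decision boundary: every perturbed input keeps its label.
pr_far = estimate_pr(toy_predict, [1.0, 1.0], label=1, eps=0.1)
# On the decision boundary: roughly half of the perturbations flip the label.
pr_boundary = estimate_pr(toy_predict, [0.0, 0.0], label=1, eps=0.5)
```

Unlike adversarial robustness, which asks whether any perturbation in the ball changes the prediction, this estimate only reports the fraction of sampled perturbations that leave it unchanged.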
Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation AAAI 2026
Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed to overcome RS artifacts. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple-yet-effective approaches enable Earth-Adapter to more efficiently overcome the disturbances caused by artifacts than previous PEFT methods, significantly enhancing the FMs' performance on RS scenarios. Experiments on Domain Adaptation (DA) and Domain Generalization (DG) semantic segmentation benchmarks showcase Earth-Adapter's effectiveness. Compared with the baseline Rein, Earth-Adapter improves mIoU by 9.0% on DA and 3.1% on DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth-Adapter.
comment: AAAI 2026 camera ready
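The Earth-Adapter abstract describes splitting features into frequency components with a DFT and recombining them with learned weights. The sketch below shows one way this could look on a single 2D feature map, assuming a circular low-pass mask and a softmax over router logits. The cutoff, the function names, and the router logits are hypothetical, and the adapters themselves are omitted.

```python
import numpy as np


def split_frequencies(feature, cutoff):
    """Split a 2D feature map into low- and high-frequency parts via the DFT."""
    h, w = feature.shape
    # Move the zero-frequency term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low_mask = dist <= cutoff  # keep frequencies within `cutoff` of the centre
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = feature - low  # the residual carries the remaining frequencies
    return low, high


def mix_components(components, router_logits):
    """Combine components with softmax weights (stand-in for a learned router)."""
    logits = np.asarray(router_logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return sum(w * c for w, c in zip(weights, components))


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8))
low_part, high_part = split_frequencies(feature_map, cutoff=2)
mixed = mix_components([low_part, high_part], router_logits=[0.0, 0.0])
```

With equal router logits the mixture is simply the average of the two components; a trained router would instead shift weight away from the component that carries the artifacts.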
Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering ICCV 2025
Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models are available at https://github.com/LeoQLi/LGSF.
comment: Accepted by ICCV 2025. Project page: https://leoqli.github.io/LGSF/
Adaptive Object Detection for Indoor Navigation Assistance: A Performance Evaluation of Real-Time Algorithms
This study addresses the need for accurate and efficient object detection in assistive technologies for visually impaired individuals. We evaluate four real-time object detection algorithms (YOLO, SSD, Faster R-CNN, and Mask R-CNN) within the context of indoor navigation assistance. Using the Indoor Objects Detection dataset, we analyze detection accuracy, processing speed, and adaptability to indoor environments. Our findings highlight the trade-offs between precision and efficiency, offering insights into selecting optimal algorithms for real-time assistive navigation. This research advances adaptive machine learning applications, enhancing indoor navigation solutions for the visually impaired and promoting accessibility.
comment: 5 pages, 2 figures, 3 tables
SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
comment: Project page: https://pokerman8.github.io/SKEL-CF/
Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training
Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.
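The abstract names three cues for the per-instance budget but not how they are combined. The sketch below assumes each cue is mapped to a score in [0, 1], the scores are averaged with fixed weights, and the result is scaled into a range [eps_min, eps_max]. The mapping functions, the weights, the default range, and the direction (confident, stable samples receive a larger budget) are illustrative assumptions rather than the DES formulation.

```python
import math


def softmax_entropy(probs):
    """Shannon entropy of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dynamic_epsilon(margin, probs, mc_variance,
                    eps_min=2 / 255, eps_max=8 / 255,
                    weights=(1 / 3, 1 / 3, 1 / 3)):
    """Per-instance perturbation budget from three cues (assumed combination).

    margin: non-negative proxy for the distance to the decision boundary.
    probs: softmax output for the instance.
    mc_variance: predictive variance from Monte Carlo dropout passes.
    """
    margin_score = margin / (1.0 + margin)
    normalized_entropy = softmax_entropy(probs) / math.log(len(probs))
    confidence_score = max(0.0, 1.0 - normalized_entropy)
    certainty_score = 1.0 / (1.0 + mc_variance)
    w_margin, w_conf, w_cert = weights
    score = (w_margin * margin_score
             + w_conf * confidence_score
             + w_cert * certainty_score)
    return eps_min + score * (eps_max - eps_min)


# A confident sample far from the boundary versus an uncertain one near it.
eps_easy = dynamic_epsilon(margin=5.0, probs=[0.98, 0.01, 0.01], mc_variance=0.0)
eps_hard = dynamic_epsilon(margin=0.1, probs=[0.34, 0.33, 0.33], mc_variance=1.0)
```

Because the weights sum to one and each score lies in [0, 1], the returned budget always stays inside the chosen range, which keeps the schedule bounded across training iterations.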
Decorrelation Speeds Up Vision Transformers
Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label data regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. To mimic constrained-data scenarios, we evaluate our approach on ImageNet-1K pre-training and ADE20K fine-tuning using randomly sampled subsets of each dataset. Under this setting, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training. Keywords: Deep learning, Vision transformers, Efficient AI, Decorrelation
comment: 16 pages, 12 figures, submitted to CVC 2026
Geometrically Regularized Transfer Learning with On-Manifold and Off-Manifold Perturbation
Transfer learning under domain shift remains a fundamental challenge due to the divergence between source and target data manifolds. In this paper, we propose MAADA (Manifold-Aware Adversarial Data Augmentation), a novel framework that decomposes adversarial perturbations into on-manifold and off-manifold components to simultaneously capture semantic variation and model brittleness. We theoretically demonstrate that enforcing on-manifold consistency reduces hypothesis complexity and improves generalization, while off-manifold regularization smooths decision boundaries in low-density regions. Moreover, we introduce a geometry-aware alignment loss that minimizes geodesic discrepancy between source and target manifolds. Experiments on DomainNet, VisDA, and Office-Home show that MAADA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.
Disentangled Geometric Alignment with Adaptive Contrastive Perturbation for Reliable Domain Transfer
Despite progress in geometry-aware domain adaptation, current methods such as GAMA still suffer from two unresolved issues: (1) insufficient disentanglement of task-relevant and task-irrelevant manifold dimensions, and (2) rigid perturbation schemes that ignore per-class alignment asymmetries. To address this, we propose GAMA++, a novel framework that introduces (i) latent space disentanglement to isolate label-consistent manifold directions from nuisance factors, and (ii) an adaptive contrastive perturbation strategy that tailors both on- and off-manifold exploration to class-specific manifold curvature and alignment discrepancy. We further propose a cross-domain contrastive consistency loss that encourages local semantic clusters to align while preserving intra-domain diversity. Our method achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under both standard and few-shot settings, with notable improvements in class-level alignment fidelity and boundary robustness. GAMA++ sets a new standard for semantic geometry alignment in transfer learning.
One-Step Diffusion-Based Image Compression with Semantic Distillation NeurIPS 2025
While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 39% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Project: https://onedc-codec.github.io/
comment: Accepted by NeurIPS 2025
VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment NeurIPS 2025
3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.
comment: Accepted by NeurIPS 2025
Stream and Query-guided Feature Aggregation for Efficient and Effective 3D Occupancy Prediction
3D occupancy prediction has become a key perception task in autonomous driving, as it enables comprehensive scene understanding. Recent methods enhance this understanding by incorporating spatiotemporal information through multi-frame fusion, but they suffer from a trade-off: dense voxel-based representations provide high accuracy at significant computational cost, whereas sparse representations improve efficiency but lose spatial detail. To mitigate this trade-off, we introduce DuOcc, which employs a dual aggregation strategy that retains dense voxel representations to preserve spatial fidelity while maintaining high efficiency. DuOcc consists of two key components: (i) Stream-based Voxel Aggregation, which recurrently accumulates voxel features over time and refines them to suppress warping-induced distortions, preserving a clear separation between occupied and free space. (ii) Query-guided Aggregation, which complements the limitations of voxel accumulation by selectively injecting instance-level query features into the voxel regions occupied by dynamic objects. Experiments on the widely used Occ3D-nuScenes and SurroundOcc datasets demonstrate that DuOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by over 40% compared to prior methods.
Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
In many robotics and VR/AR applications, fast camera motions lead to a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
comment: Project page: https://jerredchen.github.io/image-as-imu/
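The linear least-squares step mentioned in the abstract can be illustrated with the classical instantaneous motion-field equations, which relate per-pixel flow, depth, and the camera's linear and angular velocity. The sketch below assumes a static scene, normalized image coordinates, and the sign convention that a scene point moves with velocity -v - w x X in the camera frame. It is not the authors' pipeline, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def flow_rows(x, y, depth):
    """Two rows of the motion-field matrix for one pixel.

    Columns correspond to (vx, vy, vz, wx, wy, wz).
    """
    return np.array([
        [-1.0 / depth, 0.0, x / depth, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / depth, y / depth, 1.0 + y * y, -x * y, -x],
    ])


def estimate_velocity(points, depths, flows):
    """Solve for linear and angular velocity in the least-squares sense."""
    a = np.vstack([flow_rows(x, y, z) for (x, y), z in zip(points, depths)])
    b = np.asarray(flows, dtype=float).reshape(-1)  # [u0, v0, u1, v1, ...]
    solution, *_ = np.linalg.lstsq(a, b, rcond=None)
    return solution[:3], solution[3:]


# Synthetic check: generate flow from a known velocity, then recover it.
rng = np.random.default_rng(0)
points = rng.uniform(-0.5, 0.5, size=(50, 2))
depths = rng.uniform(1.0, 5.0, size=50)
true_velocity = np.array([0.2, -0.1, 0.5, 0.05, -0.02, 0.1])
flows = np.array([flow_rows(x, y, z) @ true_velocity
                  for (x, y), z in zip(points, depths)])
linear, angular = estimate_velocity(points, depths, flows)
```

Each pixel contributes two equations in six unknowns, so a dense flow field and depth map give a heavily overdetermined system, which is what makes the least-squares estimate tolerant of per-pixel noise.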
Unsupervised Segmentation by Diffusing, Walking and Cutting WACV
We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. Inspired by classic spectral clustering approaches, we construct adjacency matrices from self-attention layers between image patches and recursively partition using Normalised Cuts. A key insight is that self-attention probability distributions, which capture semantic relations between patches, can be interpreted as a transition matrix for random walks across the image. We leverage this by first using Random Walk Normalized Cuts directly on these self-attention activations to partition the image, minimizing transition probabilities between clusters while maximizing coherence within clusters. Applied recursively, this yields a hierarchical segmentation that reflects the rich semantics in the pre-trained attention layers, without any additional training. Next, we explore other ways to build the NCuts adjacency matrix from features, and how we can use the random walk interpretation of self-attention to capture long-range relationships. Finally, we propose an approach to automatically determine the NCut cost criterion, avoiding the need to tune this manually. We quantitatively analyse the effect of incorporating different features, a constant versus dynamic NCut threshold, and incorporating multi-node paths when constructing the NCuts adjacency matrix. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.
comment: Accepted to The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
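The random-walk view of self-attention described above can be illustrated with a single spectral bipartition. The sketch below assumes the attention matrix is symmetrized before use and that patches are split by the sign of the second eigenvector of the resulting random-walk matrix. The symmetrization, the zero threshold, and the toy attention matrix are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np


def random_walk_ncut(attention):
    """Bipartition patches from a patch-to-patch attention matrix.

    Returns a boolean label per patch. The eigenvectors of the random-walk
    matrix D^-1 W are obtained from the symmetric matrix D^-1/2 W D^-1/2.
    """
    w = 0.5 * (attention + attention.T)  # assumed symmetrization
    degree = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    sym = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(sym)  # eigenvalues in ascending order
    second = d_inv_sqrt * eigvecs[:, -2]  # second-largest eigenvector of D^-1 W
    return second > 0.0


# Toy attention over six patches: two groups that mostly attend within-group.
within, across = 0.3, 0.1 / 3.0
toy_attention = np.full((6, 6), across)
toy_attention[:3, :3] = within
toy_attention[3:, 3:] = within
labels = random_walk_ncut(toy_attention)
```

Applying the same split recursively to each resulting group would give a hierarchical segmentation of the kind the abstract describes; a stopping rule based on the cut cost is left out here.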
ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation
Text-to-image diffusion models have recently become highly capable, yet their behavior in multi-object scenes remains unreliable: models often produce an incorrect number of instances and exhibit semantics leaking across objects. We trace these failures to vague instance boundaries; self-attention already reveals instance layouts early in the denoising process, but existing approaches act only on semantic signals. We introduce ISAC (Instance-to-Semantic Attention Control), a training-free, model-agnostic objective that performs hierarchical attention control by first carving out instance layouts from self-attention and then binding semantics to these instances. In Phase 1, ISAC clusters self-attention into the number of instances and repels overlaps, establishing an instance-level structural hierarchy; in Phase 2, it injects these instance cues into cross-attention to obtain instance-aware semantic masks and decomposes mixing semantics by tying attributes within each instance. ISAC yields consistent gains on T2I-CompBench, HRS-Bench, and IntraCompBench, our new benchmark for intra-class compositions where failures are most frequent, with improvements of at least 50% in multi-class accuracy and 7% in multi-instance accuracy on IntraCompBench, without any fine-tuning or external models. Beyond text-to-image setups, ISAC also strengthens layout-to-image controllers under overlapping boxes by refining coarse box layouts into dense instance masks, indicating that hierarchical decoupling of instance formation and semantic assignment is a key principle for robust, controllable multi-object generation. Code will be released upon publication.
comment: 36 pages
Think Visually, Reason Textually: Vision-Language Synergy in ARC
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.
Restoration-Oriented Video Frame Interpolation with Region-Distinguishable Priors from SAM
In existing restoration-oriented Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution that uses open-world segmentation models, e.g., SAM2 (Segment Anything Model 2), on frames to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatially varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our plug-and-play Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI's encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI's encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.
comment: Code will be released
Towards Consistent and Controllable Image Synthesis for Face Editing
Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and preserving the consistency of other unchanged attributes, especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves combining the target background, the identity, and the face attributes to be edited. We strive to sufficiently disentangle the control of these factors to enable consistency of face editing. Specifically, our method, coined as RigFace, contains: 1) A Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.
Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories NeurIPS 2025
Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation for the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute "has four legs" is common to both "dogs" and "chairs". To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
comment: Accepted at NeurIPS 2025 Workshop: CauScien - Uncovering Causality in Science and NeurIPS 2025 Workshop: Reliable ML from Unreliable Data
Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization
Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. However, most existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules, including the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module utilizes a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, consequently improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to enhance the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.
From Limited Labels to Open Domains: An Efficient Learning Method for Drone-view Geo-Localization
Traditional supervised drone-view geo-localization (DVGL) methods heavily depend on paired training data and encounter difficulties in learning cross-view correlations from unpaired data. Moreover, when deployed in a new domain, these methods require collecting new paired data and retraining for model adaptation, which significantly increases computational overhead. Existing unsupervised methods generate pseudo-labels based on cross-view similarity to infer the pairing relationships. However, geographical similarity and spatial continuity often produce visually analogous features at different geographical locations. This feature confusion compromises the reliability of pseudo-label generation, and incorrect pseudo-labels drive negative optimization. Given these challenges inherent in both supervised and unsupervised DVGL methods, we propose a novel cross-domain invariant knowledge transfer network (CDIKTNet) with limited supervision, whose architecture consists of a cross-domain invariance sub-network (CDIS) and a cross-domain transfer sub-network (CDTS). This architecture forms a closed-loop framework for invariant feature learning and knowledge transfer. The CDIS is designed to learn cross-view structural and spatial invariance from a small amount of paired data that serves as prior knowledge. It endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion. Based on this, the CDTS employs dual-path contrastive learning to further optimize each subspace while preserving consistency in a shared feature space. Extensive experiments demonstrate that CDIKTNet achieves state-of-the-art performance compared with supervised methods under full supervision, and further surpasses existing unsupervised methods in both few-shot and cross-domain initialization.
GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction NeurIPS 2025
Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors and ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition
Prototypical part network (ProtoPNet) has drawn wide attention and boosted many follow-up studies due to its self-explanatory property for explainable artificial intelligence (XAI). However, when directly applying ProtoPNet on vision transformer (ViT) backbones, learned prototypes have a "distraction" problem: they have a relatively high probability of being activated by the background and pay less attention to the foreground. The powerful capability of modeling long-term dependency makes the transformer-based ProtoPNet hard to focus on prototypical parts, thus severely impairing its inherent interpretability. This paper proposes prototypical part transformer (ProtoPFormer) for appropriately and effectively applying the prototype-based method with ViTs for interpretable image recognition. The proposed method introduces global and local prototypes for capturing and highlighting the representative holistic and partial features of targets according to the architectural characteristics of ViTs. The global prototypes are adopted to provide the global view of objects to guide local prototypes to concentrate on the foreground while eliminating the influence of the background. Afterwards, local prototypes are explicitly supervised to concentrate on their respective prototypical visual parts, increasing the overall interpretability. Extensive experiments demonstrate that our proposed global and local prototypes can mutually correct each other and jointly make final decisions, which faithfully and transparently reason the decision-making processes associatively from the whole and local perspectives, respectively. Moreover, ProtoPFormer consistently achieves superior performance and visualization results over the state-of-the-art (SOTA) prototype-based baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
comment: Arxiv preprint; 18 pages, 12 figures, 7 tables
AMLP: Adjustable Masking Lesion Patches for Self-Supervised Medical Image Segmentation
Self-supervised masked image modeling (MIM) methods have shown promising performance on natural images. However, directly applying such methods to medical image segmentation tasks still cannot achieve satisfactory results. The challenges arise from the facts that (i) medical images are inherently more complex than natural images, and the subjects in medical images often exhibit more distinct contour features; and (ii) the conventional high and fixed masking ratio in MIM is likely to mask the background, limiting the scope of learnable information. To address these problems, we propose a new self-supervised medical image segmentation framework, called Adjustable Masking Lesion Patches (AMLP), which employs a Masked Patch Selection (MPS) strategy to identify patches with high probabilities of containing lesions and help the model achieve precise lesion reconstruction. To improve the categorization of patches in MPS, we further introduce a Relative Reconstruction Loss (RRL) to better learn hard-to-reconstruct lesion patches. Then, a Category Consistency Loss (CCL) is proposed to refine patch categorization based on reconstruction difficulty, enhancing the difference between lesions and backgrounds. Moreover, an Adjustable Masking Ratio (AMR) strategy is proposed to gradually increase the masking ratio over training to expand the scope of learnable mutual information. Extensive experiments on two medical segmentation datasets demonstrate the superior performance of the proposed AMLP with respect to the SOTA self-supervised methods; the results show that AMLP effectively addresses the challenges of applying masked modeling to medical images and captures accurate lesion details that are crucial for segmentation tasks.
comment: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
DWFF-Net: A Multi-Scale Farmland System Habitat Identification Method with Adaptive Dynamic Weight
Cultivated land ecosystems currently lack a standardized habitat classification system, coverage of habitat types is incomplete, and existing models cannot effectively integrate semantic and texture features, which results in insufficient segmentation accuracy and blurred boundaries for multi-scale habitats (e.g., large-scale field plots and micro-habitats). To address these issues, this study developed a comprehensively annotated ultra-high-resolution remote sensing image dataset encompassing 15 categories of cultivated land system habitats. Furthermore, we propose a Dynamic-Weighted Feature Fusion Network (DWFF-Net). The encoder of this model utilizes a frozen-parameter DINOv3 to extract foundational features. By analyzing the relationships between different category images and feature maps, we introduce a data-level adaptive dynamic weighting strategy for feature fusion. The decoder incorporates a dynamic weight computation network to achieve thorough integration of multi-layer features, and a hybrid loss function is adopted to optimize model training. Experimental results on the constructed dataset demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 69.79% and an F1-score of 80.49%, outperforming the baseline network by 2.1% and 1.61%, respectively. Ablation studies further confirm the complementary nature of multi-layer feature fusion, which effectively improves the IoU for micro-habitat categories such as field ridges. This study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at a low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes. The complete code repository is available at https://github.com/sysau/DWFF-Net.
comment: 30 pages, 13 figures
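The DWFF-Net abstract reports mIoU and F1-score for a segmentation model. As a minimal sketch of how such metrics are commonly derived from a confusion matrix, the snippet below computes per-class IoU and F1 and then averages them. The macro-averaging convention, the function names, and the toy 2-class matrix are assumptions for illustration; the abstract does not state how the authors aggregate their scores.

```python
def per_class_iou_f1(confusion):
    """Return (ious, f1s) per class.

    confusion[i][j] = number of pixels whose true class is i and
    whose predicted class is j (hypothetical toy layout).
    """
    n = len(confusion)
    ious, f1s = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true other
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return ious, f1s


def mean(values):
    return sum(values) / len(values)


# Toy 2-class example (made-up counts, not data from the paper).
confusion = [
    [8, 2],
    [1, 9],
]
ious, f1s = per_class_iou_f1(confusion)
miou, mean_f1 = mean(ious), mean(f1s)
```

For a single class, F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU; this is consistent with the reported F1-score sitting above the reported mIoU.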
Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining
We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks' training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features, and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that, like ours, use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.
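The FLYT abstract describes Soft Cap Sampling only at a high level: draw examples from per-example probabilities and penalize repeats. The sketch below illustrates that general idea under stated assumptions; the multiplicative `penalty`, the function name, and the one-at-a-time sampling loop are hypothetical choices, not the paper's actual procedure.

```python
import random


def soft_cap_sample(probs, num_samples, penalty=0.5, seed=0):
    """Draw num_samples indices, down-weighting an example each time it is drawn.

    probs:    per-example sampling weights (assumed non-negative, not all zero)
    penalty:  hypothetical multiplicative repetition penalty in [0, 1];
              1.0 means plain sampling with replacement,
              0.0 means an example can be drawn at most once.
    Returns a list with the number of times each example was selected.
    """
    rng = random.Random(seed)
    weights = list(probs)
    counts = [0] * len(weights)
    for _ in range(num_samples):
        idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        counts[idx] += 1
        weights[idx] *= penalty  # repeated picks become progressively less likely
    return counts


# With penalty=0.0 every example is selected exactly once, regardless of its score.
counts_hard_cap = soft_cap_sample([0.7, 0.2, 0.1], num_samples=3, penalty=0.0)
# With an intermediate penalty, high-scoring examples may repeat, but less often.
counts_soft_cap = soft_cap_sample([0.7, 0.2, 0.1], num_samples=6, penalty=0.5)
```

The penalty interpolates between sampling with replacement (high-scoring examples dominate) and a hard cap of one copy per example.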
Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy
Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee that the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy. We leverage a pre-trained 2D diffusion model and a 3D diffusion model via a process that synchronizes the two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model has strong generalization ability to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of the 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments on image-based object and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate strong generalization ability to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released at https://yuxuan-xue.com/gen-3diffusion.
comment: Accepted to Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). Project Page: https://yuxuan-xue.com/gen-3diffusion. arXiv admin note: substantial text overlap with arXiv:2406.08475
ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models WACV2026
In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.
comment: Accepted to WACV2026. Project website: https://yuxuan-xue.com/controlevents/
SARVLM: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery
Synthetic Aperture Radar (SAR) is a crucial imaging modality thanks to its all-weather capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these methods largely emphasize low-level visual features and often overlook multimodal alignment and zero-shot target recognition in SAR imagery. To address this, we construct SARVLM-1M, a large-scale vision-language dataset with over one million image-text pairs aggregated from existing datasets. We further propose a domain transfer training strategy to mitigate the large gap between natural and SAR imagery. Building on this, we develop SARVLM, the first vision language foundation model (VLM) tailored to SAR, comprising SARCLIP and SARCap. SARVLM is trained with a vision-language contrastive objective under the proposed domain transfer strategy, bridging SAR imagery and textual descriptions. Extensive experiments on image text retrieval, zero-shot classification, semantic localization, and imagery captioning demonstrate that SARVLM delivers superior feature extraction and interpretation, outperforming state-of-the-art VLMs and advancing SAR semantic understanding. Code and datasets will be released soon.
comment: 11 pages, 9 figures
Active Negative Loss: A Robust Framework for Learning with Noisy Labels
Deep supervised learning has achieved remarkable success across a wide range of tasks, yet it remains susceptible to overfitting when confronted with noisy labels. To address this issue, noise-robust loss functions offer an effective solution for enhancing learning in the presence of label noise. In this work, we systematically investigate the limitation of the recently proposed Active Passive Loss (APL), which employs Mean Absolute Error (MAE) as its passive loss function. Despite the robustness brought by MAE, one of its key drawbacks is that it pays equal attention to clean and noisy samples; this feature slows down convergence and potentially makes training difficult, particularly in large-scale datasets. To overcome these challenges, we introduce a novel loss function class, termed Normalized Negative Loss Functions (NNLFs), which serve as passive loss functions within the APL framework. NNLFs effectively address the limitations of MAE by concentrating more on memorized clean samples. By replacing MAE in APL with our proposed NNLFs, we enhance APL and present a new framework called Active Negative Loss (ANL). Moreover, in non-symmetric noise scenarios, we propose an entropy-based regularization technique to mitigate the vulnerability to the label imbalance. Extensive experiments demonstrate that the new loss functions adopted by our ANL framework can achieve better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks. The source code is available at: https://github.com/Virusdoll/Active-Negative-Loss.
comment: This work has been submitted to the IEEE for possible publication
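The Active Negative Loss abstract says MAE is robust but pays equal attention to clean and noisy samples. The sketch below illustrates both points for a softmax output with a one-hot target: MAE summed over all candidate labels is a constant (the usual "symmetric loss" condition for noise tolerance), and its gradient with respect to the labeled-class probability is the same for every sample, unlike cross-entropy. It does not implement the paper's NNLFs; all function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mae_loss(probs, label):
    """L1 distance between the predicted distribution and the one-hot label."""
    return sum(abs(p - (1.0 if k == label else 0.0)) for k, p in enumerate(probs))


def cross_entropy(probs, label):
    return -math.log(probs[label])


def mae_grad_wrt_label_prob(probs, label):
    # MAE equals 2 * (1 - p_label), so d(MAE)/d(p_label) is -2 for every sample.
    return -2.0


def ce_grad_wrt_label_prob(probs, label):
    # d(-log p_label)/d(p_label) = -1 / p_label: large when the sample is poorly fit.
    return -1.0 / probs[label]


probs = softmax([2.0, 0.5, -1.0])
num_classes = len(probs)

# Symmetry: summing MAE over every possible label gives the constant 2 * (K - 1).
mae_sum = sum(mae_loss(probs, k) for k in range(num_classes))
ce_sum = sum(cross_entropy(probs, k) for k in range(num_classes))  # depends on probs
```

Because the MAE gradient does not depend on how well a sample is currently fit, well-learned clean samples and mislabeled samples receive the same weight, which is the slow-convergence issue the abstract attributes to MAE.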
LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and inherent challenges in metric perception, existing approaches struggle to achieve human pose and metric 3D position estimation through a unified module. To address this limitation, MetricHMSR incorporates camera rays to comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. We then propose a Human Mixture-of-Experts (MoE), in which the model dynamically routes image features and ray features to task-specific experts for specialized understanding of different data aspects, enabling a unified framework that simultaneously perceives the local pose and the global 3D position. Based on these results, we further refine the existing monocular metric depth estimation method to achieve more accurate results, ultimately enabling the seamless overlay of humans and scenes in 3D space. Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh and scene recovery.
Restora-Flow: Mask-Guided Image Restoration with Flow Matching WACV 2026
Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.
comment: Accepted for WACV 2026
Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video
We target passive dementia screening from short camera-facing talking-head video, developing a facial temporal micro-dynamics analysis for language-free detection of early neurocognitive change. This enables unscripted, in-the-wild video analysis at scale to capture natural facial behaviors, transferable across devices, topics, and cultures without active intervention by clinicians or researchers during recording. Most existing resources prioritize speech or scripted interviews, limiting use outside clinics and coupling predictions to language and transcription. In contrast, we analyze whether temporal facial kinematics, including blink dynamics, small mouth-jaw motions, gaze variability, and subtle head adjustments, are sufficient for dementia screening without speech or text. By stabilizing facial signals, we convert these micro-movements into interpretable facial micro-dynamic time series, smooth them, and summarize short windows into compact clip-level statistics for screening. Each window is encoded by its activity mix (the relative share of motion across streams); thus the predictor analyzes the distribution of motion across streams rather than its magnitude, making per-channel effects transparent. We also introduce YT DemTalk, a new dataset curated from publicly available, in-the-wild camera-facing videos. It contains 300 clips (150 with self-reported dementia, 150 controls), which we use to test our model and provide a first benchmark on the corpus. On YT DemTalk, ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and lightweight shallow classifiers attain an AUROC of 0.953, an Average Precision (AP) of 0.961, an F1-score of 0.851, and an accuracy of 0.857.
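The "activity mix" in the abstract above is defined as the relative share of motion across streams. A minimal sketch of that normalization is shown below, assuming each window holds per-frame motion magnitudes for a few named streams. The stream names, the absolute-sum energy measure, and the uniform fallback for motionless windows are illustrative assumptions rather than details from the paper.

```python
def activity_mix(window):
    """Map a window of per-stream motion magnitudes to shares that sum to 1.

    window: dict of stream name -> list of per-frame motion values
            (hypothetical layout for illustration).
    """
    energy = {name: sum(abs(v) for v in values) for name, values in window.items()}
    total = sum(energy.values())
    if total == 0:
        # Assumed convention: a motionless window gets a uniform mix.
        return {name: 1.0 / len(energy) for name in energy}
    return {name: e / total for name, e in energy.items()}


# Toy window with made-up values.
window = {
    "blink": [1.0, 1.0],
    "mouth_jaw": [2.0, 0.0],
    "gaze": [3.0, 1.0],
    "head": [0.0, 0.0],
}
mix = activity_mix(window)

# Scaling every stream by the same factor leaves the mix unchanged,
# so the encoding reflects how motion is distributed, not how large it is.
louder = {name: [10.0 * v for v in values] for name, values in window.items()}
mix_louder = activity_mix(louder)
```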
UniChange: Unifying Change Detection with Multimodal Large Language Model
Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images: we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models: local edits using eight SOTA diffusion models; 3) Multi-turn editing: each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios: a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
comment: 16 pages, 10 figures; typos corrected, references added
Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis
Surgical video segmentation is critical for AI to interpret spatial-temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero-shot capability of SAM2 on 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (points, boxes, mask) and finetuning (dense, sparse) strategies, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero-shot adaptability in structured scenarios (e.g., instrument segmentation, multi-organ segmentation, and scene segmentation), its performance varies under dynamic surgical conditions, highlighting gaps in handling temporal coherence and domain-specific artifacts. These results highlight future pathways to adaptive data-efficient solutions for the surgical data science field.
XYZCylinder: Towards Compatible Feed-Forward 3D Gaussian Splatting for Driving Scenes via Unified Cylinder Lifting Method
Feed-forward paradigms for 3D reconstruction have become a focus of recent research; they learn implicit, fixed view transformations to generate a single scene representation. However, their application to complex driving scenes reveals significant limitations. Two core challenges are responsible for this performance gap. First, the reliance on a fixed view transformation hinders compatibility with varying camera configurations. Second, the inherent difficulty of learning complex driving scenes from sparse 360° views with minimal overlap compromises the final reconstruction fidelity. To handle these difficulties, we introduce XYZCylinder, a novel method built upon a unified cylinder lifting method that integrates camera modeling and feature lifting. To tackle the compatibility problem, we design a Unified Cylinder Camera Modeling (UCCM) strategy. This strategy explicitly models projection parameters to unify diverse camera setups, thus bypassing the need for learning viewpoint-dependent correspondences. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on a newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Extensive evaluations confirm that XYZCylinder not only achieves state-of-the-art performance under different evaluation settings but also demonstrates remarkable compatibility in entirely new scenes with different camera settings in a zero-shot manner. Project page: https://yuyuyu223.github.io/XYZCYlinder-projectpage/
comment: Feed-Forward, 3D Gaussian Splatting, Project page: https://yuyuyu223.github.io/XYZCYlinder-projectpage/
Uncovering and Mitigating Destructive Multi-Embedding Attacks in Deepfake Proactive Forensics
With the rapid evolution of deepfake technologies and the wide dissemination of digital media, personal privacy is facing increasingly serious security threats. Deepfake proactive forensics, which involves embedding imperceptible watermarks to enable reliable source tracking, serves as a crucial defense against these threats. Although existing methods show strong forensic ability, they rely on an idealized assumption of single watermark embedding, which proves impractical in real-world scenarios. In this paper, we formally define and demonstrate the existence of Multi-Embedding Attacks (MEA) for the first time. When a previously protected image undergoes additional rounds of watermark embedding, the original forensic watermark can be destroyed or removed, rendering the entire proactive forensic mechanism ineffective. To address this vulnerability, we propose a general training paradigm named Adversarial Interference Simulation (AIS). Rather than modifying the network architecture, AIS explicitly simulates MEA scenarios during fine-tuning and introduces a resilience-driven loss function to enforce the learning of sparse and stable watermark representations. Our method enables the model to maintain the ability to extract the original watermark correctly even after a second embedding. Extensive experiments demonstrate that our plug-and-play AIS training paradigm significantly enhances the robustness of various existing methods against MEA.
Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look-ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer-grained than states of spatial relationships. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we present a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we present a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.
comment: 20 pages, 6 figures
Generalizable cardiac substructures segmentation from contrast and non-contrast CTs using pretrained transformers
Automated AI segmentations for radiation treatment planning deteriorate when applied to cases with different characteristics than the training dataset. We developed a hybrid transformer convolutional network to segment cardiac substructures in lung and breast cancer patients with varying imaging contrasts and scan positions. Cohort I (56 contrast-enhanced CT [CECT], 124 non-contrast CT [NCCT] scans from lung cancer patients, supine position) was used to train an oracle model (180 cases), contrast-only model (56 CECTs), and balanced model (32 CECT, 32 NCCT). All models were evaluated on 60 held-out cohort I patients and 66 cohort II breast cancer patients (45 supine, 21 prone). Accuracy was measured using Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95), and dosimetric metrics, with TotalSegmentator as benchmark. Oracle and balanced models achieved similar accuracy (DSC: Oracle vs Balanced: Cohort I: 0.84 $\pm$ 0.10 vs 0.82 $\pm$ 0.10; Cohort II: 0.81 $\pm$ 0.12 vs 0.80 $\pm$ 0.13), both outperforming TotalSegmentator and the contrast-only models. The balanced model, using 64% fewer training cases, produced dosimetrically equivalent contours to manual delineations. It was robust to contrast variations (6 out of 8 substructures) and positioning variations (5 out of 8 substructures), with low correlation to patient age or body mass index. Our balanced model demonstrated robust geometric and dosimetric accuracy across varying imaging protocols and patient characteristics, which is essential for clinical deployment. Combining pretraining with balanced NCCT/CECT distribution enabled reliable segmentation with substantially fewer labeled cases than conventional approaches.
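As a reference point for the accuracy figures above, here is a minimal sketch of how a Dice similarity coefficient (DSC) can be computed for a pair of binary masks. The function name `dice_coefficient`, the flat 0/1 list representation, and the empty-mask convention are illustrative assumptions and are not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: each mask has 3 foreground voxels, 2 of them shared.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(pred, truth))
```

In practice the masks would be 3D volumes per substructure, and the reported DSC would be averaged over cases.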
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness NeurIPS 2025
Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with \textit{precision} and \textit{hit} metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, \textit{know but cannot tell} ($K\bar{T}$), indicating a significant performance gap between QA and caption capabilities. Our work provides a holistic analysis of MLLMs' captioning abilities, as we identify their strengths and weaknesses across various dimensions, guiding future research to enhance specific aspects of their capabilities.
comment: Accepted to NeurIPS 2025
Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling
The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for tasks relying on fine-grained spatial relationships, such as OCR and Chart&Table Understanding. In this paper, we propose to resample original vision features across the LLM decoder layers to recover visual information and attain efficiency. Following this principle, we introduce Vision Remember, which includes two key modules: (1) Token-Feature Cross-Attention Layer and (2) Token Bidirectional Self-Attention Layer. In the Token bidirectional attention, we employ a self-attention mechanism to maintain the bidirectional interaction between vision tokens and the text-guided token. In the Token-Feature interaction attention, we introduce local cross-attention to resample the visual features and use multi-level fusion to enrich the visual representation. We conduct comprehensive experiments on multiple visual understanding benchmarks, and the results with the LLaVA-NeXT baseline show that Vision Remember outperforms TokenPacker by +2.7 and FastV by +5.7 across nearly all the settings. Compared with previous vision feature re-fusion methods, our approach also surpasses DeepStack by +3.9 and SVA Aggregator by +3.4 on the same baseline. The experimental results validate the generalization capability of the proposed method when combined with various efficient vision projectors and LVLMs.
BRIC: Bridging Kinematic Plans and Physical Control at Test Time AAAI'26
We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
comment: Accepted to AAAI'26
FlowTok: Flowing Seamlessly Across Text and Image Tokens
Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm: directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds, all while delivering performance comparable to state-of-the-art models. Code is available at https://github.com/TACJu/FlowTok.
comment: Project page at https://tacju.github.io/projects/flowtok.html
Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models
Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).
ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection
Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: First, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; Second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation. Our code is available at https://github.com/Ruize-Ma/ConceptGuard.
Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning
Although multimodal large language models (MLLMs) excel in high-level vision-language reasoning, they lack inherent awareness of visual saliency, making it difficult to identify key visual elements. To bridge this gap, we propose Saliency-R1, the first unified MLLM framework that jointly tackles three representative and heterogeneous saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD), enhancing the model's capacity for saliency reasoning. We introduce a textual interface with structured tags (, ) to encode region- and instance-level referring expressions, enabling a single referring segmenter to produce task-appropriate masks. To train the MLLM efficiently, we propose Confidence-Guided Policy Optimization (CGPO), a novel single-sample reinforcement learning algorithm. CGPO improves on GRPO by replacing group-normalized advantages with a per-sample signal based on reward-confidence discrepancy, thereby reducing computational waste, mitigating signal dilution, and lowering training overhead. Our model exceeds or matches the performance of robust open/closed-source MLLMs and specialized state-of-the-art methods across all three tasks, demonstrating the efficacy of our framework in saliency reasoning.
comment: Main text (excluding references): 8 pages, 4 figures; Supplementary Materials (excluding references): 9 pages, 10 figures
Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving AAAI 2026
Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It offers significantly greater stealth and lethality than existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.
comment: AAAI 2026
Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling
Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. Furthermore, we introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)$^2$, and CAS(ME)$^3$ datasets demonstrate the effectiveness of our proposed framework.
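The soft pseudo-labeling idea described above (a Gaussian centered on the pseudo-apex frame, with width tied to the estimated duration) can be illustrated with a small sketch. The function name `gaussian_soft_labels`, the `sigma_scale` parameter, and the choice to assign zero outside the estimated duration are hypothetical simplifications for illustration, not the GIM module as published.

```python
import math


def gaussian_soft_labels(num_frames, apex, duration, sigma_scale=0.25):
    """Per-frame soft intensity labels for one expression instance.

    The label is 1.0 at the (pseudo-)apex frame and decays as a Gaussian.
    Frames outside the estimated duration get 0.0 (assumed neutral).
    sigma_scale is a hypothetical knob linking duration to the Gaussian width.
    """
    sigma = max(duration * sigma_scale, 1e-6)
    half = duration / 2.0
    labels = []
    for t in range(num_frames):
        if abs(t - apex) <= half:
            labels.append(math.exp(-((t - apex) ** 2) / (2.0 * sigma ** 2)))
        else:
            labels.append(0.0)
    return labels


# Toy clip of 11 frames with the apex at frame 5 and an estimated duration of 6 frames.
for frame, value in enumerate(gaussian_soft_labels(11, apex=5, duration=6)):
    print(frame, round(value, 3))
```

Compared with hard 0/1 pseudo-labels, frames near the instance boundary receive intermediate targets, which is the property the abstract attributes to soft pseudo-labeling.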
Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks
Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.
Thinking in 360°: Humanoid Visual Search in the Wild
Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
comment: Website: https://humanoid-vstar.github.io/ ; Code: https://github.com/humanoid-vstar/hstar
PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias emphasizes short-range continuity but undermines the model's capacity to capture long-range dependencies, hindering its ability to enforce global structural properties such as symmetry, consistent topology, and large-scale geometric regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. In addition, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, in dense generation with 8,192 points, PointNSP's advantages become even more pronounced, underscoring its scalability potential.
comment: This work was intended as a replacement of arXiv:2503.08594 and any subsequent updates will appear there
VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild
Reconstructing topologically consistent facial geometry is crucial for digital avatar creation pipelines. Existing methods either require tedious manual effort, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that applies the 3D foundation model VGGT to topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0.
Image and Video Processing
Entropy Coding for Non-Rectangular Transform Blocks using Partitioned DCT Dictionaries for AV1
Recent video codecs such as VVC and AV1 apply a Non-rectangular (NR) partitioning to combine prediction signals using a smooth blending around the boundary, followed by a rectangular transform on the whole block. The NR signal transformation is not yet supported. A transformation technique that applies the same partitioning to the 2D Discrete Cosine Transform (DCT) bases and finds a sparse representation of the NR signal in such a dictionary showed promising gains in an experimental setup outside the reference software. This method uses the regular inverse transformation at the decoder to reconstruct a rectangular signal and discards the signal outside the region of interest. This design is appealing due to the minimal changes required at the decoder. However, current entropy coding schemes are not well-suited for optimally encoding these coefficients because they are primarily designed for DCT coefficients. This work introduces an entropy coding method that efficiently codes these transform coefficients by effectively modeling their properties. The design offers significant theoretical rate savings, estimated using conditional entropy, particularly for scenarios that are more dissimilar to DCT in an experimental setup.
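To make the notion of rate savings estimated via conditional entropy concrete, the sketch below compares the empirical entropy of a symbol stream with its entropy conditioned on a context. The symbol and context streams are toy data, and the helper names `entropy` and `conditional_entropy` are illustrative; the actual context modeling used for the partitioned-DCT coefficients is not described in the abstract and is not reproduced here.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical Shannon entropy H(X) in bits per symbol."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())


def conditional_entropy(symbols, contexts):
    """Empirical H(X | C) = sum over c of p(c) * H(X | C = c), in bits per symbol."""
    n = len(symbols)
    groups = {}
    for x, c in zip(symbols, contexts):
        groups.setdefault(c, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())


# Toy data: a binary "coefficient is significant" flag and a two-valued context.
symbols = [0, 0, 0, 1, 1, 1, 1, 0]
contexts = ["a", "a", "a", "a", "b", "b", "b", "b"]

h_x = entropy(symbols)
h_x_given_c = conditional_entropy(symbols, contexts)
print("H(X)   =", round(h_x, 4))
print("H(X|C) =", round(h_x_given_c, 4))
print("estimated saving (bits/symbol) =", round(h_x - h_x_given_c, 4))
```

The gap between the two quantities is an idealized bound: a context-adaptive arithmetic coder can approach H(X|C) only if its probability models track the conditional statistics well.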
Semantic-Enhanced Feature Matching with Learnable Geometric Verification for Cross-Modal Neuron Registration
Accurately registering in-vivo two-photon and ex-vivo fluorescence micro-optical sectioning tomography images of individual neurons is critical for structure-function analysis in neuroscience. This task is profoundly challenging due to a significant cross-modality appearance gap, the scarcity of annotated data and severe tissue deformations. We propose a novel deep learning framework to address these issues. Our method introduces a semantic-enhanced hybrid feature descriptor, which fuses the geometric precision of local features with the contextual robustness of a vision foundation model DINOV3 to bridge the modality gap. To handle complex deformations, we replace traditional RANSAC with a learnable Geometric Consistency Confidence Module, a novel classifier trained to identify and reject physically implausible correspondences. A data-efficient two-stage training strategy, involving pre-training on synthetically deformed data and fine-tuning on limited real data, overcomes the data scarcity problem. Our framework provides a robust and accurate solution for high-precision registration in challenging biomedical imaging scenarios, enabling large-scale correlative studies.
Knowledge Distillation for Continual Learning of Biomedical Neural Fields
Neural fields are increasingly used as a light-weight, continuous, and differentiable signal representation in (bio)medical imaging. However, unlike discrete signal representations such as voxel grids, neural fields cannot be easily extended. As neural fields are, in essence, neural networks, prior signals represented in a neural field will degrade when the model is presented with new data due to catastrophic forgetting. This work examines the extent to which different neural field approaches suffer from catastrophic forgetting and proposes a strategy to mitigate this issue. We consider the scenario in which data becomes available incrementally, with only the most recent data available for neural field fitting. In a series of experiments on cardiac cine MRI data, we demonstrate how knowledge distillation mitigates catastrophic forgetting when the spatiotemporal domain is enlarged or the dimensionality of the represented signal is increased. We find that the amount of catastrophic forgetting depends, to a large extent, on the neural fields model used, and that distillation could enable continual learning in neural fields.
comment: 5 pages, 6 figures. Submitted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Deep Parameter Interpolation for Scalar Conditioning
We propose deep parameter interpolation (DPI), a general-purpose method for transforming an existing deep neural network architecture into one that accepts an additional scalar input. Recent deep generative models, including diffusion models and flow matching, employ a single neural network to learn a time- or noise level-dependent vector field. Designing a network architecture to accurately represent this vector field is challenging because the network must integrate information from two different sources: a high-dimensional vector (usually an image) and a scalar. Common approaches either encode the scalar as an additional image input or combine scalar and vector information in specific network components, which restricts architecture choices. Instead, we propose to maintain two learnable parameter sets within a single network and to introduce the scalar dependency by dynamically interpolating between the parameter sets based on the scalar value during training and sampling. DPI is a simple, architecture-agnostic method for adding scalar dependence to a neural network. We demonstrate that our method improves denoising performance and enhances sample quality for both diffusion and flow matching models, while achieving computational efficiency comparable to standard scalar conditioning techniques. Code is available at https://github.com/wustl-cig/parameter_interpolation.
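The core mechanism described above, keeping two parameter sets and blending them according to the scalar input, can be sketched in a few lines. This is a toy illustration under stated assumptions: the linear blend, the requirement that the scalar is already mapped to [0, 1], and the names `interpolate_params` and `linear_layer` are all hypothetical and may differ from the released implementation.

```python
def interpolate_params(params_a, params_b, s):
    """Blend two parameter sets with weight s in [0, 1] (assumed linear interpolation)."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return {
        name: [(1.0 - s) * a + s * b for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }


def linear_layer(x, params):
    """Single-output linear layer: y = w . x + b."""
    return sum(w * xi for w, xi in zip(params["weight"], x)) + params["bias"][0]


# Two learnable parameter sets for the same layer (toy values).
params_a = {"weight": [1.0, 0.0], "bias": [0.0]}
params_b = {"weight": [0.0, 2.0], "bias": [1.0]}

x = [3.0, 4.0]
for s in (0.0, 0.5, 1.0):
    # s stands in for a normalized time or noise level.
    print(s, linear_layer(x, interpolate_params(params_a, params_b, s)))
```

Because the scalar only enters through the blended parameters, the layer's forward pass is unchanged, which is what makes this style of conditioning independent of the architecture.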
Performance Evaluation of Low-Latency Live Streaming of MPEG-DASH UHD video over Commercial 5G NSA/SA Network
5G Standalone (SA) is the goal of the 5G evolution, which aims to provide higher throughput and lower latency than the existing LTE network. One of the main applications of 5G is the real-time distribution of Ultra High-Definition (UHD) content with a resolution of 4K or 8K. In Q2/2021, Advanced Info Service (AIS), the biggest operator in Thailand, launched 5G SA, providing both 5G SA/NSA service nationwide in addition to the existing LTE network. While many parts of the world are still in the process of rolling out the first phase of 5G in Non-Standalone (NSA) mode, 5G SA in Thailand already covers more than 76% of the population. In this paper, UHD video is live-streamed in real time via MPEG-DASH over different mobile network technologies with a minimal buffer size to provide the lowest latency. Performance metrics such as the number of dropped segments, MAC throughput, and latency are then evaluated in various situations: stationary, moving in an urban area, moving at high speed, and an ideal condition with maximum SINR. It was found that 5G SA can deliver more than 95% of the UHD video segments successfully within the required time window in all situations, while 5G NSA produced mixed results depending on the condition of the LTE network. The results also reveal that the LTE network failed to deliver more than 20% of the video segments within the deadline, which shows that 5G SA is absolutely necessary for low-latency UHD video streaming and that 5G NSA may not be good enough for such a task, as it relies on the legacy control signal.
comment: 2022 International Conference on Computer Communications and Networks (ICCCN), 25-28 July 2022, Honolulu, HI, USA
Digital Elevation Model Estimation from RGB Satellite Imagery using Generative Deep Learning
Digital Elevation Models (DEMs) are vital datasets for geospatial applications such as hydrological modeling and environmental monitoring. However, conventional methods to generate DEM, such as using LiDAR and photogrammetry, require specific types of data that are often inaccessible in resource-constrained settings. To alleviate this problem, this study proposes an approach to generate DEM from freely available RGB satellite imagery using generative deep learning, particularly based on a conditional Generative Adversarial Network (GAN). We first developed a global dataset consisting of 12K RGB-DEM pairs using Landsat satellite imagery and NASA's SRTM digital elevation data, both from the year 2000. A unique preprocessing pipeline was implemented to select high-quality, cloud-free regions and aggregate normalized RGB composites from Landsat imagery. Additionally, the model was trained in a two-stage process, where it was first trained on the complete dataset and then fine-tuned on high-quality samples filtered by Structural Similarity Index Measure (SSIM) values to improve performance on challenging terrains. The results demonstrate promising performance in mountainous regions, achieving an overall mean root-mean-square error (RMSE) of 0.4671 and a mean SSIM score of 0.2065 (scale -1 to 1), while highlighting limitations in lowland and residential areas. This study underscores the importance of meticulous preprocessing and iterative refinement in generative modeling for DEM generation, offering a cost-effective and adaptive alternative to conventional methods while emphasizing the challenge of generalization across diverse terrains worldwide.
comment: 5 pages, 4 figures, accepted at IGARSS 2025 conference
Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data
Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical imaging. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 is widely used for annotation in 3D medical workflows, SAM 3 introduces a new perception backbone, detector-tracker pipeline, and concept-level prompting that may alter its behavior under spatial prompts. We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting, with concept mechanisms disabled. We assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 without customization. We benchmark both models on 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. Prompts are restricted to the first frame and use four modes: single-click, multi-click, bounding box, and dense mask. This design standardizes preprocessing, prompt placement, propagation rules, and metric computation to disentangle prompt interpretation from propagation. Prompt-frame analysis shows that SAM 3 provides substantially stronger initialization than SAM 2 for click prompting across most structures. In full-volume analysis, SAM 3 retains this advantage for complex, vascular, and soft-tissue anatomies, emerging as the more versatile general-purpose segmenter. While SAM 2 remains competitive for compact, rigid organs under strong spatial guidance, it frequently fails on challenging targets where SAM 3 succeeds. Overall, our results suggest that SAM 3 is the superior default choice for most medical segmentation tasks, particularly those involving sparse user interaction or complex anatomical topology.
Attention-Guided Fair AI Modeling for Skin Cancer Diagnosis
Artificial intelligence (AI) has shown remarkable promise in dermatology, offering accurate and non-invasive diagnosis of skin cancer. While extensive research has addressed skin tone-related bias, gender bias in dermatologic AI remains underexplored, leading to unequal care and reinforcing existing gender disparities. In this study, we developed LesionAttn, a fairness-aware algorithm that integrates clinical knowledge into model design by directing attention toward lesion regions, mirroring the diagnostic focus of clinicians. Combined with Pareto-frontier optimization for dual-objective model selection, LesionAttn balances fairness and predictive accuracy. Validated on two large-scale dermatological datasets, LesionAttn significantly mitigates gender bias while maintaining high diagnostic performance, outperforming existing bias mitigation algorithms. Our study highlights the potential of embedding clinical knowledge into AI development to advance both model performance and fairness, and further to foster interdisciplinary collaboration between clinicians and AI developers.
LMLCC-Net: A Semi-Supervised Deep Learning Model for Lung Nodule Malignancy Prediction from CT Scans using a Novel Hounsfield Unit-Based Intensity Filtering
Lung cancer is the leading cause of patient mortality in the world. Early diagnosis of malignant pulmonary nodules in CT images can have a significant impact on reducing disease mortality and morbidity. In this work, we propose LMLCC-Net, a novel deep learning framework for classifying nodules from CT scan images using a 3D CNN, considering Hounsfield Unit (HU)-based intensity filtering. Benign and malignant nodules have significant differences in their HU intensity profiles, which have not been exploited in the literature. Our method considers the intensity pattern as well as the texture for the prediction of malignancies. LMLCC-Net extracts features from multiple branches that each use a separate learnable HU-based intensity filtering stage. Various combinations of branches and learnable ranges of filters were explored to finally produce the best-performing model. In addition, we propose a semi-supervised learning scheme for labeling ambiguous cases and also developed a lightweight model to classify the nodules. The proposed LMLCC-Net was evaluated using the LUNA16 dataset. Our proposed method achieves a classification accuracy of 91.96%, a sensitivity of 92.94%, and an area under the curve of 94.07%, showing improved performance compared to existing methods. The proposed method can have a significant impact in helping radiologists in the classification of pulmonary nodules and improving patient care.
comment: 12 pages, 9 figures, 7 tables
One-Step Diffusion-Based Image Compression with Semantic Distillation NeurIPS 2025
While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 39% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Project: https://onedc-codec.github.io/
comment: Accepted by NeurIPS 2025
GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction NeurIPS 2025
Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Generalizable cardiac substructures segmentation from contrast and non-contrast CTs using pretrained transformers
Automated AI segmentations for radiation treatment planning deteriorate when applied to cases with different characteristics than the training dataset. We developed a hybrid transformer convolutional network to segment cardiac substructures in lung and breast cancer patients with varying imaging contrasts and scan positions. Cohort I (56 contrast-enhanced CT [CECT], 124 non-contrast CT [NCCT] scans from lung cancer patients, supine position) was used to train an oracle model (180 cases), contrast-only model (56 CECTs), and balanced model (32 CECT, 32 NCCT). All models were evaluated on 60 held-out cohort I patients and 66 cohort II breast cancer patients (45 supine, 21 prone). Accuracy was measured using Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95), and dosimetric metrics, with TotalSegmentator as benchmark. Oracle and balanced models achieved similar accuracy (DSC: Oracle vs Balanced: Cohort I: 0.84 $\pm$ 0.10 vs 0.82 $\pm$ 0.10; Cohort II: 0.81 $\pm$ 0.12 vs 0.80 $\pm$ 0.13), both outperforming TotalSegmentator and the contrast-only models. The balanced model, using 64% fewer training cases, produced dosimetrically equivalent contours to manual delineations. It was robust to contrast variations (6 out of 8 substructures) and positioning variations (5 out of 8 substructures), with low correlation to patient age or body mass index. Our balanced model demonstrated robust geometric and dosimetric accuracy across varying imaging protocols and patient characteristics, which is essential for clinical deployment. Combining pretraining with balanced NCCT/CECT distribution enabled reliable segmentation with substantially fewer labeled cases than conventional approaches.
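For readers unfamiliar with the Dice similarity coefficient (DSC) reported above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name, the toy masks, and the empty-mask convention are illustrative assumptions, not taken from the paper. The last lines check the "64% fewer training cases" figure from the cohort sizes stated in the abstract (64 balanced cases versus 180 oracle cases).

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks empty: treated here as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Hypothetical toy masks: 2 shared foreground voxels, 3 foreground voxels in each mask.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dsc = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)

# Training-set reduction of the balanced model relative to the oracle model.
reduction = 1 - 64 / 180
```

A DSC of 1.0 means the masks coincide exactly and 0.0 means they share no foreground voxel, so the reported values of 0.80 to 0.84 indicate substantial but imperfect overlap with the manual delineations.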
DEMIST: Decoupled Multi-stream latent diffusion for Quantitative Myelin Map Synthesis
Quantitative magnetization transfer (qMT) imaging provides myelin-sensitive biomarkers, such as the pool size ratio (PSR), which is valuable for multiple sclerosis (MS) assessment. However, qMT requires specialized 20-30 minute scans. We propose DEMIST to synthesize PSR maps from standard T1w and FLAIR images using a 3D latent diffusion model with three complementary conditioning mechanisms. Our approach has two stages: first, we train separate autoencoders for PSR and anatomical images to learn aligned latent representations. Second, we train a conditional diffusion model in this latent space on top of a frozen diffusion foundation backbone. Conditioning is decoupled into: (i) semantic tokens via cross-attention, (ii) spatial per-scale residual hints via a 3D ControlNet branch, and (iii) adaptive LoRA-modulated attention. We include edge-aware loss terms to preserve lesion boundaries and alignment losses to maintain quantitative consistency, while keeping the number of trainable parameters low and retaining the inductive bias of the pretrained model. We evaluate on 163 scans from 99 subjects using 5-fold cross-validation. Our method outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth. Our code is publicly available at https://github.com/MedICL-VU/MS-Synthesis-3DcLDM.
Diffusion Algorithm for Metalens Optical Aberration Correction
Metalenses offer a path toward creating ultra-thin optical systems, but they inherently suffer from severe, spatially varying optical aberrations, especially chromatic aberration, which makes image reconstruction a significant challenge. This paper presents a novel algorithmic solution to this problem, designed to reconstruct a sharp, full-color image from two inputs: a sharp, bandpass-filtered grayscale "structure image" and a heavily distorted "color cue" image, both captured by the metalens system. Our method utilizes a dual-branch diffusion model, built upon a pre-trained Stable Diffusion XL framework, to fuse information from the two inputs. We demonstrate through quantitative and qualitative comparisons that our approach significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing the image.
comment: 5 pages, 4 figures
Composition and Alignment of Diffusion Models using Constrained Learning NeurIPS 2025
Diffusion models have become prevalent in generative modeling due to their ability to sample from complex distributions. To improve the quality of generated samples and their compliance with user requirements, two commonly used methods are: (i) Alignment, which involves finetuning a diffusion model to align it with a reward; and (ii) Composition, which combines several pretrained diffusion models together, each emphasizing a desirable attribute in the generated outputs. However, trade-offs often arise when optimizing for multiple rewards or combining multiple models, as they can often represent competing properties. Existing methods cannot guarantee that the resulting model faithfully generates samples with all the desired properties. To address this gap, we propose a constrained optimization framework that unifies alignment and composition of diffusion models by enforcing that the aligned model satisfies reward constraints and/or remains close to each pretrained model. We provide a theoretical characterization of the solutions to the constrained alignment and composition problems and develop a Lagrangian-based primal-dual training algorithm to approximate these solutions. Empirically, we demonstrate our proposed approach in image generation, applying it to alignment and composition, and show that our aligned or composed model satisfies constraints effectively. Our implementation can be found at: https://github.com/shervinkhalafi/constrained_comp_align
comment: 48 pages, 6 figures, 15 tables; Accepted to NeurIPS 2025
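The abstract above mentions a Lagrangian-based primal-dual training algorithm without giving details. As a rough illustration of the general idea only (not the authors' algorithm, which operates on diffusion models), the sketch below runs gradient descent on the primal variable and projected gradient ascent on the multiplier for a toy scalar problem: minimize (x - 2)^2 subject to x <= 1. The objective, constraint, step size, and iteration count are all assumptions chosen for illustration.

```python
def primal_dual(steps=2000, eta=0.05):
    """Toy primal-dual iteration for: minimize (x - 2)^2 subject to x - 1 <= 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: descend the Lagrangian L(x, lam) = (x - 2)^2 + lam * (x - 1) in x.
        grad_x = 2.0 * (x - 2.0) + lam
        x -= eta * grad_x
        # Dual step: ascend in lam on the constraint violation, projected onto lam >= 0.
        lam = max(0.0, lam + eta * (x - 1.0))
    return x, lam


x_star, lam_star = primal_dual()
```

The unconstrained minimizer is x = 2, which violates the constraint, so the multiplier grows until the iterate settles on the boundary x = 1 with multiplier 2. In the constrained-alignment setting, the multiplier plays the analogous role of weighting each reward or closeness constraint according to how strongly it is violated.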
Multimedia
PixelatedScatter: Arbitrary-level Visual Abstraction for Large-scale Multiclass Scatterplots
Overdraw is inevitable in large-scale scatterplots. Current scatterplot abstraction methods lose features in medium-to-low density regions. We propose a visual abstraction method designed to provide better feature preservation across arbitrary abstraction levels for large-scale scatterplots, particularly in medium-to-low density regions. The method consists of three closely interconnected steps: first, we partition the scatterplot into iso-density regions and equalize visual density; then, we allocate pixels for different classes within each region; finally, we reconstruct the data distribution based on pixels. User studies, quantitative and qualitative evaluations demonstrate that, compared to previous methods, our approach better preserves features and exhibits a special advantage when handling ultra-high dynamic range data distributions.
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
Sound effect editing (modifying audio by adding, removing, or replacing elements) remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based feature gating training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.
Performance Evaluation of Low-Latency Live Streaming of MPEG-DASH UHD video over Commercial 5G NSA/SA Network
5G Standalone (SA) is the goal of the 5G evolution, which aims to provide higher throughput and lower latency than the existing LTE network. One of the main applications of 5G is the real-time distribution of Ultra High-Definition (UHD) content with a resolution of 4K or 8K. In Q2/2021, Advanced Info Service (AIS), the biggest operator in Thailand, launched 5G SA, providing both 5G SA/NSA service nationwide in addition to the existing LTE network. While many parts of the world are still in the process of rolling out the first phase of 5G in Non-Standalone (NSA) mode, 5G SA in Thailand already covers more than 76% of the population. In this paper, UHD video is streamed live in real time via MPEG-DASH over different mobile network technologies with a minimal buffer size to provide the lowest latency. Performance metrics such as the number of dropped segments, MAC throughput, and latency are then evaluated in various situations: stationary, moving in an urban area, moving at high speed, and an ideal condition with maximum SINR. It was found that 5G SA can deliver more than 95% of the UHD video segments successfully within the required time window in all situations, while 5G NSA produced mixed results depending on the condition of the LTE network. The results also reveal that the LTE network failed to deliver more than 20% of the video segments within the deadline, which shows that 5G SA is absolutely necessary for low-latency UHD video streaming and that 5G NSA may not be good enough for such a task, as it relies on the legacy control signal.
comment: 2022 International Conference on Computer Communications and Networks (ICCCN), 25-28 July 2022, Honolulu, HI, USA
3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation
Text-to-video (T2V) diffusion models have recently achieved impressive visual quality, yet most systems still generate silent clips and treat audio as a secondary concern. Existing audio-video generation pipelines typically decompose the task into cascaded stages, which accumulate errors across modalities and are trained under separate objectives. Recent joint audio-video generators alleviate this issue but often rely on dual-tower architectures with ad-hoc cross-modal bridges and static, single-shot text conditioning, making it difficult to both reuse T2V backbones and to reason about how audio, video and language interact over time. To address these challenges, we propose 3MDiT, a unified tri-modal diffusion transformer for text-driven synchronized audio-video generation. Our framework models video, audio and text as jointly evolving streams: an isomorphic audio branch mirrors a T2V backbone, tri-modal omni-blocks perform feature-level fusion across the three modalities, and an optional dynamic text conditioning mechanism updates the text representation as audio and video evidence co-evolve. The design supports two regimes: training from scratch on audio-video data, and orthogonally adapting a pretrained T2V model without modifying its backbone. Experiments show that our approach generates high-quality videos and realistic audio while consistently improving audio-video synchronization and tri-modal alignment across a range of quantitative metrics.
Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks
Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.
Computation and Language
Revisiting Generalization Across Difficulty Levels: It's Not So Easy
We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
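The abstract above ranks examples by difficulty using Item Response Theory (IRT) but does not say which IRT variant is fitted. As a hedged illustration of the underlying idea, the sketch below uses the one-parameter (Rasch) model, in which the probability that a model with ability theta answers an item of difficulty b correctly is sigmoid(theta - b). The function name and the numeric values are illustrative assumptions, and the authors' model may include further parameters.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) IRT: probability that a respondent answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values: one LLM of ability 1.0 facing an easy item and a hard item.
p_easy = p_correct(ability=1.0, difficulty=-1.0)
p_hard = p_correct(ability=1.0, difficulty=2.0)
# When ability equals difficulty, the model predicts a 50% chance of success.
p_matched = p_correct(ability=0.5, difficulty=0.5)
```

Under this model, fitting the difficulty parameters to the response patterns of many LLMs yields a per-example ranking that depends only on which models answer which items correctly, which matches the abstract's point that human opinions of difficulty are excluded.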
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
comment: 21 pages, 6 figures
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
comment: code is released at https://github.com/InternRobotics/G2VLM
Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2-15x higher data generation throughput under identical hardware resources, without compromising output quality.
The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry
Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English, a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask whether Czech native speakers are able to identify it and how they judge it aesthetically. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. A logistic regression model showed that the more participants liked a poem, the less likely they were to assign its authorship accurately. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect to the training data of AI models) Slavic language such as Czech. The results suggest that readers' beliefs about authorship and the aesthetic evaluation of the poem are interconnected.
TAGFN: A Text-Attributed Graph Dataset for Fake News Detection in the Age of LLMs
Large Language Models (LLMs) have recently revolutionized machine learning on text-attributed graphs, but the application of LLMs to graph outlier detection, particularly in the context of fake news detection, remains significantly underexplored. One of the key challenges is the scarcity of large-scale, realistic, and well-annotated datasets that can serve as reliable benchmarks for outlier detection. To bridge this gap, we introduce TAGFN, a large-scale, real-world text-attributed graph dataset for outlier detection, specifically fake news detection. TAGFN enables rigorous evaluation of both traditional and LLM-based graph outlier detection methods. Furthermore, it facilitates the development of misinformation detection capabilities in LLMs through fine-tuning. We anticipate that TAGFN will be a valuable resource for the community, fostering progress in robust graph-based outlier detection and trustworthy AI. The dataset is publicly available at https://huggingface.co/datasets/kayzliu/TAGFN and our code is available at https://github.com/kayzliu/tagfn.
comment: Preprint. Under review
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
Auxiliary Metrics Help Decoding Skill Neurons in the Wild
Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified "skill neurons" via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics -- such as external labels and the model's own confidence score -- thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.
comment: 7 pages, 7 figures. Includes additional appendix
RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions
Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
comment: 12 pages, 9 figures, 8 tables
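As a rough illustration of the idea behind XParaCon as the abstract describes it (the standard deviation of accuracies across question variants), the sketch below computes per-variant accuracy and then the spread across variants. The function names, the data layout, and the choice of population standard deviation are assumptions made for illustration; the abstract does not give the metric's exact definition or normalization.

```python
from statistics import pstdev


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def accuracy_spread(predictions_by_variant, gold):
    """Standard deviation of accuracy across paraphrase variants.

    predictions_by_variant: one list of predicted answers per paraphrase
    variant, aligned with `gold`. Hypothetical layout, not the paper's.
    Lower values mean more consistent behaviour across paraphrases.
    """
    accuracies = [accuracy(preds, gold) for preds in predictions_by_variant]
    # Population standard deviation is an assumption; the abstract
    # does not say which estimator the metric uses.
    return pstdev(accuracies)


if __name__ == "__main__":
    gold = ["A", "B", "C", "D"]
    variants = [
        ["A", "B", "C", "D"],  # original wording: 4/4 correct
        ["A", "B", "C", "A"],  # paraphrase: 3/4 correct
    ]
    print(accuracy_spread(variants, gold))
```

A model that answers every paraphrase of a question the same way gets a spread of zero regardless of its absolute accuracy, which is why a measure like this is reported alongside accuracy rather than instead of it.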
Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects
Bangla Sign Language Translation (BdSLT) has been severely constrained so far, as the language itself is very low-resource. Standard sentence-level dataset creation for BdSLT is of immense importance for developing AI-based assistive tools for deaf and hard-of-hearing people of the Bangla-speaking community. In this paper, we present a dataset, IsharaKhobor, and two subsets of it for enabling research. We also present the challenges in developing the dataset and suggest a way forward by benchmarking with landmark-based raw and RQE embedding. We perform ablations on vocabulary restriction and canonicalization within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].
comment: 14 pages, 8 tables
Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation LREC 2026
Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.
comment: Submitted to LREC 2026
Hierarchical Ranking Neural Network for Long Document Readability Assessment
Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.
A Systematic Study of Model Merging Techniques in Large Language Models
Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
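For readers unfamiliar with Task Arithmetic, the sketch below shows the usual formulation: each fine-tuned checkpoint contributes a task vector (its weights minus the base weights), and the merged model is the base plus a scaled sum of those vectors. Representing weights as plain dictionaries of flat float lists, and the `scale` parameter name, are simplifications assumed here for illustration; this is not the paper's evaluation code.

```python
def task_arithmetic(base, finetuned, scale=1.0):
    """Merge checkpoints as: base + scale * sum(finetuned_i - base).

    base: dict mapping parameter name -> flat list of floats.
    finetuned: list of dicts with the same keys and shapes as `base`.
    scale: merging coefficient (hypothetical name; tuned in practice).
    """
    merged = {}
    for name, base_weights in base.items():
        # Sum the task vectors (fine-tuned minus base) element by element.
        summed_delta = [
            sum(ckpt[name][i] - base_weights[i] for ckpt in finetuned)
            for i in range(len(base_weights))
        ]
        merged[name] = [
            w + scale * d for w, d in zip(base_weights, summed_delta)
        ]
    return merged


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    checkpoints = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]
    print(task_arithmetic(base, checkpoints, scale=0.5))
```

No training step is involved: the merge is a single pass of element-wise arithmetic over the parameters, which is what makes the method cheap compared with further fine-tuning.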
Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning
Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.
comment: 32 pages, 2 figures
Subjective Depth and Timescale Transformers: Learning Where and When to Compute
The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block 'posterior' and a lightweight 'prior,' while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal 'change hypothesis' that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty to prediction driven gating over training, suggesting alignment with surprise based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute skipping layer, setting a pathway for more efficient models.
Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation
Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.
Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model
Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns -- but unreliably (~20% success in the best model). We focus on the first of these experiments -- self-report of injected "thoughts" -- and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at α=40, 0% false positives). Our model detects fleeting "thoughts" injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey's criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey's sense. These results address an open question raised by Lindsey: whether "training for introspection would help eliminate cross-model differences." We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.
comment: 16 pages, 8 figures
Prune4Web: DOM Tree Pruning Programming for Web Agent AAAI 2026
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
comment: Paper accepted to AAAI 2026
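To make the idea of programmatic DOM pruning concrete, here is a purely hypothetical example of the kind of lightweight scoring script the abstract describes: it scores each element by keyword matches from a sub-task and keeps only the top-scoring candidates. The element layout (dicts with `tag`, `text`, `attrs`), the scoring weights, and the function names are all assumptions for illustration and are not taken from Prune4Web itself.

```python
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


def score_element(element, keywords):
    """Score one DOM element against sub-task keywords (hypothetical rule)."""
    attr_values = element.get("attrs", {}).values()
    haystack = " ".join([element.get("text", ""), *attr_values]).lower()
    # One point per keyword found in the visible text or attribute values.
    score = sum(1.0 for kw in keywords if kw.lower() in haystack)
    # Small bonus for elements a user can act on, only if they matched.
    if score > 0 and element.get("tag") in INTERACTIVE_TAGS:
        score += 0.5
    return score


def prune(elements, keywords, top_k=3):
    """Keep at most top_k matching elements, best score first."""
    scored = [(score_element(e, keywords), i, e) for i, e in enumerate(elements)]
    kept = [item for item in scored if item[0] > 0]
    # Sort by descending score; ties keep original document order.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [element for _, _, element in kept[:top_k]]


if __name__ == "__main__":
    dom = [
        {"tag": "div", "text": "Welcome to our store", "attrs": {}},
        {"tag": "button", "text": "Search", "attrs": {"id": "search-btn"}},
        {"tag": "input", "text": "", "attrs": {"placeholder": "Search products"}},
        {"tag": "a", "text": "Contact us", "attrs": {"href": "/contact"}},
    ]
    for element in prune(dom, ["search"]):
        print(element["tag"], element["attrs"])
```

The point of the design is that only the short list returned by `prune` needs to be shown to the grounding model, rather than the full DOM.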
Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
comment: preprint
BanglaASTE: A Novel Framework for Aspect-Sentiment-Opinion Extraction in Bangla E-commerce Reviews Using Ensemble Deep Learning SP
Aspect-Based Sentiment Analysis (ABSA) has emerged as a critical tool for extracting fine-grained sentiment insights from user-generated content, particularly in e-commerce and social media domains. However, research on Bangla ABSA remains significantly underexplored due to the absence of comprehensive datasets and specialized frameworks for triplet extraction in this language. This paper introduces BanglaASTE, a novel framework for Aspect Sentiment Triplet Extraction (ASTE) that simultaneously identifies aspect terms, opinion expressions, and sentiment polarities from Bangla product reviews. Our contributions include: (1) creation of the first annotated Bangla ASTE dataset containing 3,345 product reviews collected from major e-commerce platforms including Daraz, Facebook, and Rokomari; (2) development of a hybrid classification framework that employs graph-based aspect-opinion matching with semantic similarity techniques; and (3) implementation of an ensemble model combining BanglaBERT contextual embeddings with XGBoost boosting algorithms for enhanced triplet extraction performance. Experimental results demonstrate that our ensemble approach achieves superior performance with 89.9% accuracy and 89.1% F1-score, significantly outperforming baseline models across all evaluation metrics. The framework effectively addresses key challenges in Bangla text processing including informal expressions, spelling variations, and data sparsity. This research advances the state-of-the-art in low-resource language sentiment analysis and provides a scalable solution for Bangla e-commerce analytics applications.
comment: Presented at the 2025 IEEE International Conference on Signal Processing, Information, Communication and Systems (SPICSCON), November 21-22, 2025, University of Rajshahi, Bangladesh. 6 pages, ensemble deep learning, 3,345 annotated Bangla product reviews
Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text
We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r $\approx$ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
comment: paper draft
TALES: A Taxonomy and Analysis of Cultural Representations in LLM-generated Stories
Millions of users across the globe turn to AI chatbots for their creative needs, inviting widespread interest in understanding how such chatbots represent diverse cultures. At the same time, evaluating cultural representations in open-ended tasks remains challenging and underexplored. In this work, we present TALES, an evaluation of cultural misrepresentations in LLM-generated stories for diverse Indian cultural identities. First, we develop TALES-Tax, a taxonomy of cultural misrepresentations by collating insights from participants with lived experiences in India through focus groups (N=9) and individual surveys (N=15). Using TALES-Tax, we evaluate 6 models through a large-scale annotation study spanning 2,925 annotations from 108 annotators with lived cultural experience from across 71 regions in India and 14 languages. Concerningly, we find that 88% of the generated stories contain one or more cultural inaccuracies, and such errors are more prevalent in mid- and low-resourced languages and stories based in peri-urban regions in India. Lastly, we transform the annotations into TALES-QA, a standalone question bank to evaluate the cultural knowledge of foundational models. Through this evaluation, we surprisingly discover that models often possess the requisite cultural knowledge despite generating stories rife with cultural misrepresentations.
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.
Developing an Open Conversational Speech Corpus for the Isan Language
This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with Central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.
comment: 31 pages, in Thai language, 3 figures, 25 tables
Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?
There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.
Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning
Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP's generalization. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.
comment: Technical Report
How to Correctly Report LLM-as-a-Judge Evaluations
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to the imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have estimates of these values, and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both the test and calibration datasets, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.
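The bias described here can be illustrated with the classical misclassification correction (often called the Rogan-Gladen estimator). If the judge marks a fraction p of answers as correct, with sensitivity q1 and specificity q0, then p = θ·q1 + (1 − θ)·(1 − q0), which rearranges to θ = (p + q0 − 1) / (q1 + q0 − 1). The sketch below applies that formula; whether it matches the paper's exact plug-in estimator is an assumption, and the confidence-interval construction and adaptive allocation mentioned in the abstract are not shown.

```python
def corrected_accuracy(judged_rate, sensitivity, specificity):
    """Estimate true accuracy from a noisy judge's pass rate.

    judged_rate: fraction of answers the LLM judge marked correct (p).
    sensitivity: P(judge says correct | answer is correct), q1.
    specificity: P(judge says incorrect | answer is incorrect), q0.
    In practice q1 and q0 are themselves estimates from a calibration
    set; this sketch treats them as given.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        # The judge is no better than chance, so p carries no usable signal.
        raise ValueError("sensitivity + specificity must exceed 1")
    theta = (judged_rate + specificity - 1.0) / denominator
    # Clip to [0, 1] because sampling noise can push the estimate outside.
    return min(1.0, max(0.0, theta))


if __name__ == "__main__":
    # A judge with q1 = 0.9 and q0 = 0.8 reports a 62% pass rate.
    print(corrected_accuracy(0.62, sensitivity=0.9, specificity=0.8))
```

In this example the raw pass rate overstates the underlying accuracy, since the judge accepts a fifth of the incorrect answers.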
MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing
Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.
ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features
This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
comment: 7 pages, 2 figures, 7 tables, Accepted to iSAI-NLP 2025
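For readers unfamiliar with the WER figures quoted in the Burmese AEC abstract above, the following is a minimal sketch of how word error rate is conventionally computed: word-level edit distance divided by the reference length. It is not the paper's evaluation code. The function name `word_error_rate` is hypothetical, and the sketch assumes the text is already segmented into whitespace-separated words, which for Burmese would require a separate segmentation step.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are pre-segmented into whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

The abstract reports WER on a 0 to 100 scale, so a value returned by this sketch would be multiplied by 100 for comparison.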
Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.
Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning
In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating three variants (Fourier-based FourierKAN, spline-based EfficientKAN, and grid-based FasterKAN) across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.
comment: 6 pages, 2 figures, 4 tables, Accepted to iSAI-NLP 2025
Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection
Detecting sarcasm remains a challenging task in Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach on three datasets: Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on SemEval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLM performance on the sarcasm detection task, particularly when the text involves culturally specific slang, references, or terms unknown to the LLMs. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.
Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering
Zipf's law in language lacks a definitive origin, debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements. The Full Combinatorial Word Model (FCWM) forms words from a finite alphabet, generating a geometric distribution of word lengths. Interacting exponential forces yield a power-law rank-frequency curve, determined by alphabet size and blank symbol probability. Simulations support predictions, matching English, Russian, and mixed-genre data. The symbolic model suggests Zipf-type laws arise from geometric constraints, not communicative efficiency.
comment: 16 pages
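The mechanism described in the Zipf abstract above (words formed from a finite alphabet, with a blank symbol probability governing word length) can be illustrated with the classic random-typing setup. This is a sketch under stated assumptions, not the paper's Full Combinatorial Word Model: the function names, the uniform letter choice, and all parameter values are hypothetical.

```python
import random
from collections import Counter


def random_typing_counts(alphabet_size: int, blank_prob: float,
                         n_chars: int, seed: int = 0) -> Counter:
    """Emit n_chars symbols; each is a blank with probability blank_prob,
    otherwise a uniformly chosen letter. Count the non-empty words.

    Assumption: uniform letters, at most 26 of them (a..z).
    """
    if not 1 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 1 and 26")
    rng = random.Random(seed)
    letters = [chr(ord("a") + i) for i in range(alphabet_size)]
    counts: Counter = Counter()
    word = []
    for _ in range(n_chars):
        if rng.random() < blank_prob:
            if word:  # a blank closes the current word
                counts["".join(word)] += 1
                word = []
        else:
            word.append(rng.choice(letters))
    return counts


def rank_frequency(counts: Counter) -> list:
    """Return (rank, frequency) pairs, most frequent word first."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


if __name__ == "__main__":
    counts = random_typing_counts(alphabet_size=4, blank_prob=0.2, n_chars=200_000)
    for rank, freq in rank_frequency(counts)[:10]:
        print(rank, freq)
```

In this setup, non-empty word lengths follow a geometric distribution with mean 1/blank_prob, while the number of distinct words of a given length grows exponentially with alphabet size. Plotting the (rank, frequency) pairs on log-log axes is one way to inspect the resulting curve.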
A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs
Offline data selection and online self-refining generation, which enhance the data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We tackle offline data selection and online self-refining generations through an optimization perspective. Specifically, bilevel data selection is used for offline data selection with respect to the validation dataset, and we treat online self-refining generation as a model adaptation step of selecting the model trained on current responses that best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight to each question and response, either explicitly or implicitly. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and demonstrate its performance gains over unfiltered direct mixing baselines. By combining offline data with validation-weighted online generations, our method enhances fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.
Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels
Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under natural demonstrations (with correct labels) and inverted demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1-12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1-12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
comment: 13 pages total (7 pages main text, 3 pages references, 3 pages appendix), 2 figures, 14 tables. Code available at https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize on modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention; and (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilities on short-context tasks, outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context tasks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading memory baselines.
comment: 30 pages, 10 figures
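The Gated KalmaNet abstract above names Chebyshev Iteration as its solver for the ridge regression problem. The sketch below shows the textbook, unpreconditioned Chebyshev iteration applied to a tiny ridge system (K^T K + lam I) w = K^T v in pure Python. It is an illustration only, not the paper's chunk-wise, gated, hardware-aware implementation: the function names, the toy data, and the choice of eigenvalue bounds (lam as lower bound, lam plus the trace of K^T K as upper bound) are all assumptions made here.

```python
def matvec(A, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def chebyshev_solve(A, b, lam_min, lam_max, n_iters=50):
    """Textbook Chebyshev iteration for a symmetric positive definite A.

    Assumption: every eigenvalue of A lies in [lam_min, lam_max].
    Unlike conjugate gradient, the step sizes depend only on these bounds,
    not on inner products of the residual.
    """
    d = (lam_max + lam_min) / 2.0  # centre of the spectral interval
    c = (lam_max - lam_min) / 2.0  # half-width of the spectral interval
    x = [0.0] * len(b)
    r = list(b)                    # residual b - A x for x = 0
    p = [0.0] * len(b)
    alpha = 0.0
    for i in range(n_iters):
        if i == 0:
            p = list(r)
            alpha = 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 1 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = [ri + beta * pi for ri, pi in zip(r, p)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    return x


def ridge_system(K, v, lam):
    """Build A = K^T K + lam I and b = K^T v for keys K and targets v."""
    dim = len(K[0])
    A = [[sum(row[i] * row[j] for row in K) + (lam if i == j else 0.0)
          for j in range(dim)] for i in range(dim)]
    b = [sum(row[i] * vi for row, vi in zip(K, v)) for i in range(dim)]
    return A, b


if __name__ == "__main__":
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy keys (hypothetical data)
    v = [1.0, 2.0, 3.0]                       # toy targets (hypothetical data)
    lam = 1.0
    A, b = ridge_system(K, v, lam)
    trace_gram = sum(A[i][i] - lam for i in range(len(A)))
    w = chebyshev_solve(A, b, lam_min=lam, lam_max=lam + trace_gram)
    print(w)  # closed-form solution of this 2x2 system is [0.875, 1.375]
```

Because the regularizer lam sets the lower eigenvalue bound, a larger lam narrows the spectral interval and speeds convergence, which is consistent with the abstract's remark that regularization controls the condition number.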
TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models
Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While giving various types of answers, such as examples and paraphrases, is an easy task for humans, LLMs struggle to provide correct answers to queries other than definition-type ones. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs' answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance for definition-type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.
comment: under review
RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
comment: Work in progress
Towards Audio Token Compression in Large Audio Language Models
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens after they are generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation, that depend on effectively uncovering the underlying lexical content of the input signal, and we study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance close to that of frame-level LALMs while reducing the input audio token count by up to three times before the LLM backbone.
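Uniform average pooling, one of the compression techniques named in the audio token abstract above, can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: the function name `average_pool_tokens`, the list-of-lists token representation, and the handling of a short final window are all hypothetical choices.

```python
def average_pool_tokens(tokens, factor):
    """Average every `factor` consecutive token embeddings into one.

    tokens: list of equal-length float vectors (encoder outputs).
    factor: pooling window size; 3 gives roughly a threefold token reduction.
    Assumption: a final window shorter than `factor` is averaged over the
    tokens it actually contains rather than padded or dropped.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    pooled = []
    for start in range(0, len(tokens), factor):
        window = tokens[start:start + factor]
        dim = len(window[0])
        pooled.append([sum(vec[k] for vec in window) / len(window)
                       for k in range(dim)])
    return pooled


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(average_pool_tokens(frames, 3))  # [[3.0, 4.0], [8.0, 9.0]]
```

With a factor of 3, five encoder frames become two pooled tokens, which matches the order of reduction the abstract reports before the LLM backbone.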
TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs SC
Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
comment: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)
Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs SIGMOD 2026
Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.
comment: This paper is accepted to SIGMOD 2026
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition (affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input), while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
comment: Preprint version
Start Making Sense(s): A Developmental Probe of Attention Specialization Using Lexical Ambiguity ACL
Despite an in-principle understanding of self-attention matrix operations in Transformer language models (LMs), it remains unclear precisely how these operations map onto interpretable computations or functions--and how or when individual attention heads develop specialized attention patterns. Here, we present a pipeline to systematically probe attention mechanisms, and we illustrate its value by leveraging lexical ambiguity--where a single word has multiple meanings--to isolate attention mechanisms that contribute to word sense disambiguation. We take a "developmental" approach: first, using publicly available Pythia LM checkpoints, we identify inflection points in disambiguation performance for each LM in the suite; in 14M and 410M, we identify heads whose attention to disambiguating words covaries with overall disambiguation performance across development. We then stress-test the robustness of these heads to stimulus perturbations: in 14M, we find limited robustness, but in 410M, we identify multiple heads with surprisingly generalizable behavior. Then, in a causal analysis, we find that ablating the target heads demonstrably impairs disambiguation performance, particularly in 14M. We additionally reproduce developmental analyses of 14M across all of its random seeds. Together, these results suggest: that disambiguation benefits from a constellation of mechanisms, some of which (especially in 14M) are highly sensitive to the position and part-of-speech of the disambiguating cue; and that larger models (410M) may contain heads with more robust disambiguation behavior. They also join a growing body of work that highlights the value of adopting a developmental perspective when probing LM mechanisms.
comment: 13 pages (main text), 5 figures (main text) 6 pages (appendix), 6 figures (appendix), journal submission to TACL ("a" decision: pre-MIT Press publication version)
A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics
We propose a novel study on authorship attribution for Chinese lyrics, a domain where clean, public datasets are sorely lacking. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model, comparing its performance against zero-shot inference using the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g. Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g. Love & Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization in Test1 (real-world data and difficult genres), but offers limited or ambiguous gains in Test2, a smaller, synthetically-augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. We conclude with recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway for improved attribution performance.
comment: 8 pages, 6 figures
Tracing How Annotators Think: Augmenting Preference Judgments with Reading Processes
We propose an annotation approach that captures not only labels but also the reading process underlying annotators' decisions, e.g., what parts of the text they focus on, re-read or skim. Using this framework, we conduct a case study on the preference annotation task, creating a dataset PreferRead that contains fine-grained annotator reading behaviors obtained from mouse tracking. PreferRead enables detailed analysis of how annotators navigate between a prompt and two candidate responses before selecting their preference. We find that annotators re-read a response in roughly half of all trials, most often revisiting the option they ultimately choose, and rarely revisit the prompt. Reading behaviors are also significantly related to annotation outcomes: re-reading is associated with higher inter-annotator agreement, whereas long reading paths and times are associated with lower agreement. These results demonstrate that reading processes provide a complementary cognitive dimension for understanding annotator reliability, decision-making and disagreement in complex, subjective NLP tasks. Our code and data are publicly available.
A Customer Journey in the Land of Oz: Leveraging the Wizard of Oz Technique to Model Emotions in Customer Service Interactions
Emotion-aware customer service needs in-domain conversational data, rich annotations, and predictive capabilities, but existing resources for emotion recognition are often out-of-domain, narrowly labeled, and focused on post-hoc detection. To address this, we conducted a controlled Wizard of Oz (WOZ) experiment to elicit interactions with targeted affective trajectories. The resulting corpus, EmoWOZ-CS, contains 2,148 bilingual (Dutch-English) written dialogues from 179 participants across commercial aviation, e-commerce, online travel agencies, and telecommunication scenarios. Our contributions are threefold: (1) Evaluate WOZ-based operator-steered valence trajectories as a design for emotion research; (2) Quantify human annotation performance and variation, including divergences between self-reports and third-party judgments; (3) Benchmark detection and forward-looking emotion inference in real-time support. Findings show neutral dominates participant messages; desire and gratitude are the most frequent non-neutral emotions. Agreement is moderate for multilabel emotions and valence, lower for arousal and dominance; self-reports diverge notably from third-party labels, aligning most for neutral, gratitude, and anger. Objective strategies often elicit neutrality or gratitude, while suboptimal strategies increase anger, annoyance, disappointment, desire, and confusion. Some affective strategies (cheerfulness, gratitude) foster positive reciprocity, whereas others (apology, empathy) can also leave desire, anger, or annoyance. Temporal analysis confirms successful conversation-level steering toward prescribed trajectories, most distinctly for negative targets; positive and neutral targets yield similar final valence distributions. Benchmarks highlight the difficulty of forward-looking emotion inference from prior turns, underscoring the complexity of proactive emotion-aware support.
Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, improving the reliability of Large Language Model (LLM) scores computed on multiple choice (MC) benchmarks. Our metric explores the response consistency of the LLMs, taking advantage of synthetically-generated questions with altered answer choices. With two intermediate scores, i.e. Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations in different benchmarks using diverse LLMs, and not only demonstrate that LLMs can present low response consistency even when they present high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers
The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the content of the paper, avoiding artifacts that would make identification trivial, and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy when k=10, where k is the number of top-ranked error text candidates generated by the LLM.
comment: 30 pages, 12 tables, 2 figures
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
comment: Project page: https://xuboshen.github.io/TimeViper; Code: https://github.com/xiaomi-research/timeviper
Leveraging Test Driven Development with Large Language Models for Reliable and Verifiable Spreadsheet Code Generation: A Research Framework
Large Language Models (LLMs), such as ChatGPT, are increasingly leveraged for generating both traditional software code and spreadsheet logic. Despite their impressive generative capabilities, these models frequently exhibit critical issues such as hallucinations, subtle logical inconsistencies, and syntactic errors, risks that are particularly acute in high-stakes domains like financial modelling and scientific computations, where accuracy and reliability are paramount. This position paper proposes a structured research framework that integrates the proven software engineering practice of Test-Driven Development (TDD) with LLM-driven generation to enhance the correctness of, reliability of, and user confidence in generated outputs. We hypothesise that a "test-first" methodology provides both technical constraints and cognitive scaffolding, guiding LLM outputs towards more accurate, verifiable, and comprehensible solutions. Our framework, applicable across diverse programming contexts, from spreadsheet formula generation to scripting languages such as Python and strongly typed languages like Rust, includes an explicitly outlined experimental design with clearly defined participant groups, evaluation metrics, and illustrative TDD-based prompting examples. By emphasising test-driven thinking, we aim to improve computational thinking, prompt engineering skills, and user engagement, particularly benefiting spreadsheet users who often lack formal programming training yet face serious consequences from logical errors. We invite collaboration to refine and empirically evaluate this approach, ultimately aiming to establish responsible and reliable LLM integration in both educational and professional development practices.
comment: 16 pages
BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali
Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction AAAI 2026
With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users' subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.
comment: Accepted by AAAI 2026 (Oral)
Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation
News recommendation systems play a critical role in alleviating information overload by delivering personalized content. A key challenge lies in jointly modeling multi-view representations of news articles and capturing the dynamic, dual-scale nature of user interests, encompassing both short- and long-term preferences. Prior methods often rely on single-view features or insufficiently model user behavior across time. In this work, we introduce Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, designed for training on limited data resources. Our approach leverages BERT-based embeddings to enhance semantic representation. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Results show that our model significantly outperforms strong baselines, achieving improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR. These findings highlight the effectiveness of our efficiency-focused hybrid model, which combines multi-view news modeling with dual-scale user representations for practical, resource-limited settings, rather than claiming absolute state-of-the-art (SOTA) performance. The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR
comment: The 18th International Conference on Multi-disciplinary Trends in Artificial Intelligence (MIWAI 2025)
Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalization of Misinformation Detection Models
This article introduces misinfo-general, a benchmark dataset for evaluating misinformation models' ability to perform out-of-distribution generalization. Misinformation changes rapidly, much more quickly than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation detectors need to be able to perform out-of-distribution generalization, an attribute they currently lack. Our benchmark uses distant labelling to enable simulating covariate shifts in misinformation content. We identify time, event, topic, publisher, political bias, and misinformation type as important axes for generalization, and we evaluate a common class of baseline models on each. Using article metadata, we show how this model fails desiderata, which is not necessarily obvious from classification metrics. Finally, we analyze properties of the data to ensure limited presence of modelling shortcuts. We make the dataset and accompanying code publicly available: https://github.com/ioverho/misinfo-general
comment: Accepted for publication in Computational Linguistics on November 23, 2025. This is the pre-MIT Press publication version
BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations
We present a unified dataset for document Question-Answering (QA), which is obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand, we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making it a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings
Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experiments were done on English. Moreover, transcriptions were used to get positive and negative pairs for the contrastive loss. This paper introduces a few-shot learning scheme to mine pairs automatically without transcriptions. On English, this results in only a small drop in performance. We also - for the first time - consider VPKL on a real low-resource language, Yoruba. While scores are reasonable, here we see a bigger drop in performance compared to using ground truth pairs because the mining is less accurate in Yoruba.
comment: Accepted at SpeD 2025
Scaling Efficient LLMs
Recent LLMs have hundreds of billions of parameters consuming vast resources. Furthermore, the so-called "AI scaling law" for transformers suggests that the number of parameters must scale linearly with the size of the data. In response, we inquire into efficient LLMs, i.e. those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, by comparing theoretical and empirical estimates of the Kullback-Leibler divergence, we derive a natural AI scaling law that the number of parameters in an efficient LLM scales as $D^\gamma$ where $D$ is the size of the training data and $\gamma \in [0.44, 0.72]$, suggesting the existence of more efficient architectures. Against this backdrop, we propose recurrent transformers, combining the efficacy of transformers with the efficiency of recurrent networks, progressively applying a single transformer layer to a fixed-width sliding window across the input sequence. Recurrent transformers (a) run in linear time in the sequence length, (b) are memory-efficient and amenable to parallel processing in large batches, (c) learn to forget history for language tasks, or accumulate history for long range tasks like copy and selective copy, and (d) are amenable to curriculum training to overcome vanishing gradients. In our experiments, we find that recurrent transformers perform favorably on benchmark tests.
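To make the scaling claim above concrete, the following minimal sketch compares how the parameter count would grow under the abstract's stated exponent range against the linear case. It assumes only the proportionality N ∝ D^γ given in the abstract; the function name `parameter_growth` is hypothetical and the constant of proportionality is left out because the abstract does not state it.

```python
def parameter_growth(data_growth: float, gamma: float) -> float:
    """Factor by which the parameter count grows when the training data
    grows by `data_growth`, assuming parameters scale as D ** gamma."""
    if data_growth <= 0:
        raise ValueError("data_growth must be positive")
    return data_growth ** gamma


# Exponent bounds quoted in the abstract; gamma = 1 is the linear case.
GAMMA_LOW, GAMMA_HIGH, GAMMA_LINEAR = 0.44, 0.72, 1.0

for label, gamma in [
    ("gamma = 0.44", GAMMA_LOW),
    ("gamma = 0.72", GAMMA_HIGH),
    ("gamma = 1 (linear)", GAMMA_LINEAR),
]:
    factor = parameter_growth(10.0, gamma)
    print(f"{label}: 10x data -> {factor:.2f}x parameters")
```

Under this reading, a tenfold increase in data would call for roughly 2.8x to 5.2x more parameters rather than 10x, which is the sense in which the sublinear law suggests more efficient architectures.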
Step-Audio-R1 Technical Report
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
comment: 22 pages, 5 figures. Technical Report
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
Think Visually, Reason Textually: Vision-Language Synergy in ARC
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.
Reasoning Transfer for an Extremely Low-Resource and Endangered Language: Bridging Languages Through Sample-Efficient Language Understanding
Recent advances have enabled Large Language Models (LLMs) to tackle reasoning tasks by generating chain-of-thought (CoT) rationales, yet these gains have largely applied to high-resource languages, leaving low-resource languages behind. In this work, we first investigate CoT techniques in extremely low-resource scenarios through previous prompting, model-editing, and fine-tuning approaches. We introduce English-Pivoted CoT Training, leveraging the insight that LLMs internally operate in a latent space aligned toward the dominant language. Given input in a low-resource language, we perform supervised fine-tuning to generate CoT in English and output the final response in the target language. Across mathematical reasoning benchmarks, our approach outperforms other baselines with up to 28.33% improvement in low-resource scenarios. Our analysis and additional experiments, including Mixed-Language CoT and Two-Stage Training, show that explicitly separating language understanding from reasoning enhances cross-lingual reasoning abilities. To facilitate future work, we also release LC2024, the first benchmark for mathematical tasks in Irish, an extremely low-resource and endangered language. Our results and resources highlight a practical pathway to multilingual reasoning without extensive retraining in every extremely low-resource language, despite data scarcity.
Characterizing Pattern Matching and Its Limits on Compositional Task Structures
Despite impressive capabilities, LLMs' successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is well predicted by the number of contexts witnessing the relevant functional equivalence. (2) We prove a tight sample complexity bound of learning a two-hop structure by identifying the exponent of the data scaling law for perfect in-domain generalization. Our empirical results align with the theoretical prediction, under 20x parameter scaling and across architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.
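The functional-equivalence notion in the abstract above can be illustrated with a small sketch: two input fragments count as equivalent when swapping one for the other never changes the output while the rest of the input is held fixed. The toy two-hop task, the helper names, and the choice of moduli below are all hypothetical illustrations, not the paper's experimental setup.

```python
from typing import Callable, Iterable, List, Tuple

Fragment = Tuple[int, int]


def two_hop_task(fragment: Fragment, context: int) -> int:
    """Toy two-hop function: the fragment is first mapped to an
    intermediate state, which is then combined with the context."""
    x1, x2 = fragment
    intermediate = (x1 + x2) % 3          # first hop
    return (2 * intermediate + context) % 5  # second hop


def witnessing_contexts(
    f: Callable[[Fragment, int], int],
    a: Fragment,
    b: Fragment,
    contexts: Iterable[int],
) -> List[int]:
    """Contexts in which fragments a and b yield the same output."""
    return [c for c in contexts if f(a, c) == f(b, c)]


def functionally_equivalent(
    f: Callable[[Fragment, int], int],
    a: Fragment,
    b: Fragment,
    contexts: Iterable[int],
) -> bool:
    """True when a and b agree in every supplied context."""
    ctx = list(contexts)
    return len(witnessing_contexts(f, a, b, ctx)) == len(ctx)


all_contexts = range(5)
# (0, 1) and (2, 2) share the intermediate state 1, so they always agree.
print(functionally_equivalent(two_hop_task, (0, 1), (2, 2), all_contexts))
# (0, 1) and (0, 0) have different intermediate states.
print(functionally_equivalent(two_hop_task, (0, 1), (0, 0), all_contexts))
```

The length of the list returned by `witnessing_contexts` corresponds loosely to the "number of contexts witnessing the relevant functional equivalence" that the abstract reports as a predictor of instance-wise success.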
LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction
Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
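For reference, the detection metrics named in the abstract above (accuracy, recall, and false-positive rate) can be computed from binary labels as in the sketch below. The function name and the example labels are made up for illustration and are not drawn from MEDEC; label 1 is assumed to mean "contains an error".

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, recall, and false-positive rate for binary error-flag detection."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Recall: share of true errors that were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # FPR: share of error-free notes that were wrongly flagged.
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical labels for eight clinical notes.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(detection_metrics(y_true, y_pred))
```

Reporting recall and FPR side by side matters here because, as the abstract notes, a prompting strategy can raise recall while also raising the false-positive rate.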
Mechanism of Task-oriented Information Removal in In-context Learning
In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states containing information for all possible tasks, leading to arbitrary outputs without focusing on the intended task, resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states by a low-rank filter effectively steers LMs toward the intended task. Building on these findings, by measuring the hidden states on carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal processes, selectively removing the redundant information from entangled non-selective representations, and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify essential attention heads inducing the removal operation, termed Denoising Heads, which enables the ablation experiments blocking the information removal operation from the inference, where the ICL accuracy significantly degrades, especially when the correct label is absent from the few-shot demonstrations, confirming both the critical role of the information removal mechanism and denoising heads.
comment: 87 pages, 90 figures, 7 tables
UniChange: Unifying Change Detection with Multimodal Large Language Model
Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.
The Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors
To explore the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese, we compared the probability distributions of DD and HD with and without sentence length fixed, and analyzed the changes in mean dependency distance (MDD) and mean hierarchical distance (MHD) as sentence length increases, along with their correlation coefficient based on the Balanced Corpus of Contemporary Written Japanese. It was found that the valency of the predicates is the underlying factor behind the trade-off relation between MDD and MHD in Japanese. Native speakers of Japanese regulate the linear complexity and hierarchical complexity through the valency of the predicates, and the relative sizes of MDD and MHD depend on whether the threshold of valency has been reached. Apart from the cognitive load, the valency of the predicates also affects the probability distributions of DD and HD. The effect of the valency of the predicates on the distribution of HD is greater than on that of DD, which leads to differences in their probability distributions and causes the mean of MDD to be lower than that of MHD.
comment: This paper has been accepted by the 13th International Quantitative Linguistics Conference QUALICO 2025
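The two measures discussed in the abstract above can be sketched for a single sentence as follows. The sketch assumes the definitions commonly used in quantitative linguistics, which the abstract itself does not spell out: dependency distance is the absolute difference between the linear positions of a word and its head, and hierarchical distance is the number of dependency edges between a word and the root. The five-word head-final example is invented for illustration.

```python
from typing import List


def mean_dependency_distance(heads: List[int]) -> float:
    """MDD of one sentence. heads[i] is the 1-based position of the head
    of word i + 1, with 0 marking the root."""
    distances = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    if not distances:
        raise ValueError("sentence has no dependencies")
    return sum(distances) / len(distances)


def mean_hierarchical_distance(heads: List[int]) -> float:
    """MHD of one sentence: mean depth of all non-root words."""

    def depth(position: int) -> int:
        steps = 0
        while heads[position - 1] != 0:
            position = heads[position - 1]
            steps += 1
            if steps > len(heads):
                raise ValueError("head list contains a cycle")
        return steps

    depths = [depth(i + 1) for i, h in enumerate(heads) if h != 0]
    if not depths:
        raise ValueError("sentence has no dependencies")
    return sum(depths) / len(depths)


# Hypothetical head-final sentence: word 5 is the root predicate.
heads = [2, 3, 5, 5, 0]
print(mean_dependency_distance(heads))    # linear complexity
print(mean_hierarchical_distance(heads))  # hierarchical complexity
```

In this example a chain of dependents keeps each linear distance short but pushes words deeper in the tree, which is the kind of trade-off between MDD and MHD that the abstract attributes to predicate valency.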
UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.
Federated Large Language Models: Current Progress and Future Directions
Large language models are rapidly gaining popularity and have been widely adopted in real-world applications. While the quality of training data is essential, privacy concerns arise during data collection. Federated learning offers a solution by allowing multiple clients to collaboratively train LLMs without sharing local data. However, FL introduces new challenges, such as model convergence issues due to heterogeneous data and high communication costs. A comprehensive study is required to address these challenges and guide future research. This paper surveys Federated learning for LLMs (FedLLM), highlighting recent advances and future directions. We focus on two key aspects: fine-tuning and prompt learning in a federated setting, discussing existing work and associated research challenges. We finally propose potential directions for federated LLMs, including pre-training, federated agents, and LLMs for federated learning.
AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise NeurIPS 2025
The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDiscovery -- a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM's prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDiscovery in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Our human evaluation further reveals that two-thirds of discoveries made by our system are surprising to domain experts as well, suggesting this is an important step towards building open-ended ASD systems.
comment: Accepted to NeurIPS 2025; https://neurips.cc/virtual/2025/loc/san-diego/poster/116398
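The abstract above describes surprise as the shift from prior to posterior belief but does not give a formula. A minimal sketch, assuming each belief is a single Bernoulli probability that the hypothesis holds and that surprise is measured as the KL divergence from prior to posterior (a common formalization of Bayesian surprise, not necessarily the one the authors use); the function name `bernoulli_surprise` and the clamping constant are hypothetical:

```python
import math


def bernoulli_surprise(prior: float, posterior: float, eps: float = 1e-9) -> float:
    """KL(posterior || prior) for two Bernoulli beliefs, in nats.

    Assumption: a belief is one probability that the hypothesis is true.
    Probabilities are clamped to (eps, 1 - eps) to avoid log(0).
    """
    q = min(max(posterior, eps), 1.0 - eps)
    p = min(max(prior, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))


if __name__ == "__main__":
    # A large belief shift scores higher than a small one.
    print(bernoulli_surprise(prior=0.2, posterior=0.9))
    print(bernoulli_surprise(prior=0.2, posterior=0.3))
```

Under this assumption, a value like this could serve as the reward signal in a tree search over hypotheses: an unchanged belief yields zero, and larger shifts yield larger rewards.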
Meursault as a Data Point
In an era dominated by datafication, the reduction of human experiences to quantifiable metrics raises profound philosophical and ethical questions. This paper explores these issues through the lens of Meursault, the protagonist of Albert Camus' The Stranger, whose emotionally detached existence epitomizes the existential concept of absurdity. Using natural language processing (NLP) techniques, including emotion detection (BERT), sentiment analysis (VADER), and named entity recognition (spaCy), this study quantifies key events and behaviors in Meursault's life. Our analysis reveals the inherent limitations of applying algorithmic models to complex human experiences, particularly those rooted in existential alienation and moral ambiguity. By examining how modern AI tools misinterpret Meursault's actions and emotions, this research underscores the broader ethical dilemmas of reducing nuanced human narratives to data points, challenging the foundational assumptions of our data-driven society. The findings presented in this paper serve as a critique of the increasing reliance on data-driven narratives and advocate for incorporating humanistic values in artificial intelligence.
comment: 7 pages, 9 figures, 4 tables
Enhancing Large Language Models for Detecting Mental Manipulation via Annotation-Free Data Augmentation and Anti-Curriculum Distillation
Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Nevertheless, detecting mental manipulation remains a largely underexplored research problem. The field faces three major challenges: (i) insufficient and hard-to-obtain training data; (ii) the covert nature of mental manipulation, which hinders detection; and (iii) the lack of real-world datasets. To address these challenges, we propose MentalMAC, a novel framework that enhances large language models' ability to detect elements of mental manipulation in multi-turn dialogue. Our approach consists of three key components: EvoSA, an annotation-free data augmentation method based on evolutionary operations and speech act theory; teacher-model-generated multi-task supervision; and progressive task-level anti-curriculum distillation. We then constructed the ReaMent dataset, comprising 5,000 real-world dialogue samples, utilizing MentalMAC-distilled models to aid in human annotation. Extensive experiments show that MentalMAC achieves up to 25.9% improvement in F1mac and 8.1% in accuracy over the best-performing baseline, outperforming commercial LLMs such as GPT-4 and Claude-3.5-Sonnet. Warning: This paper contains content that may be offensive to the reader.
comment: Preprint
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness NeurIPS 2025
Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with precision and hit metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, know but cannot tell ($K\bar{T}$), indicating a significant performance gap between QA and caption capabilities. Our work provides a holistic analysis of MLLMs' captioning abilities, as we identify their strengths and weaknesses across various dimensions, guiding future research to enhance specific aspects of their capabilities.
comment: Accepted to NeurIPS 2025
Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position AAAI 2026
Diffusion Large Language Models (dLLMs) have recently emerged as a competitive non-autoregressive paradigm due to their unique training and inference approach. However, there is currently a lack of safety study on this novel architecture. In this paper, we present the first analysis of dLLMs' safety performance and propose a novel safety alignment method tailored to their unique generation characteristics. Specifically, we identify a critical asymmetry between the defender and attacker in terms of security. For the defender, we reveal that the middle tokens of the response, rather than the initial ones, are more critical to the overall safety of dLLM outputs; this seems to suggest that aligning middle tokens can be more beneficial to the defender. The attacker, on the contrary, may have limited power to manipulate middle tokens, as we find dLLMs have a strong tendency towards a sequential generation order in practice, forcing the attack to meet this distribution and diverting it from influencing the critical middle tokens. Building on this asymmetry, we introduce Middle-tOken Safety Alignment (MOSA), a novel method that directly aligns the model's middle generation with safe refusals exploiting reinforcement learning. We implement MOSA and compare its security performance against eight attack methods on two benchmarks. We also test the utility of MOSA-aligned dLLM on coding, math, and general reasoning. The results strongly prove the superiority of MOSA.
comment: Accepted for oral presentation at AAAI 2026
Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset
We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.
comment: Presented at EurIPS 2025 Workshop - Unifying Perspectives on Learning Biases (UPLB) https://sites.google.com/view/towards-a-unified-view
Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization
Multimodal summarization aims to generate a concise summary based on the input text and image. However, existing methods may produce non-factual output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios, i.e., a reference-based factuality evaluation framework and a reference-free factuality evaluation framework. Notably, the reference-free factuality evaluation framework does not need ground truth and hence applies to a wider range of scenarios. To evaluate the effectiveness of the proposed frameworks, we compute the correlation between our frameworks and the other metrics. The experimental results show the effectiveness of our proposed method. We will release our code and dataset via GitHub.
Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language
As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based transformer models, including multilingual (mBERT, XLM-RoBERTa, DistilBERT), regional (BanglaBERT, IndicBERT), and monolingual English (DeBERTaV3) variants on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for the Chakma language, and we release our dataset to encourage further research on multilingual language modeling for low-resource languages.
The Structure-Content Trade-off in Knowledge Graph Retrieval
Large Language Models (LLMs) increasingly rely on knowledge graphs for factual reasoning, yet how retrieval design shapes their performance remains unclear. We examine how question decomposition changes the retrieved subgraph's content and structure. Using a hybrid retrieval function that controls the importance of initial question and subquestions, we show that subquestion-based retrieval improves content precision, but yields disjoint subgraphs, while question-based retrieval maintains structure at the cost of relevance. Optimal performance arises between these extremes, revealing that balancing retrieval content and structure is key to effective LLM reasoning over structured knowledge.
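The abstract above does not spell out the hybrid retrieval function. A minimal sketch of one possible form, assuming a single weight that linearly interpolates between similarity to the original question and the best similarity to any subquestion; the bag-of-words cosine similarity and the names `hybrid_score` and `retrieve` are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter
from typing import List


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for a learned embedding)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(question: str, subquestions: List[str], fact: str, weight: float) -> float:
    """weight=1.0 scores by the question only; weight=0.0 by subquestions only."""
    sub = max((cosine(s, fact) for s in subquestions), default=0.0)
    return weight * cosine(question, fact) + (1.0 - weight) * sub


def retrieve(question: str, subquestions: List[str], facts: List[str],
             weight: float, k: int = 2) -> List[str]:
    """Return the k facts (e.g. verbalized triples) with the highest hybrid score."""
    ranked = sorted(
        facts,
        key=lambda f: hybrid_score(question, subquestions, f, weight),
        reverse=True,
    )
    return ranked[:k]
```

Sweeping `weight` between 0 and 1 is one way to probe the trade-off the abstract reports between content precision and subgraph connectivity.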
A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and examine the performance and cost policy of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine.
comment: Under review; 106 pages; 46 figures
Beyond Introspection: Reinforcing Thinking via Externalist Behavioral Feedback
While inference-time thinking allows Large Language Models (LLMs) to address complex problems, the extended thinking process can be unreliable or inconsistent because of the model's probabilistic nature, especially near its knowledge boundaries. Existing approaches attempt to mitigate this by having the model critique its own reasoning to make corrections. However, such self-critique inherits the same biases of the original output, known as the introspection illusion. Moving beyond such introspection and inspired by core methodologies in ethology, we propose an externalist three-step framework Distillation-Reinforcement-Reasoning (DRR). Rather than relying on a model's introspection, DRR evaluates its observable behaviors to provide corrective feedback. DRR first distills the reasoner's behavioral traces, then trains a lightweight, external Discriminative Model (DM). At inference time, this DM acts as a critic, identifying and rejecting suspicious reasoning steps. This external feedback compels the LLM to discard flawed pathways and explore alternatives, thereby enhancing reasoning quality without altering the base model. Experiments on multiple reasoning benchmarks show that our framework significantly outperforms prominent self-critique methods. Benefiting from a lightweight and annotation-free design, DRR offers a scalable and adaptable solution for improving the reliability of reasoning in a wide range of LLMs.
Evaluating Large Language Models for Radiology Natural Language Processing
The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty-two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.
CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis AAAI 2026
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
comment: Accepted in AAAI 2026
Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency
Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. We propose a token-position-aware layer-skipping framework that reduces the number of operations by 1.5x while maintaining performance. We first observe that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which leverages a power-law decay function, $\left\lfloor L \times \alpha^i \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, $D^3$ achieves success across a wide range of generation tasks for the first time. Experiments on large language models (i.e., Llama) with $7 \sim 70$ billion parameters show that $D^3$ can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance, with almost no performance drop ($<1\%$), on the GSM8K and BBH benchmarks.
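The decay rule quoted in the abstract above, floor(L × α^i), can be sketched directly. The values L = 32 and α = 0.99 below are illustrative assumptions, as is the lower bound of one retained layer; the abstract specifies neither, nor does it say which layers are skipped.

```python
import math


def layers_to_retain(total_layers: int, alpha: float, position: int,
                     min_layers: int = 1) -> int:
    """Number of layers kept for the token at `position`: floor(L * alpha**i).

    The `min_layers` floor is an assumption added here so that very late
    tokens never end up with zero layers.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if position < 0:
        raise ValueError("position must be non-negative")
    return max(min_layers, math.floor(total_layers * alpha ** position))


if __name__ == "__main__":
    # Hypothetical 32-layer model with alpha = 0.99.
    for i in (0, 10, 50, 100):
        print(i, layers_to_retain(32, 0.99, i))
```

With α close to 1 the schedule decays slowly, so early tokens use nearly the full depth and only later positions skip a substantial share of layers.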
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Recent advances in Large Multimodal Models (LMMs) have revolutionized their reasoning and Optical Character Recognition (OCR) capabilities. However, their complex logical reasoning performance on text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 2780 questions with two subsets, i.e., LogicOCR-Gen with 1100 multi-choice questions on generated images, and LogicOCR-Real with 1680 meticulously designed free-form questions on real-world images. For constructing LogicOCR-Gen, we first curate a text corpus from the Chinese National Civil Servant Examination, and customize an automatic pipeline to steer GPT-Image-1 to generate images with varied layouts and fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified. We evaluate a range of representative LMMs under Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. Moreover, we propose TextCue, a training-free method that enhances LMMs' perception of image regions containing important text cues for solving questions. We leverage LMMs' attention maps and an off-the-shelf text segmentation specialist to determine the region, which is then cropped and enlarged to augment the original image. Experiments show its effectiveness, e.g., a 1.8% accuracy gain over LLaVA-OV-1.5-8B under the CoT setting. Our benchmark is available at https://github.com/MiliLab/LogicOCR.
comment: GitHub: https://github.com/MiliLab/LogicOCR
Gram2Vec: An Interpretable Document Vectorizer
We present Gram2Vec, a grammatical style embedding system that embeds documents into a higher dimensional space by extracting the normalized relative frequencies of grammatical features present in the text. Compared to neural approaches, Gram2Vec offers inherent interpretability based on how the feature vectors are generated. In this paper, we use authorship verification and AI detection as two applications to show how Gram2Vec can be used. For authorship verification, we use the features from Gram2Vec to explain why a pair of documents is by the same or by different authors. We also demonstrate how Gram2Vec features can be used to train a classifier for AI detection, outperforming machine learning models trained on a comparable set of Biber features.
comment: 8 pages, 1 figure
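To illustrate the idea of embedding a document as normalized relative frequencies of grammatical features, here is a minimal sketch. It assumes the text has already been part-of-speech tagged and uses a small, hypothetical feature set (`POS_FEATURES`); the actual Gram2Vec feature inventory and tagging pipeline are not described in the abstract.

```python
from collections import Counter
from typing import Dict, Sequence

# Illustrative feature set; the real system's inventory is an unknown here.
POS_FEATURES = ("NOUN", "VERB", "ADJ", "ADV", "PRON")


def pos_frequency_vector(tags: Sequence[str]) -> Dict[str, float]:
    """Map a tagged document to relative frequencies of each tracked POS tag.

    Each dimension is named after the feature it counts, which is what makes
    the resulting vector directly interpretable.
    """
    counts = Counter(tags)
    total = len(tags)
    if total == 0:
        return {f"pos:{name}": 0.0 for name in POS_FEATURES}
    return {f"pos:{name}": counts[name] / total for name in POS_FEATURES}


if __name__ == "__main__":
    print(pos_frequency_vector(["NOUN", "VERB", "NOUN", "ADJ"]))
```

Because every dimension has a name, differences between two documents can be reported feature by feature, for example when explaining an authorship verification decision.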
On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. We then describe advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
comment: 45 pages, 4 figures, 9 tables
MTA: A Merge-then-Adapt Framework for Personalized Large Language Model
Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
AppSelectBench: Application-Level Tool Selection Benchmark
Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://microsoft.github.io/appselectbench/.
Harvesting Textual and Contrastive Data from the HAL Publication Repository
Authorship attribution in natural language processing traditionally struggles to distinguish genuine stylistic signals from topical confounds. While contrastive learning approaches have addressed this by maximizing semantic overlap between positive pairs, creating large-scale datasets under strict topic constraints remains challenging. We introduce HALvest, a 17-billion-token multilingual corpus harvested from 778k open-access academic papers, and HALvest-Contrastive, a derived dataset designed to isolate stylometric signals through controlled topic variation. Unlike prior work that minimizes lexical overlap, we exploit natural topic drift between papers by the same author, treating residual lexical patterns as authorial fingerprints rather than noise. Comparing lexical baselines (BM25) against neural models trained on unrestricted (topic-rich) versus base (topic-decoupled) triplets, we demonstrate that models trained exclusively on topic-decoupled data achieve superior performance across all test conditions, outperforming both retrieval baselines and models exposed to topic-rich training data. Our analysis reveals that while lexical signals provide substantial performance gains for keyword-driven methods, neural architectures learn robust stylometric representations that plateau with moderate context length, suggesting they capture distributional style beyond surface-level tokens. Both datasets and code are publicly available.
comment: New dataset version with only the contrastive learning data
PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese
Large Language Models (LLMs) exhibit significant variations in performance across linguistic and cultural contexts, underscoring the need for systematic evaluation in diverse languages. In this work, we present the most extensive evaluation of LLMs for the Portuguese language to date. Leveraging our newly introduced PoETa v2 benchmark -- a comprehensive suite of over 40 tasks in Portuguese -- we assess more than 20 models covering a broad spectrum of training scales and computational resources. Our study reveals how computational investment and language-specific adaptation impact performance in Portuguese, while also analyzing performance gaps in comparison to equivalent tasks in English. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. The benchmark is available at https://github.com/PoETaV2/PoETaV2.
Strong Memory, Weak Control: An Empirical Study of Executive Functioning in LLMs
Working memory, or the ability to hold and manipulate information in the mind, is a critical component of human intelligence and executive functioning. It is correlated with performance on various cognitive tasks, including measures of fluid intelligence, which encompasses reasoning and problem solving. We use a comprehensive set of classic working memory tasks to estimate the working memory capacity of large language models (LLMs). We find that in most cases, LLMs exceed normative human scores. However, we do not find that the increased capacity of working memory is associated with higher performance on other executive functioning tasks or problem solving benchmarks. These results suggest that LLMs may have deficits in attentional control and cognitive flexibility, which result in difficulties with inhibiting automatic responses and adapting to shifting information. Our findings suggest that current reasoning models have mixed results in compensating for these deficits.
ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models
Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.
Beyond the Rubric: Cultural Misalignment in LLM Benchmarks for Sexual and Reproductive Health
Large Language Models (LLMs) have been positioned as having the potential to expand access to health information in the Global South, yet their evaluation remains heavily dependent on benchmarks designed around Western norms. We present insights from a preliminary benchmarking exercise with a chatbot for sexual and reproductive health (SRH) for an underserved community in India. We evaluated using HealthBench, a benchmark for conversational health models by OpenAI. We extracted 637 SRH queries from the dataset and evaluated on the 330 single-turn conversations. Responses were evaluated using HealthBench's rubric-based automated grader, which rated responses consistently low. However, qualitative analysis by trained annotators and public health experts revealed that many responses were actually culturally appropriate and medically accurate. We highlight recurring issues, particularly a Western bias, such as for legal framing and norms (e.g., breastfeeding in public), diet assumptions (e.g., fish safe to eat during pregnancy), and costs (e.g., insurance models). Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts. We argue for the development of culturally adaptive evaluation frameworks that meet quality standards while recognizing needs of diverse populations.
comment: https://github.com/Sumon/healthbench-srh-eval/
Human-Computer Interaction
From Prediction to Foresight: The Role of AI in Designing Responsible Futures
In an era marked by rapid technological advancements and complex global challenges, responsible foresight has emerged as an essential framework for policymakers aiming to navigate future uncertainties and shape the future. Responsible foresight entails the ethical anticipation of emerging opportunities and risks, with a focus on fostering proactive, sustainable, and accountable future design. This paper coins the term "responsible computational foresight", examining the role of human-centric artificial intelligence and computational modeling in advancing responsible foresight, establishing a set of foundational principles for this new field and presenting a suite of AI-driven foresight tools currently shaping it. AI, particularly in conjunction with simulations and scenario analysis, enhances policymakers' ability to address uncertainty, evaluate risks, and devise strategies geared toward sustainable, resilient futures. However, responsible foresight extends beyond mere technical forecasting; it demands a nuanced understanding of the interdependencies within social, environmental, economic and political systems, alongside a commitment to ethical, long-term decision-making that supports human intelligence. We argue that AI will play a role as a supportive tool in responsible, human-centered foresight, complementing rather than substituting policymaker judgment to enable the proactive shaping of resilient and ethically sound futures. This paper advocates for the thoughtful integration of AI into foresight practices to empower policymakers and communities as they confront the grand challenges of the 21st century.
comment: Accessible at https://projecteuclid.org/journals/journal-of-artificial-intelligence-for-sustainable-development/volume-1/issue-1/From-Prediction-to-Foresight--The-Role-of-AI-in/10.69828/4d4kja.full
Self-Transparency Failures in Expert-Persona LLMs: A Large-Scale Behavioral Audit
If a language model cannot reliably disclose its AI identity in expert contexts, users cannot trust its competence boundaries. This study examines self-transparency in models assigned professional personas within high-stakes domains where false expertise risks user harm. Using a common-garden design, sixteen open-weight models (4B--671B parameters) were audited across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure initially, while a Neurosurgeon persona elicited only 3.5%. This creates preconditions for a "Reverse Gell-Mann Amnesia" effect, where transparency in some domains leads users to overgeneralize trust to contexts where disclosure fails. Disclosure ranged from 2.8% to 73.6%, with a 14B model reaching 61.4% while a 70B produced just 4.1%. Model identity predicted behavior better than parameter count ($ΔR_{adj}^{2} = 0.359$ vs 0.018). Reasoning optimization actively suppressed self-transparency in some models, with reasoning variants showing up to 48.4% lower disclosure than base counterparts. Bayesian validation with Rogan--Gladen correction confirmed robustness to measurement error ($κ= 0.908$). These findings demonstrate transparency reflects training factors rather than scale. Organizations cannot assume safety properties transfer to deployment contexts, requiring deliberate behavior design and empirical verification.
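The Rogan-Gladen correction mentioned in the abstract above adjusts an observed rate for the error of the instrument that measured it. A minimal sketch of the standard estimator follows, assuming the instrument is an automated disclosure classifier with known sensitivity and specificity; the function name and the numeric values are hypothetical and are not taken from the paper.

```python
def rogan_gladen(apparent_rate, sensitivity, specificity):
    """Rogan-Gladen estimator: correct an observed rate for classifier error.

    apparent_rate: fraction of trials the classifier labeled as positive.
    sensitivity:   P(classifier says positive | truly positive).
    specificity:   P(classifier says negative | truly negative).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("classifier must beat chance: sensitivity + specificity > 1")
    corrected = (apparent_rate + specificity - 1.0) / denom
    # The raw estimate can leave [0, 1] when error rates are large; clamp it.
    return min(1.0, max(0.0, corrected))


# Hypothetical values for illustration only.
estimate = rogan_gladen(apparent_rate=0.30, sensitivity=0.90, specificity=0.95)
print(round(estimate, 4))
```

With a perfect classifier (sensitivity and specificity both 1) the estimator returns the observed rate unchanged.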
MMA: A Momentum Mamba Architecture for Human Activity Recognition with Inertial Sensors
Human activity recognition (HAR) from inertial sensors is essential for ubiquitous computing, mobile health, and ambient intelligence. Conventional deep models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers have advanced HAR but remain limited by vanishing or exploding gradients, high computational cost, and difficulty in capturing long-range dependencies. Structured state-space models (SSMs) like Mamba address these challenges with linear complexity and effective temporal modeling, yet they are restricted to first-order dynamics without stable long-term memory mechanisms. We introduce Momentum Mamba, a momentum-augmented SSM that incorporates second-order dynamics to improve stability of information flow across time steps, robustness, and long-sequence modeling. Two extensions further expand its capacity: Complex Momentum Mamba for frequency-selective memory scaling. Experiments on multiple HAR benchmarks demonstrate consistent gains over vanilla Mamba and Transformer baselines in accuracy, robustness, and convergence speed. With only moderate increases in training cost, momentum-augmented SSMs offer a favorable accuracy-efficiency balance, establishing them as a scalable paradigm for HAR and a promising principal framework for broader sequence modeling applications.
comment: 14 pages, 5 pages
Seeing Twice: How Side-by-Side T2I Comparison Changes Auditing Strategies
While generative AI systems have gained popularity in diverse applications, their potential to produce harmful outputs limits their trustworthiness and utility. A small but growing line of research has explored tools and processes to better engage non-AI expert users in auditing generative AI systems. In this work, we present the design and evaluation of MIRAGE, a web-based tool exploring a "contrast-first" workflow that allows users to pick up to four different text-to-image (T2I) models, view their images side-by-side, and provide feedback on model performance on a single screen. In our user study with fifteen participants, we used four predefined models for consistency, with only a single model initially being shown. We found that most participants shifted from analyzing individual images to general model output patterns once the side-by-side step appeared with all four models; several participants coined persistent "model personalities" (e.g., cartoonish, saturated) that helped them form expectations about how each model would behave on future prompts. Bilingual participants also surfaced a language-fidelity gap, as English prompts produced more accurate images than Portuguese or Chinese, an issue often overlooked when dealing with a single model. These findings suggest that simple comparative interfaces can accelerate bias discovery and reshape how people think about generative models.
comment: 8 pages, 6 figures. Presented at ACM Collective Intelligence (CI), 2025. Available at https://ci.acm.org/2025/wp-content/uploads/101-Maldaner.pdf
Prune4Web: DOM Tree Pruning Programming for Web Agent
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
comment: Paper accepted to AAAI 2026
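The programmatic pruning idea in the Prune4Web abstract can be illustrated with a small scoring script of the kind an LLM might generate for one sub-task. This is only a sketch under stated assumptions: `DomElement`, `score_element`, `prune_dom`, the keyword weights, and the interactive-tag bonus are all hypothetical and do not reproduce the paper's actual scripts.

```python
from dataclasses import dataclass, field

# Hypothetical set of tags treated as likely action targets.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


@dataclass
class DomElement:
    """Simplified stand-in for a DOM node (hypothetical structure)."""
    tag: str
    text: str = ""
    attrs: dict = field(default_factory=dict)


def score_element(el, keywords):
    """Score one element by keyword hits in its text and attribute values."""
    text = el.text.lower()
    attr_blob = " ".join(str(v) for v in el.attrs.values()).lower()
    score = 0.0
    for kw in keywords:
        kw = kw.lower()
        if kw in text:
            score += 2.0  # visible text match (assumed weight)
        if kw in attr_blob:
            score += 1.0  # attribute match (assumed weight)
    # Small bonus for interactive tags, only if some keyword matched.
    if score > 0 and el.tag in INTERACTIVE_TAGS:
        score += 0.5
    return score


def prune_dom(elements, keywords, top_k=3):
    """Keep the top_k elements with a positive score, best first."""
    scored = [(score_element(el, keywords), el) for el in elements]
    kept = [pair for pair in scored if pair[0] > 0]
    # Python's sort is stable, so ties keep their document order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [el for _, el in kept[:top_k]]


page = [
    DomElement("div", "Welcome to our store"),
    DomElement("button", "Add to cart", {"id": "add-cart"}),
    DomElement("a", "View cart", {"href": "/cart"}),
    DomElement("input", "", {"name": "search", "placeholder": "Search products"}),
    DomElement("p", "Free shipping on orders over $50"),
]
candidates = prune_dom(page, keywords=["cart"])
print([el.tag for el in candidates])
```

Only the short candidate list, rather than the full DOM, would then be passed on for grounding.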
TALES: A Taxonomy and Analysis of Cultural Representations in LLM-generated Stories
Millions of users across the globe turn to AI chatbots for their creative needs, inviting widespread interest in understanding how such chatbots represent diverse cultures. At the same time, evaluating cultural representations in open-ended tasks remains challenging and underexplored. In this work, we present TALES, an evaluation of cultural misrepresentations in LLM-generated stories for diverse Indian cultural identities. First, we develop TALES-Tax, a taxonomy of cultural misrepresentations by collating insights from participants with lived experiences in India through focus groups (N=9) and individual surveys (N=15). Using TALES-Tax, we evaluate 6 models through a large-scale annotation study spanning 2,925 annotations from 108 annotators with lived cultural experience from across 71 regions in India and 14 languages. Concerningly, we find that 88% of the generated stories contain one or more cultural inaccuracies, and such errors are more prevalent in mid- and low-resourced languages and stories based in peri-urban regions in India. Lastly, we transform the annotations into TALES-QA, a standalone question bank to evaluate the cultural knowledge of foundational models. Through this evaluation, we surprisingly discover that models often possess the requisite cultural knowledge despite generating stories rife with cultural misrepresentations.
Bug Detective and Quality Coach: Developers' Mental Models of AI-Assisted IDE Tools
AI-assisted tools support developers in performing cognitively demanding tasks such as bug detection and code readability assessment. Despite the advancements in the technical characteristics of these tools, little is known about how developers mentally model them and how mismatches affect trust, control, and adoption. We conducted six co-design workshops with 58 developers to elicit their mental models about AI-assisted bug detection and readability features. It emerged that developers conceive bug detection tools as "bug detectives", which warn users only in case of critical issues, guaranteeing transparency, actionable feedback, and confidence cues. Readability assessment tools, on the other hand, are envisioned as "quality coaches", which provide contextual, personalized, and progressive guidance. Trust, in both tasks, depends on the clarity of explanations, timing, and user control. A set of design principles for Human-Centered AI in IDEs has been distilled, aiming to balance disruption with support, conciseness with depth, and automation with human agency.
Generative AI Compensates for Age-Related Cognitive Decline in Decision Making: Preference-Aligned Recommendations Reduce Choice Difficulty
Due to age-related declines in memory, processing speed, working memory, and executive functions, older adults experience difficulties in decision making when situations require novel choices, probabilistic judgments, rapid responses, or extensive information search. This study examined whether using generative AI during decision making enhances choice satisfaction and reduces choice difficulty among older adults. A total of 130 participants (younger: 56; older: 74) completed a music-selection task under AI-use and AI-nonuse conditions across two contexts: previously experienced (road trip) and not previously experienced (space travel). In the AI-nonuse condition, participants generated candidate options from memory; in the AI-use condition, GPT-4o presented options tailored to individual preferences. To assess cognitive function, we also administered the Wechsler Adult Intelligence Scale-Fourth Edition. Results revealed that in the AI-nonuse condition, older adults with lower cognitive function reported higher choice difficulty and lower choice satisfaction. Under the AI-use condition, choice satisfaction did not change significantly, but perceived choice difficulty decreased significantly in both age groups. Moreover, AI use attenuated the associations observed among older adults between lower cognitive function and both greater difficulty and lower satisfaction. These findings indicate that preference-aligned option recommendations generated by AI can compensate for age-related constraints on information search, thereby reducing perceived choice difficulty without diminishing satisfaction.
QuadStretcher: A Forearm-Worn Skin Stretch Display for Bare-Hand Interaction in AR/VR
The paradigm of bare-hand interaction has become increasingly prevalent in Augmented Reality (AR) and Virtual Reality (VR) environments, propelled by advancements in hand tracking technology. However, a significant challenge arises in delivering haptic feedback to users' hands, due to the necessity for the hands to remain bare. In response to this challenge, recent research has proposed an indirect solution of providing haptic feedback to the forearm. In this work, we present QuadStretcher, a skin stretch display featuring four independently controlled stretching units surrounding the forearm. While achieving rich haptic expression, our device also eliminates the need for a grounding base on the forearm by using a pair of counteracting tactors, thereby reducing bulkiness. To assess the effectiveness of QuadStretcher in facilitating immersive bare-hand experiences, we conducted a comparative user evaluation (n = 20) with a baseline solution, Squeezer. The results confirmed that QuadStretcher outperformed Squeezer in terms of expressing force direction and heightening the sense of realism, particularly in 3-DoF VR interactions such as pulling a rubber band, hooking a fishing rod, and swinging a tennis racket. We further discuss the design insights gained from qualitative user interviews, presenting key takeaways for future forearm-haptic systems aimed at advancing AR/VR bare-hand experiences.
comment: ACM CHI 2024
STAR: Smartphone-analogous Typing in Augmented Reality
While text entry is an essential and frequent task in Augmented Reality (AR) applications, devising an efficient and easy-to-use text entry method for AR remains an open challenge. This research presents STAR, a smartphone-analogous AR text entry technique that leverages a user's familiarity with smartphone two-thumb typing. With STAR, a user performs thumb typing on a virtual QWERTY keyboard that is overlain on the skin of their hands. During an evaluation study of STAR, participants achieved a mean typing speed of 21.9 WPM (i.e., 56% of their smartphone typing speed), and a mean error rate of 0.3% after 30 minutes of practice. We further analyze the major factors implicated in the performance gap between STAR and smartphone typing, and discuss ways this gap could be narrowed.
comment: ACM UIST 2023
Lattice Menu: A Low-Error Gaze-Based Marking Menu Utilizing Target-Assisted Gaze Gestures on a Lattice of Visual Anchors
We present Lattice Menu, a gaze-based marking menu utilizing a lattice of visual anchors that helps perform accurate gaze pointing for menu item selection. Users who know the location of the desired item can leverage target-assisted gaze gestures for multilevel item selection by looking at visual anchors over the gaze trajectories. Our evaluation showed that Lattice Menu exhibits a considerably low error rate (~1%) and a quick menu selection time (1.3-1.6 s) for expert usage across various menu structures (4 x 4 x 4 and 6 x 6 x 6) and sizes (8, 10 and 12°). In comparison with a traditional gaze-based marking menu that does not utilize visual targets, Lattice Menu showed remarkably (~5 times) fewer menu selection errors for expert usage. In a post-interview, all 12 subjects preferred Lattice Menu, and most subjects (8 out of 12) commented that the provisioning of visual targets facilitated more stable menu selections with reduced eye fatigue.
comment: ACM CHI 2022
Human-Centered Artificial Social Intelligence (HC-ASI)
As artificial intelligence systems become increasingly integrated into human social contexts, Artificial Social Intelligence (ASI) has emerged as a critical capability that enables AI to perceive, understand, and engage meaningfully in complex human social interactions. This chapter introduces a comprehensive framework for Human-Centered Artificial Social Intelligence (HC-ASI), built upon the Technology-Human Factors-Ethics (THE) Triangle, which systematically addresses both technical foundations and human-centered design principles necessary for developing socially intelligent AI systems. The chapter also provides a comprehensive overview of current ASI research. It begins by establishing the theoretical foundations of ASI, tracing its evolution from classical psychological theories of human social intelligence to contemporary computational models, then examines the mechanisms underlying human-AI social interaction with particular emphasis on establishing shared social understanding and appropriate role positioning. The chapter further explores ASI's practical implications for individuals and groups through comprehensive evaluation frameworks that combine technical benchmarks with human-centered experiential assessments, demonstrating real-world applications through detailed case studies spanning healthcare, companionship, education, and customer service domains. Building on the overview and the framework of HC-ASI, this chapter articulates core HC-ASI design principles and translates them into actionable methodologies and implementation guidelines that provide practical guidance for researchers and practitioners. This chapter concludes with a critical discussion of current challenges and promising directions for developing comprehensive HC-ASI ecosystems.
comment: Book chapter preprint
LOOM: Personalized Learning Informed by Daily LLM Conversations Toward Long-Term Mastery via a Dynamic Learner Memory Graph
Foundation models are increasingly used to personalize learning, yet many systems still assume fixed curricula or coarse progress signals, limiting alignment with learners' day-to-day needs. At the other extreme, lightweight incidental systems offer flexible, in-the-moment content but rarely guide learners toward mastery. Prior work privileges either continuity (maintaining a plan across sessions) or initiative (reacting to the moment), not both, leaving learners to navigate the trade-off between recency and trajectory: immediate relevance versus cumulative, goal-aligned progress. We present LOOM, an agentic pipeline that infers evolving learner needs from recent LLM conversations and a dynamic learner memory graph, then assembles coherent learning materials personalized to the learner's current needs, priorities, and understanding. These materials link adjacent concepts and surface gaps as tightly scoped modules that cumulatively advance broader goals, providing guidance and sustained progress while remaining responsive to new interests. We describe LOOM's end-to-end architecture and working prototype, including conversation summarization, topic planning, course generation, and graph-based progress tracking. In a formative study with ten participants, users reported that LOOM's generated lessons felt relevant to their recent activities and helped them recognize knowledge gaps, though they also highlighted needs for greater consistency and control. We conclude with design implications for more robust, mixed-initiative learning pipelines that integrate structured learner modelling with everyday LLM interactions.
comment: Accepted to the PerFM Workshop at AAAI 2026
PileUp: A Tufting Approach to Soft, Tactile, and Volumetric E-Textile Interfaces
We present PileUp, a tufted pile e-textile sensing approach that offers unique affordances through the tactile expressiveness and richness of its continuous, threaded-volume construction. By integrating conductive yarns in looped or cut pile forms, PileUp transforms soft 3-dimensional textiles into multimodal sensors capable of detecting mechanical deformations such as pressure, bending, and strain, as well as environmental conditions like moisture. We propose a design space that outlines the relationships between texture, form factor, and sensing affordances of tufted textiles. We characterize electrical responses under compression, bending, and strain, reporting sensor behaviors. To demonstrate versatility, we present three application scenarios in which PileUp sensors are seamlessly integrated into soft fabrics: a meditation rug with multi-zone sensing, a fleece sleeve that detects arm motion, and a moisture-sensing wall art. Our results establish tufting as an accessible yet expressive fabrication method for creating integrated sensing textiles, distinguishing our work from traditional flat textile sensors.
comment: Twentieth International Conference on Tangible, Embedded, and Embodied Interaction (TEI '26)
Facilitating the Integration of LLMs Into Online Experiments With Simple Chat
As large language models (LLMs) become increasingly prevalent, understanding human-LLM interactions is emerging as a central priority in psychological research. Online experiments offer an efficient means to study human-LLM interactions, yet integrating LLMs into established survey platforms remains technically demanding, particularly when aiming for ecologically valid, real-time conversational experiences with strong experimental control. We introduce Simple Chat, an open-source, research-focused chat interface that streamlines LLM integration for platforms such as Qualtrics, oTree, and LimeSurvey, while presenting a unified participant experience across conditions. Simple Chat connects to both commercial providers and open-weights models, supports streaming responses to preserve conversational flow, and offers an administrative interface for fine-grained control of prompts and interface features. By reducing technical barriers, standardizing interfaces, and improving participant experience, Simple Chat helps advance the study of human-LLM interaction. In this article, we outline Simple Chat's key features, provide a step-by-step tutorial, and demonstrate its utility through two illustrative case studies.
Computer Vision and Pattern Recognition
RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
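The rubric-based reward described in the RubricRL abstract can be pictured as a weighted aggregate of independently judged criteria. The sketch below assumes each criterion is scored in [0, 1] and that prompt-adaptive weights are already available; the function name, criterion names, scores, and weights are hypothetical, and the abstract does not specify the paper's exact aggregation rule.

```python
def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores.

    scores:  criterion name -> judge score in [0, 1].
    weights: criterion name -> non-negative prompt-specific weight.
    Returns the scalar reward and the per-criterion contributions,
    so the signal stays inspectable.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    contributions = {
        name: weights[name] * scores[name] / total_weight for name in scores
    }
    return sum(contributions.values()), contributions


# Hypothetical rubric for a prompt with no text to render: OCR stays in
# the rubric, and realism is switched off by the (assumed) weighting step.
scores = {
    "object_correctness": 1.0,
    "attribute_accuracy": 0.5,
    "ocr_fidelity": 0.0,
    "realism": 1.0,
}
weights = {
    "object_correctness": 2.0,
    "attribute_accuracy": 1.0,
    "ocr_fidelity": 1.0,
    "realism": 0.0,
}
reward, breakdown = rubric_reward(scores, weights)
print(reward, breakdown)
```

Because the contributions are returned alongside the scalar, a user could raise or zero out a single weight to change what the policy is rewarded for.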
MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
comment: Project Page: https://infinity-rope.github.io/
Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
comment: Project webpage: https://diverse-video.github.io/
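The two ingredients named in the DPP-GRPO abstract can each be shown in a few lines. The sketch below assumes every generated video is summarized by an embedding vector, uses the determinant of a cosine-similarity kernel as a set-level diversity score, and standardizes rewards within a group in the GRPO style. The abstract does not say how the paper combines the two, so they are kept separate here; all function names, the choice of a cosine kernel, and the use of the population standard deviation are assumptions.

```python
import math


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular: the set contains redundant items
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cosine_kernel(embeddings):
    """Gram matrix of unit-normalized embeddings (assumed DPP kernel)."""
    units = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        units.append([x / norm for x in v])
    return [[sum(p * q for p, q in zip(u, w)) for w in units] for u in units]


def diversity_score(embeddings):
    """det(L): 1 for mutually orthogonal samples, 0 for duplicates."""
    return determinant(cosine_kernel(embeddings))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


print(diversity_score([[1.0, 0.0], [0.0, 1.0]]))  # distinct samples
print(diversity_score([[1.0, 0.0], [2.0, 0.0]]))  # redundant samples
print(group_relative_advantages([1.0, 2.0, 3.0]))
```

The determinant shrinks toward zero as samples become more similar, which is the diminishing-returns behavior the abstract attributes to the DPP term.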
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
comment: Tech report. Project page: https://nvlabs.github.io/LocateAnything3D/
3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, which are then integrated with features from the MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.
comment: 3D-aware Multi-task Learning, Cross-view Correlations, Code will be available at https://github.com/WeiHongLee/CrossView3DMTL
PixelDiT: Pixel Diffusion Transformers for Image Generation
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
Vision-Language Memory for Spatial Reasoning
Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
Concept-Aware Batch Sampling Improves Language-Image Pretraining
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
comment: Tech Report
Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition
Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
MotionV2V: Editing Motion in a Video
While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
ShapeGen: Towards High-Quality 3D Shape Synthesis SIGGRAPH
Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.
comment: Accepted to SIGGRAPH Asia 2025
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.
Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities
This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals, each performing 204 load-reaching tasks from different load positions while adopting various lifting and handling techniques. The model inputs consisted of the 3D hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. The transformer architecture, with a root-mean-square error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. This study demonstrates the merit of neural networks that capture time-series dependencies in 3D motion frames, providing a unique approach for understanding and predicting motion dynamics during manual material handling activities.
comment: 10 pages, 6 figures, 7 tables
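The constant-segment-length idea in the abstract above can be illustrated with a short sketch. The abstract does not state the form of the cost function, so the penalty below (mean squared deviation of predicted segment lengths from reference lengths, added to a position error with a hypothetical weight `lam`) is an assumption made for illustration, as are the function names and the toy three-joint skeleton.

```python
import numpy as np

def segment_lengths(joints, segments):
    """Lengths of body segments.

    joints:   array of shape (T, J, 3), 3D joint coordinates per frame.
    segments: list of (parent, child) joint-index pairs (hypothetical skeleton).
    Returns an array of shape (T, S).
    """
    parents = [p for p, _ in segments]
    children = [c for _, c in segments]
    return np.linalg.norm(joints[:, children] - joints[:, parents], axis=-1)

def posture_loss(pred, target, segments, ref_lengths, lam=1.0):
    """Position MSE plus a penalty on deviation from fixed segment lengths."""
    position_term = np.mean((pred - target) ** 2)
    length_term = np.mean((segment_lengths(pred, segments) - ref_lengths) ** 2)
    return position_term + lam * length_term

# Toy example: a 3-joint chain (e.g. shoulder-elbow-wrist) over 2 frames.
segments = [(0, 1), (1, 2)]
target = np.array([
    [[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.3, 0.25, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.3, 0.0], [0.25, 0.3, 0.0]],
])
ref_lengths = segment_lengths(target, segments)[0]  # lengths from the first frame
loss_exact = posture_loss(target, target, segments, ref_lengths)

stretched = target.copy()
stretched[:, 2] *= 1.2  # move the wrist outward, lengthening the forearm
loss_stretched = posture_loss(stretched, target, segments, ref_lengths)
```

In this sketch a prediction that preserves the reference segment lengths incurs no length penalty, while one that stretches a segment is penalized even if its position error is small.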
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, we aim to solve the inconsistency problem of generated images with a reference-guided post-editing approach, which we call ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
comment: Project page: https://ouyangziheng.github.io/ImageCritic-Page/
Optimization of Sums of Bivariate Functions: An Introduction to Relaxation-Based Methods for the Case of Finite Domains
We study the optimization of functions with $n>2$ arguments that have a representation as a sum of several functions that have only $2$ of the $n$ arguments each, termed sums of bivariates, on finite domains. The complexity of optimizing sums of bivariates is shown to be NP-equivalent and it is shown that there exists a free lunch in the optimization of sums of bivariates. Based on measure-valued extensions of the objective function, so-called relaxations, $\ell^2$-approximation, and entropy-regularization, we derive several tractable problem formulations solvable with linear programming, coordinate ascent as well as with closed-form solutions. The limits of applying tractable versions of such relaxations to sums of bivariates are investigated using general results for reconstructing measures from their bivariate marginals. Experiments in which the derived algorithms are applied to random functions, vertex coloring, and signal reconstruction problems provide insights into qualitatively different function classes that can be modeled as sums of bivariates.
comment: 59 pages, 7 figures
Latent Diffusion Inversion Requires Understanding the Latent Space
The recovery of training data from generative models (``model inversion'') has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacks significantly improves performance, with average AUROC gains of 2.7\% and substantial increases in TPR@1\%FPR (6.42\%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokémon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.
comment: 14 pages, 4 figures, 4 tables
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.
DINO-Tok: Adapting DINO for Visual Tokenizers
Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256$\times$256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and comparable to billion-level data trained models (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.
comment: 27 pages, 13 figures, 13 tables, Project Page: https://sqwu.top/ReaDe/
PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding
While recent video generation models have achieved significant visual fidelity, they often suffer from a lack of explicit physical controllability and plausibility. To address this, some recent studies have attempted to guide video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox
Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only $2.1\%$ of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are coming soon.
New York Smells: A Large Multimodal Dataset for Olfaction
While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured ``in the wild.'' Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70$\times$ more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
comment: Project website at https://smell.cs.columbia.edu
Automated Monitoring of Cultural Heritage Artifacts Using Semantic Segmentation
This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN-based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
comment: Keywords: Cultural Heritage, Monitoring, Deep Learning, U-Nets, Semantic Segmentation
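The segmentation metrics named in the abstract above can be computed with a few lines of NumPy. The sketch below is a generic illustration, not the paper's evaluation code; the function names, the toy masks, and the convention that two empty masks count as a perfect match are assumptions. Note that per-class IoU and the Jaccard index are the same quantity, and that Dice relates to it as Dice = 2J / (1 + J).

```python
import numpy as np

def iou_and_dice(pred, target):
    """IoU (Jaccard index) and Dice coefficient for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention assumed here: two empty masks count as a perfect match.
    iou = intersection / union if union > 0 else 1.0
    dice = 2 * intersection / total if total > 0 else 1.0
    return float(iou), float(dice)

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for integer label maps."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    scores = [iou_and_dice(pred == c, target == c)[0] for c in range(num_classes)]
    return float(np.mean(scores))

# Toy 3x3 label maps: 1 = crack, 0 = background.
pred = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
target = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
crack_iou, crack_dice = iou_and_dice(pred == 1, target == 1)
miou = mean_iou(pred, target, num_classes=2)
```

For thin structures such as cracks, the foreground class usually occupies few pixels, so the crack-class IoU tends to be much lower than the background IoU and dominates differences in mIoU.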
Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models ICML
Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification, and uses an image-captioning task to illustrate the framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based, and bullet-point-based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% in preliminary experiments on a curated dataset mixing samples from Google Landmarks v2, Conceptual Captions, and COCO Captions, revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgeable multimodal systems.
comment: Accepted as poster at NewInML Workshop ICML, 2025
Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.
comment: 11 pages, 4 figures, 6 tables
HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing over 40% attention sharing, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
comment: Project Page: https://dahlian00.github.io/AlignBench/
DesignPref: Capturing Personal Preferences in Visual Design Generation
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of the importance of various design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
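As background for the agreement statistic quoted above, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix as alpha = 1 - D_o / D_e. The function name and the toy ratings are assumptions made for illustration; they are not taken from DesignPref, and the authors' exact computation may differ.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of rating lists, one list per rated item (e.g. one
    pairwise comparison), holding the labels given by the raters who saw it.
    Items with fewer than two ratings carry no agreement information.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of ratings from different raters contributes 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    # Note: `expected` is zero (and alpha undefined) if only one label ever occurs.
    return 1.0 - observed / expected


# Hypothetical binary preferences: 4 comparisons, 3 raters each (1 = prefers design A).
toy_units = [[1, 1, 1], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
alpha = krippendorff_alpha_nominal(toy_units)
print(f"alpha = {alpha:.3f}")
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.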
A Physics-Informed Loss Function for Boundary-Consistent and Robust Artery Segmentation in DSA Sequences
Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel Physics-Informed Loss (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at https://github.com/irfantahir301/Physicsis_loss.
Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification
Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.
comment: 10 pages, 9 figures, and 3 tables
Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features
Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
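The two feature families described above can be illustrated with a minimal sketch: finite-difference velocity and acceleration for local joint dynamics, plus a frequency-domain descriptor for rhythm. Everything here is an assumption for illustration: the synthetic wrist trajectory, the function names, and the use of a naive discrete Fourier transform as a dependency-free stand-in for an FFT. The paper's actual feature set is not specified in the abstract.

```python
import cmath
import math


def finite_difference(series, dt):
    """First-order finite difference of a 1-D sequence sampled every dt seconds."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]


def dft_magnitudes(series):
    """Magnitudes of DFT bins 0..n/2 (naive O(n^2) stand-in for an FFT)."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(series, fps):
    """Frequency in Hz of the strongest non-DC component."""
    mags = dft_magnitudes(series)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * fps / len(series)


# Hypothetical input: vertical wrist position oscillating at 2 Hz,
# sampled at 32 frames per second for 2 seconds (64 frames).
fps = 32
wrist_y = [math.sin(2 * math.pi * 2.0 * t / fps) for t in range(64)]

velocity = finite_difference(wrist_y, 1.0 / fps)
acceleration = finite_difference(velocity, 1.0 / fps)

features = {
    "mean_speed": sum(abs(v) for v in velocity) / len(velocity),
    "mean_abs_accel": sum(abs(a) for a in acceleration) / len(acceleration),
    "dominant_hz": dominant_frequency(wrist_y, fps),
}
print(features)
```

In a real pipeline the per-joint feature vectors would be concatenated and passed to a lightweight classifier; that step is omitted here.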
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
comment: 21 pages
Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20%~44% in speed.
comment: 17 pages, 8 figures
Learning to Generate Human-Human-Object Interactions from Textual Descriptions
The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOIs dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train text-to-HOI and text-to-HHI models using score-based diffusion models. Finally, we present a unified generative framework that integrates the two individual models, capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.
comment: Project Page: https://tlb-miss.github.io/hhoi/
Object-Centric Vision Token Pruning for Vision Language Models
In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume excessive computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.
BRIC: Bridging Kinematic Plans and Physical Control at Test Time
We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
Block Cascading: Training Free Acceleration of Block-Causal Video Models
Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
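The speedup figures above follow directly from the reported frame rates. The short sketch below only redoes that arithmetic; the dictionary keys and helper names are ours, and the per-frame time is simply the reciprocal of the reported throughput, not a measured latency.

```python
def speedup(before_fps, after_fps):
    """Throughput ratio between the accelerated and the baseline pipeline."""
    return after_fps / before_fps


def ms_per_frame(fps):
    """Average time per generated frame implied by a throughput in FPS."""
    return 1000.0 / fps


# Frame rates as reported in the abstract: (block-causal baseline, Block Cascading).
reported = {"1.3B": (16.0, 30.0), "14B": (4.5, 12.5)}

speedups = {name: speedup(before, after) for name, (before, after) in reported.items()}

for name, (before, after) in reported.items():
    print(
        f"{name}: {speedups[name]:.2f}x "
        f"({ms_per_frame(before):.1f} ms -> {ms_per_frame(after):.1f} ms per frame)"
    )
```

The 1.3B model comes out slightly under 2x (30 / 16 = 1.875) and the 14B model well above it (12.5 / 4.5 is about 2.78), so "~2x" reads as a rough summary across scales.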
VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning
Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
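To make the chain from physical attributes to modal parameters to acoustic signals concrete, here is a toy sketch using a two-mass spring system instead of a volumetric 3D mesh. It is an assumption-laden illustration, not the dataset's pipeline: the system, the closed-form eigenfrequencies, the single shared damping coefficient, and all function names are ours. It solves K x = w^2 M x by hand and synthesizes an impact sound as a sum of exponentially damped sinusoids.

```python
import math


def modal_frequencies_two_mass(k, m):
    """Eigenfrequencies (Hz) of two equal masses m, each tied to a wall and to
    each other by springs of stiffness k.

    With M = m * I and K = [[2k, -k], [-k, 2k]], the generalized eigenproblem
    K x = w^2 M x has solutions w^2 = k/m (in-phase) and w^2 = 3k/m (out-of-phase).
    """
    return [
        math.sqrt(k / m) / (2 * math.pi),
        math.sqrt(3 * k / m) / (2 * math.pi),
    ]


def impact_sound(freqs_hz, gains, damping, sample_rate, duration):
    """Sum of damped sinusoids, one per mode; `damping` is a decay rate in 1/s."""
    n = int(sample_rate * duration)
    return [
        sum(
            g * math.exp(-damping * t / sample_rate)
            * math.sin(2 * math.pi * f * t / sample_rate)
            for f, g in zip(freqs_hz, gains)
        )
        for t in range(n)
    ]


# Hypothetical parameters: stiffness in N/m, mass in kg.
freqs = modal_frequencies_two_mass(k=4.0e6, m=1.0)
signal = impact_sound(freqs, gains=[1.0, 0.5], damping=20.0,
                      sample_rate=8000, duration=0.25)
print([round(f, 1) for f in freqs], len(signal))
```

For a real object the stiffness and mass matrices would come from a finite-element discretization using density, Young's modulus, and Poisson's ratio, and the eigenproblem would be solved numerically.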
StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections
Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving 11.6% HOTA improvement at 1 Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must offer stylistic diversity, fine-grained detail, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be released at https://github.com/LongHZ140516/MajutsuCity.
comment: 13 pages, 6 figures
Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs
Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.
A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control
Multi-ID customization is an interesting topic in computer vision and has attracted considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability. The generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since the existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable to or even better than that of training-based multi-ID customization methods.
FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers
Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet-$512^2$ show that FREE achieves up to $1.86 \times$ acceleration, and FREE (relax) further reaches $2.25 \times$ speedup while maintaining high perceptual and quantitative fidelity in generation quality.
VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild
Reconstructing topologically consistent facial geometry is crucial for digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, i.e., VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.
From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations AAAI 2026
Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment. In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
comment: Accepted by AAAI 2026
GS-Checker: Tampering Localization for 3D Gaussian Splatting AAAI2026
Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious manipulation of 3D content. To avoid such malicious applications, localizing tampered regions becomes crucial. In this paper, we propose GS-Checker, a novel method for locating tampered areas in 3DGS models. Our approach integrates a 3D tampering attribute into the 3D Gaussian parameters to indicate whether the Gaussian has been tampered with. Additionally, we design a 3D contrastive mechanism by comparing the similarity of key attributes between 3D Gaussians to seek tampering cues at the 3D level. Furthermore, we introduce a cyclic optimization strategy to refine the 3D tampering attribute, enabling more accurate tampering localization. Notably, our approach does not require expensive 3D labels for supervision. Extensive experimental results demonstrate the effectiveness of our proposed method in locating the tampered 3DGS area.
comment: Accepted by AAAI2026
Thinking in 360°: Humanoid Visual Search in the Wild
Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin
3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.
comment: 8 pages, 5 figures. Submitted to IEEE Intelligent Vehicles Symposium (IV) 2026 for possible publication
AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction at metric scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation as well as 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
comment: Project page: https://hengyiwang.github.io/projects/amber
ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation
Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf monitoring and product alignment. In this paper, we present a deep learning framework that predicts a 4-point parameterized homography matrix to rectify shelf images captured from arbitrary angles. Our model leverages a ConvNeXt-based backbone for enhanced feature representation and adopts normalized coordinate regression for improved stability. To address data scarcity and promote generalization, we introduce a novel augmentation strategy by modeling and sampling synthetic homographies. Our method achieves a mean corner error of 1.298 pixels on the test set. When compared with both classical computer vision and deep learning-based approaches, our method demonstrates competitive performance in both accuracy and inference speed. Together, these results establish our approach as a robust and efficient solution for real-world single-view rectification. To encourage further research in this domain, we will make our dataset, ShelfRectSet, and code publicly available.
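As a rough illustration of the mean corner error metric and the normalized coordinate regression mentioned in this abstract, the sketch below maps four predicted corners from a normalized range back to pixels and averages their distance to the ground-truth corners. The function names and the [-1, 1] normalization convention are assumptions made for illustration; the abstract does not specify them.

```python
import math

def denormalize(corners_norm, width, height):
    """Map corners from an ASSUMED [-1, 1] range to pixel coordinates."""
    return [((x + 1.0) * 0.5 * width, (y + 1.0) * 0.5 * height)
            for x, y in corners_norm]

def mean_corner_error(pred, target):
    """Average Euclidean distance (in pixels) over the four corners."""
    if len(pred) != 4 or len(target) != 4:
        raise ValueError("expected exactly four corners in each list")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / 4.0

# Hypothetical usage: a 100x50 image whose predicted corners are the image corners.
pred_px = denormalize([(-1, -1), (1, -1), (1, 1), (-1, 1)], width=100, height=50)
gt_px = [(x + 3.0, y + 4.0) for x, y in pred_px]  # every corner off by (3, 4)
error = mean_corner_error(pred_px, gt_px)
```

Under this convention, a reported mean corner error of 1.298 pixels would mean the four predicted corners lie on average about 1.3 pixels from their targets.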
3D Motion Perception of Binocular Vision Target with PID-CNN
This article trains a network for perceiving the three-dimensional motion of a binocular vision target. The network provides real-time three-dimensional coordinates, velocity, and acceleration, and has a basic spatiotemporal perception capability. We interpret the ability of neural networks to fit nonlinear problems from the perspective of PID, viewing a single-layer neural network as a second-order difference equation combined with a nonlinearity that describes a local problem. Multilayer networks gradually transform the raw representation into the desired representation through multiple such combinations. We analyse some reference principles for designing neural networks and design a relatively small PID convolutional neural network with a total of 17 layers and 413 thousand parameters. We implement a simple but practical feature reuse method based on concatenation and pooling. The network is trained and tested on simulated datasets of randomly moving balls, and the experimental results show that the prediction accuracy is close to the upper limit that the input image resolution can represent. We analyse the experimental results and errors, as well as the existing shortcomings and possible directions for improvement. Finally, we discuss the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization, as well as the potential advantages of using PID information to implement memory and attention mechanisms.
comment: 7 pages, 9 figures, 2 tables
ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation
Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided diffusion for precise and interpretable manipulation. An Affordance Memory Bank continually accumulates successful execution episodes and propagates part-level actionable affordances to unseen articulated parts and configurations. Extensive experiments on ArtiBench show that our ArtiBrain significantly outperforms state-of-the-art multimodal and diffusion-based methods in robustness and generalization. Code and dataset will be released upon acceptance.
AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models
End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to "dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection
Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (e.g., day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.
comment: 10 pages, 5 figures
TReFT: Taming Rectified Flow Models For One-Step Image Translation
Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works with pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.
TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection
Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides spatial localization but provides limited constraints on the temporal semantic transitions. Consequently, they often produce spatially coherent predictions while still suffering from unresolved semantic inconsistencies. To address this limitation, we propose TaCo, a spatio-temporal semantic consistent network, which enriches the existing mask-supervised framework with a spatio-temporal semantic joint constraint. TaCo conceptualizes change as a semantic transition between bi-temporal states, in which one temporal feature representation can be derived from the other via dedicated transition features. To realize this, we introduce a Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct the cross-temporal transition features. In addition, we propose a spatio-temporal semantic joint constraint consisting of bi-temporal reconstruct constraints and a transition constraint: the former enforces alignment between reconstructed and original features, while the latter enhances discrimination for changes. This design can yield substantial performance gains without introducing any additional computational overhead during inference. Extensive experiments on six public datasets, spanning both binary and semantic change detection tasks, demonstrate that TaCo consistently achieves SOTA performance.
CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation
In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.
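To make the idea of Fisher-guided module selection more concrete, here is a minimal sketch that scores each candidate module by the empirical diagonal Fisher information (the mean squared gradient of its parameters) and keeps the top-k. This is one common approximation and an assumption on our part; the abstract does not give the exact scoring rule, and the module names and gradient values below are hypothetical.

```python
def fisher_scores(per_sample_grads):
    """Score modules by mean squared gradient (empirical diagonal Fisher).

    per_sample_grads maps a module name to a list of per-sample gradient
    vectors, each a flat list of floats.
    """
    scores = {}
    for name, samples in per_sample_grads.items():
        total = sum(g * g for grads in samples for g in grads)
        count = sum(len(grads) for grads in samples)
        scores[name] = total / count
    return scores

def select_top_k(scores, k):
    """Return the k module names with the largest scores, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical gradients for three toolbox modules at one layer.
grads = {
    "spatial": [[1.0, 1.0], [1.0, 1.0]],
    "semantic": [[0.1, 0.1]],
    "frequency": [[2.0, 0.0]],
}
active = select_top_k(fisher_scores(grads), k=2)
```

In this toy setting the frequency and spatial modules would be activated, while the semantic module, whose gradients are small, stays frozen.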
Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction
Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) it is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature; (ii) the substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we design an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies a boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.
Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing image-based CFE methods lack the capacity to generate temporally coherent, smooth, and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features: 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, and 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar, and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.
Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement ICCV 2025
Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
comment: ICCV 2025 Physics-IQ Challenge Third Place Solution
SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors
Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.
comment: 11 pages, 5 figures, 10 tables
DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion AAAI 2026
Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffer from limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of State Space Models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC encounters several challenges: directly serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. In addition, overlooking the design of domain-agnostic representation learning hinders adaptation performance. To address these issues, we propose a novel framework, DAPointMamba for DA PCC, that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.
comment: Accepted to AAAI 2026
ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
Until recently, the general corpus of CLIP-type foundation models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as a single-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data, and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground action, object, and relations, filled by manual and automatic curation. We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and fine-tuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) knowledge. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
comment: Data & Code: this https URL
DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection
Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.
Advancing Image Classification with Discrete Diffusion Classification Modeling
Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm .
Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization
Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
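For readers unfamiliar with weight averaging, the sketch below shows the basic mechanism a WA-based teacher could rely on: a running uniform average of student checkpoints. It is a generic illustration rather than the MBCD implementation; the class name is hypothetical, and parameters are simplified to scalar floats keyed by name.

```python
class WeightAverager:
    """Running uniform average of parameter snapshots (generic WA sketch)."""

    def __init__(self):
        self.avg = None  # averaged parameters, keyed by name
        self.n = 0       # number of snapshots seen so far

    def update(self, weights):
        """Fold one snapshot into the average and return the current average."""
        self.n += 1
        if self.avg is None:
            self.avg = dict(weights)
        else:
            for name, w in weights.items():
                # Incremental mean: avg_n = avg_{n-1} + (w - avg_{n-1}) / n
                self.avg[name] += (w - self.avg[name]) / self.n
        return self.avg

# Hypothetical usage: three student snapshots of a single scalar weight.
teacher = WeightAverager()
for snapshot in ({"w": 1.0}, {"w": 3.0}, {"w": 5.0}):
    averaged = teacher.update(snapshot)
```

The incremental form avoids storing every checkpoint, which is why it is a common choice when the averaged model is queried during training.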
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface
Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .
PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling
Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
comment: Technical Report
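The sampling step described in the PromptMoG abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that the mixture components are centred on embeddings of reformulated prompts, that all components share one isotropic standard deviation `sigma`, and that the weights are uniform unless given. The function name `sample_mog_embedding` and all parameters are hypothetical.

```python
import random

def sample_mog_embedding(means, sigma, weights=None, rng=random):
    """Draw one embedding from an isotropic mixture of Gaussians.

    means   -- list of component centres (assumed: embeddings of reformulated prompts)
    sigma   -- shared standard deviation (assumed isotropic)
    weights -- mixture weights; uniform if omitted
    rng     -- the random module or a random.Random instance
    """
    if weights is None:
        weights = [1.0 / len(means)] * len(means)
    # Step 1: pick a mixture component according to its weight.
    k = rng.choices(range(len(means)), weights=weights, k=1)[0]
    # Step 2: perturb that component's centre with Gaussian noise.
    embedding = [m + rng.gauss(0.0, sigma) for m in means[k]]
    return k, embedding
```

Passing a seeded `random.Random` as `rng` makes the draws reproducible. A larger `sigma` spreads samples further from each centre, which is the knob that would trade semantic fidelity for diversity in a sketch like this.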
Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation
Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
HistoSpeckle-Net: Mutual Information-Guided Deep Learning for high-fidelity reconstruction of complex OrganAMNIST images via perturbed Multimode Fibers
Existing deep learning methods in multimode fiber (MMF) imaging often focus on simpler datasets, limiting their applicability to complex, real-world imaging tasks. These models are typically data-intensive, a challenge that becomes more pronounced when dealing with diverse and complex images. In this work, we propose HistoSpeckle-Net, a deep learning architecture designed to reconstruct structurally rich medical images from MMF speckles. To build a clinically relevant dataset, we develop an optical setup that couples laser light through a spatial light modulator (SLM) into an MMF, capturing output speckle patterns corresponding to input OrganAMNIST images. Unlike previous MMF imaging approaches, which have not considered the underlying statistics of speckles and reconstructed images, we introduce a distribution-aware learning strategy. We employ a histogram-based mutual information loss to enhance model robustness and reduce reliance on large datasets. Our model includes a histogram computation unit that estimates smooth marginal and joint histograms for calculating mutual information loss. It also incorporates a unique Three-Scale Feature Refinement Module, which leads to multiscale Structural Similarity Index Measure (SSIM) loss computation. Together, these two loss functions enhance both the structural fidelity and statistical alignment of the reconstructed images. Our experiments on the complex OrganAMNIST dataset demonstrate that HistoSpeckle-Net achieves higher fidelity than baseline models such as U-Net and Pix2Pix. It gives superior performance even with limited training samples and across varying fiber bending conditions. By effectively reconstructing complex anatomical features with reduced data and under fiber perturbations, HistoSpeckle-Net brings MMF imaging closer to practical deployment in real-world clinical environments.
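To make the histogram-based mutual information idea concrete, the sketch below estimates mutual information between two images from their joint intensity histogram. It is only an illustration: the abstract describes smooth (and presumably differentiable) histograms, whereas this version uses hard binning, assumes intensities in [0, 1], and uses a hypothetical function name. A training loss would typically be the negative of this quantity.

```python
import math

def mutual_information(x, y, bins=8):
    """Mutual information (in nats) between two equally long intensity lists in [0, 1]."""
    n = len(x)

    def to_bin(v):
        # Map an intensity to a bin index, clamping 1.0 into the last bin.
        return min(int(v * bins), bins - 1)

    # Joint histogram, normalised to a probability table.
    joint = [[0.0] * bins for _ in range(bins)]
    for a, b in zip(x, y):
        joint[to_bin(a)][to_bin(b)] += 1.0 / n

    # Marginal histograms are the row and column sums of the joint table.
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(bins)) for j in range(bins)]

    # MI = sum over non-empty cells of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            p = joint[i][j]
            if p > 0.0:
                mi += p * math.log(p / (px[i] * py[j]))
    return mi
```

When the reconstruction matches the target exactly, the estimate equals the entropy of the binned target; when the reconstruction carries no information about the target (for example, a constant image), it is zero.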
V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs
Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V's intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate a source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available at https://github.com/Summu77/V-Attack.
comment: 21 pages
Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder MICCAI 2025
The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model's performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
comment: Accepted by the International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2025 conference
Text-guided Controllable Diffusion for Realistic Camouflage Images Generation AAAI 2026
Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.
comment: Accepted by AAAI 2026
CostNav: A Navigation Benchmark for Cost-Aware Evaluation of Embodied Agents
Existing navigation benchmarks focus on task success metrics while overlooking economic viability -- critical for commercial deployment of autonomous delivery robots. We introduce \emph{CostNav}, a \textbf{Micro-Navigation Economic Testbed} that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. \textbf{To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability}, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0\% SLA compliance but is \emph{not} commercially viable: yielding a loss of \$30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7\% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
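The "no finite break-even point" statement in the CostNav abstract follows from simple arithmetic: if each run loses money, no number of runs can recover the upfront investment. The sketch below illustrates that reasoning only. The function name, the upfront cost, and the split of the per-run loss into revenue and cost are hypothetical values chosen for illustration, not figures from the paper; only the net loss of 30.009 per run comes from the abstract.

```python
import math

def break_even_runs(upfront_cost, revenue_per_run, cost_per_run):
    """Runs needed to recover upfront_cost, or math.inf if the margin is not positive."""
    margin = revenue_per_run - cost_per_run
    if margin <= 0:
        # Every run loses money (or earns nothing), so the deficit never closes.
        return math.inf
    return math.ceil(upfront_cost / margin)

# Hypothetical split whose difference gives a per-run loss of about 30.009.
losing = break_even_runs(upfront_cost=1000.0, revenue_per_run=1.0, cost_per_run=31.009)

# Hypothetical profitable configuration for contrast.
profitable = break_even_runs(upfront_cost=1000.0, revenue_per_run=5.0, cost_per_run=3.0)
```

Because the abstract attributes 99.7% of per-run costs to collision-induced maintenance, reducing `cost_per_run` through collision avoidance is the change most likely to turn the margin positive in a model of this kind.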
OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation
Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. We jointly train OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, and extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
Robust 3D Brain MRI Inpainting with Random Masking Augmentation MICCAI 2025
The ASNR-MICCAI BraTS-Inpainting Challenge was established to mitigate dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. This paper details our submission to the 2025 challenge, a novel deep learning framework for synthesizing healthy tissue in 3D scans. The core of our method is a U-Net architecture trained to inpaint synthetically corrupted regions, enhanced with a random masking augmentation strategy to improve generalization. Quantitative evaluation confirmed the efficacy of our approach, yielding an SSIM of 0.873$\pm$0.004, a PSNR of 24.996$\pm$4.694, and an MSE of 0.005$\pm$0.087 on the validation set. On the final online test set, our method achieved an SSIM of 0.919$\pm$0.088, a PSNR of 26.932$\pm$5.057, and an RMSE of 0.052$\pm$0.026. This performance secured first place in the BraTS-Inpainting 2025 challenge and surpassed the winning solutions from the 2023 and 2024 competitions on the official leaderboard.
comment: Accepted by the International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2025 conference
GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA
The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM's attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.
Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis
Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware
Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.
comment: 17 pages, 14 figures, 3 tables
ADNet: A Large-Scale and Extensible Multi-Domain Benchmark for Anomaly Detection Across 380 Real-World Categories
Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of categories, limiting the evaluation of cross-context generalization and scalability. We introduce ADNet, a large-scale, multi-domain benchmark comprising 380 categories aggregated from 49 publicly available datasets across Electronics, Industry, Agrifood, Infrastructure, and Medical domains. The benchmark includes a total of 196,294 RGB images, consisting of 116,192 normal samples for training and 80,102 test images, of which 60,311 are anomalous. All images are standardized with MVTec-style pixel-level annotations and structured text descriptions spanning both spatial and visual attributes, enabling multimodal anomaly detection tasks. Extensive experiments reveal a clear scalability challenge: existing state-of-the-art methods achieve 90.6% I-AUROC in one-for-one settings but drop to 78.5% when scaling to all 380 categories in a multi-class setting. To address this, we propose Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. It achieves 83.2% I-AUROC and 93.1% P-AUROC, demonstrating superior performance over existing approaches. ADNet is designed as a standardized and extensible benchmark, supporting the community in expanding anomaly detection datasets across diverse domains and providing a scalable foundation for future anomaly detection foundation models. Dataset: https://grainnet.github.io/ADNet
While recognizing actions, LMMs struggle to detect core interaction events
Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.
Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.
SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
comment: 15 pages, 10 figures
Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving
Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
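The training objective described in the MAP-World abstract, an expectation of the semantic loss over trajectory modes weighted by trajectory probabilities, can be contrasted with single-mode selection in a few lines. This is a sketch under assumptions: mode probabilities are taken to be a softmax over per-mode scores, per-mode losses are given as plain numbers, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_mode_loss(mode_logits, mode_losses):
    """Loss as an expectation over modes, using mode probabilities as path weights."""
    probs = softmax(mode_logits)
    return sum(p * l for p, l in zip(probs, mode_losses))

def best_mode_loss(mode_logits, mode_losses):
    """Baseline for contrast: keep only the loss of the highest-scoring mode."""
    k = max(range(len(mode_logits)), key=lambda i: mode_logits[i])
    return mode_losses[k]
```

With the expectation, every candidate trajectory contributes to the loss in proportion to its probability, whereas the single-mode baseline ignores all alternatives to the top-scoring mode.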
Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data
The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease progression. Most existing imaging-based disease prediction models operate in Euclidean space, which assumes a flat representation of data and fails to fully capture the intrinsic continuity and nonlinear geometric structure of irregularly sampled longitudinal images. To address the challenge of modeling Alzheimer's disease (AD) progression from irregularly sampled longitudinal structural Magnetic Resonance Imaging (sMRI) data, we propose a Riemannian manifold mapping, a Time-aware manifold Neural ordinary differential equation, and an Attention-based riemannian Gated recurrent unit (R-TNAG) framework. Our approach first projects features extracted from high-dimensional sMRI into a manifold space to preserve the intrinsic geometry of disease progression. On this representation, a time-aware Neural Ordinary Differential Equation (TNODE) models the continuous evolution of latent states between observations, while an Attention-based Riemannian Gated Recurrent Unit (ARGRU) adaptively integrates historical and current information to handle irregular intervals. This joint design improves temporal consistency and yields robust AD trajectory prediction under irregular sampling. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art models in both disease status prediction and cognitive score regression. Ablation studies verify the contributions of each module, highlighting their complementary roles in enhancing predictive accuracy. Moreover, the model exhibits stable performance across varying sequence lengths and missing data rates, indicating strong temporal generalizability. Cross-dataset validation further confirms its robustness and applicability in diverse clinical settings.
comment: 10 pages, 3 figures
Restora-Flow: Mask-Guided Image Restoration with Flow Matching WACV 2026
Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.
comment: Accepted for WACV 2026
Hybrid Convolution and Frequency State Space Network for Image Compression
Learned image compression (LIC) has recently benefited from Transformer-based and state space model (SSM)-based architectures. Convolutional neural networks (CNNs) effectively capture local high-frequency details, whereas Transformers and SSMs provide strong long-range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work, we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high-frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long-range low-frequency information. The VFSS block combines an Omni-directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content-adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency-aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate-distortion performance compared with recent SSM-based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD-rate over the VTM anchor by 18.06%, 24.56% and 22.44%, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.
comment: 36 pages, 8 figures
Vision-Language Models for Automated 3D PET/CT Report Generation
Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports with 245,509 paired PET/CT slices), and construct AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.
UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dispersion leads to quality degradation, and repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines by a large margin across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, UltraViCo improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
comment: Project page: https://thu-ml.github.io/UltraViCo.github.io/
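The decay idea described in the UltraViCo abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that "beyond the training window" means keys at least `train_window` positions away from the query, and that the constant factor `alpha` is applied to the softmax weights before renormalizing. The function name and all parameters are hypothetical.

```python
import math

def decayed_attention(logits, query_pos, train_window, alpha):
    """Softmax over keys, down-weighting keys outside the training window.

    logits: raw attention scores, one per key position.
    Keys with |query_pos - key_pos| >= train_window are scaled by alpha
    (an assumed rule, not taken from the paper).
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    scaled = [
        e * (alpha if abs(query_pos - k) >= train_window else 1.0)
        for k, e in enumerate(exps)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]

# Eight keys with equal scores; the last four lie outside a window of 4.
weights = decayed_attention([0.0] * 8, query_pos=0, train_window=4, alpha=0.5)
```

With equal scores, the in-window keys end up with twice the weight of the out-of-window keys, so the attention mass stays concentrated on the positions seen during training.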
LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening
Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.
Multi Head Attention Enhanced Inception v3 for Cardiomegaly Detection
Novel imaging technologies have significantly changed healthcare, both in the diagnosis of cardiovascular diseases and in the visualization of structural abnormalities such as cardiomegaly. This article presents an integrated approach that combines deep learning and attention mechanisms for automatic detection of cardiomegaly from X-ray images. The project begins with a data collection phase that gathers annotated X-ray images of various types. A preprocessing module then improves image quality so that the proposed system can make the best use of the data. The proposed system is a CNN that uses the Inception V3 model as one of its key blocks, augmented with a multi-layer attention mechanism to strengthen the model. The most important feature of the method is the multi-head attention mechanism, which learns features automatically. By selectively focusing on specific regions of the input, the model can identify cardiomegaly with high sensitivity. Attention scores are calculated, replicated, and applied to enhance the representation of the most relevant data, which supports a successful diagnosis. The evaluation stage rigorously assesses the model using measures such as accuracy and precision, to validate that the model can identify cardiomegaly and to show the clinical significance of the method. The model achieves an accuracy of 95.6, precision of 95.2, recall of 96.2, sensitivity of 95.7, specificity of 96.1, and an Area Under Curve (AUC) of 96.0; the corresponding graphs are plotted for visualisation.
Cloud4D: Estimating Cloud Properties at a High Spatial and Temporal Resolution NeurIPS 2025
There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models operate at kilometer scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\%$) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.
comment: NeurIPS 2025 Spotlight, project page: https://cloud4d.jacob-lin.com/
Rethinking the Learning Paradigm for Facial Expression Recognition
Due to the subjective crowdsourcing annotations and the inherent inter-class similarity of facial expressions, the real-world Facial Expression Recognition (FER) datasets usually exhibit ambiguous annotation. To simplify the learning paradigm, most previous methods convert ambiguous annotation results into precise one-hot annotations and train FER models in an end-to-end supervised manner. In this paper, we rethink the existing training paradigm and propose that it is better to use weakly supervised strategies to train FER models with original ambiguous annotation.
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. In particular, two dominant families of techniques have emerged: i) multi-modal large language models (LLMs), which demonstrate impressive ability for multi-modal understanding; and ii) diffusion models, which exhibit remarkable capabilities for multi-modal generation. This paper therefore provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. Last but not least, we present several challenging future research directions which may contribute to the ongoing advancement of multi-modal generative AI.
comment: 21 pages, 10 figures, 3 tables
Localizing Knowledge in Diffusion Transformers
Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-alpha, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions NeurIPS 2025
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
comment: Project homepage: https://flageval-baai.github.io/LRM-Eval/ This work will also be presented at NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models (FoRLM); update with trials on Gemini 3 Pro
AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations
AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
comment: 8 pages, 6 figures. Code and datasets available at http://autofocus-il.github.io/
Personalized Generative Low-light Image Denoising and Enhancement
Modern cameras' performance in low-light conditions remains suboptimal due to fundamental limitations in photon shot noise and sensor read noise. Generative image restoration methods have shown promising results compared to traditional approaches, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Leveraging the availability of personalized photo galleries of the users, we introduce Diffusion-based Personalized Generative Denoising (DiffPGD), a new approach that builds a customized diffusion model for individual users. Our key innovation lies in the development of an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer serves as a robust prior that can be seamlessly integrated into the diffusion model to restore degraded images without the need for fine-tuning. Over a wide range of low-light testing scenarios, we show that DiffPGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches. Our project page can be found at https://genai-restore.github.io/DiffPGD/.
ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection
Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: first, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation. Our code is available at https://github.com/Ruize-Ma/ConceptGuard.
LoRA-based methods on Unet for transfer learning in Subarachnoid Hematoma Segmentation
Aneurysmal subarachnoid hemorrhage (SAH) is a life-threatening neurological emergency with mortality rates exceeding 30%. Transfer learning from related hematoma types represents a potentially valuable but underexplored approach. Although Unet architectures remain the gold standard for medical image segmentation due to their effectiveness on limited datasets, Low-Rank Adaptation (LoRA) methods for parameter-efficient transfer learning have been rarely applied to convolutional neural networks in medical imaging contexts. We implemented a Unet architecture pre-trained on computed tomography scans from 124 traumatic brain injury patients across multiple institutions, then fine-tuned on 30 aneurysmal SAH patients from the University of Michigan Health System using 3-fold cross-validation. We developed a novel CP-LoRA method based on tensor CP-decomposition and introduced DoRA variants (DoRA-C, convDoRA, CP-DoRA) that decompose weight matrices into magnitude and directional components. We compared these approaches against existing LoRA methods (LoRA-C, convLoRA) and standard fine-tuning strategies across different modules on a multi-view Unet model. LoRA-based methods consistently outperformed standard Unet fine-tuning. Performance varied by hemorrhage volume, with all methods showing improved accuracy for larger volumes. CP-LoRA achieved comparable performance to existing methods while using significantly fewer parameters. Over-parameterization with higher ranks consistently yielded better performance than strictly low-rank adaptations. This study demonstrates that transfer learning between hematoma types is feasible and that LoRA-based methods significantly outperform conventional Unet fine-tuning for aneurysmal SAH segmentation.
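To give a feel for why a CP-style factorization can use far fewer parameters than a dense update, the sketch below counts parameters for a single convolution kernel. It relies only on the general definition of a rank-r CP decomposition of a 4-way tensor (one factor matrix per mode); how the paper's CP-LoRA parameterizes its update is not specified in the abstract, so the function names and the layer sizes are illustrative assumptions.

```python
def full_conv_params(c_out, c_in, k_h, k_w):
    """Parameters in a dense update of a conv kernel of shape (c_out, c_in, k_h, k_w)."""
    return c_out * c_in * k_h * k_w

def cp_update_params(c_out, c_in, k_h, k_w, rank):
    """Parameters in a rank-`rank` CP factorization: one factor matrix per tensor mode."""
    return rank * (c_out + c_in + k_h + k_w)

# Hypothetical 3x3 convolution with 64 input and 64 output channels.
full = full_conv_params(64, 64, 3, 3)
low_rank = cp_update_params(64, 64, 3, 3, rank=4)
```

For this hypothetical layer, the dense update has 36,864 parameters while the rank-4 factorization has 536, which also shows why raising the rank remains cheap relative to full fine-tuning.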
CGCE: Classifier-Guided Concept Erasure in Generative Models
Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
comment: 26 pages, 17 figures
ExDDV: A New Dataset for Explainable Deepfake Detection in Video WACV 2026
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
comment: Accepted at WACV 2026
When to Think and When to Look: Uncertainty-Guided Lookback
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
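The uncertainty trigger behind such a decoding strategy can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes the uncertainty signal is the Shannon entropy of the next-token distribution, uses a fixed threshold, and omits the breadth search entirely. The prompt text, function names, and threshold are hypothetical.

```python
import math

LOOKBACK_PROMPT = "Let me look at the image again."  # hypothetical phrase

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maybe_lookback(probs, threshold):
    """Return the lookback prompt when entropy exceeds the threshold, else None."""
    return LOOKBACK_PROMPT if token_entropy(probs) > threshold else None

# A peaked distribution stays below the threshold; a uniform one exceeds it.
confident = maybe_lookback([0.97, 0.01, 0.01, 0.01], threshold=1.0)
uncertain = maybe_lookback([0.25, 0.25, 0.25, 0.25], threshold=1.0)
```

In a real decoder the returned phrase would be appended to the running chain before generation continues, so that the model is nudged back to the image only when it is unsure.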
IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection
The rapid development of Artificial Intelligence Generated Content (AIGC) techniques has enabled the creation of high-quality synthetic content, but it also raises significant security concerns. Current detection methods face two major limitations: (1) the lack of multidimensional explainable datasets for generated images and videos. Existing open-source datasets (e.g., WildFake, GenVideo) rely on oversimplified binary annotations, which restrict the explainability and trustworthiness of trained detectors. (2) Prior MLLM-based forgery detectors (e.g., FakeVLM) exhibit insufficiently fine-grained interpretability in their step-by-step reasoning, which hinders reliable localization and explanation. To address these challenges, we introduce Ivy-Fake, the first large-scale multimodal benchmark for explainable AIGC detection. It consists of over 106K richly annotated training samples (images and videos) and 5,000 manually verified evaluation examples, sourced from multiple generative models and real world datasets through a carefully designed pipeline to ensure both diversity and quality. Furthermore, we propose Ivy-xDetector, a reinforcement learning model based on Group Relative Policy Optimization (GRPO), capable of producing explainable reasoning chains and achieving robust performance across multiple synthetic content detection benchmarks. Extensive experiments demonstrate the superiority of our dataset and confirm the effectiveness of our approach. Notably, our method improves performance on GenImage from 86.88% to 96.32%, surpassing prior state-of-the-art methods by a clear margin.
comment: 30 pages
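For readers unfamiliar with GRPO, the group-relative advantage it is commonly described with can be sketched in a few lines: each sampled response is scored against the mean and standard deviation of its own group. The reward values below are made up, the `eps` term is an assumption added to avoid division by zero, and whether population or sample standard deviation is used is an implementation detail not stated in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, two judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.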
FastGS: Training 3D Gaussian Splatting in 100 Seconds
The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32$\times$ training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45$\times$ acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7$\times$ training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at https://fastgs.github.io/
comment: Project page: https://fastgs.github.io/
Scalable FPGA Framework for Real-Time Denoising in High-Throughput Imaging: A DRAM-Optimized Pipeline using High-Level Synthesis
High-throughput imaging workflows, such as Parallel Rapid Imaging with Spectroscopic Mapping (PRISM), generate data at rates that exceed conventional real-time processing capabilities. We present a scalable FPGA-based preprocessing pipeline for real-time denoising, implemented via High-Level Synthesis (HLS) and optimized for DRAM-backed buffering. Our architecture performs frame subtraction and averaging directly on streamed image data, minimizing latency through burst-mode AXI4 interfaces. The resulting kernel operates below the inter-frame interval, enabling inline denoising and reducing dataset size for downstream CPU/GPU analysis. Validated under PRISM-scale acquisition, this modular FPGA framework offers a practical solution for latency-sensitive imaging workflows in spectroscopy and microscopy.
comment: FPGA-based denoising pipeline for PRISM-scale imaging. Real-time frame subtraction and averaging via burst-mode AXI4 and DRAM buffering. Benchmarked against CPU/GPU workflows; scalable across multi-bank FPGA setups. Acknowledgements revised for consistency with journal submission; scientific content remains unchanged
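The arithmetic performed by such a pipeline can be shown in software form. The sketch below is plain Python rather than HLS code, and it assumes that "frame subtraction" means subtracting a fixed background frame from each incoming frame before averaging; the abstract does not say which frame is subtracted, so that choice and all names here are illustrative.

```python
def subtract_and_average(frames, background):
    """Subtract a background frame from each streamed frame, then average.

    frames: iterable of equally sized 2D lists (rows of pixel values).
    background: 2D list of the same shape (assumed reference frame).
    """
    rows, cols = len(background), len(background[0])
    acc = [[0.0] * cols for _ in range(rows)]
    count = 0
    for frame in frames:  # single pass, mirroring a streaming pipeline
        for r in range(rows):
            for c in range(cols):
                acc[r][c] += frame[r][c] - background[r][c]
        count += 1
    if count == 0:
        raise ValueError("no frames supplied")
    return [[v / count for v in row] for row in acc]

background = [[10, 10], [10, 10]]
frames = [[[12, 11], [10, 14]], [[14, 11], [10, 10]]]
denoised = subtract_and_average(frames, background)
```

Because only one accumulator frame is kept, N input frames collapse into a single output frame, which is the source of the dataset-size reduction mentioned in the abstract.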
CardioComposer: Leveraging Differentiable Geometry for Compositional Control of Anatomical Diffusion Models
Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllability and realism. We propose CardioComposer: a programmable, inference-time framework for generating multi-class anatomical label maps based on interpretable ellipsoidal primitives. These primitives represent geometric attributes such as the size, shape, and position of discrete substructures. We specifically develop differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during diffusion model sampling. We demonstrate that these losses can constrain individual geometric attributes in a disentangled manner and provide compositional control over multiple substructures. Finally, we show that our method is compatible with a wide array of anatomical systems containing non-convex substructures, spanning cardiac, vascular, and skeletal organs.
comment: 10 pages, 16 figures
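Voxel-wise geometric moments of the kind the abstract mentions can be computed directly from a label map. The sketch below is an assumption-laden illustration rather than the authors' code: it treats the zeroth moment as size, the first moment as position, and the second central moment as a proxy for ellipsoidal shape, and it accepts soft weights in [0, 1] because a differentiable version would operate on probabilities. The function name and data layout are hypothetical.

```python
def geometric_moments(voxels):
    """Zeroth, first, and second central moments of a (soft) voxel mask.

    voxels: dict mapping (x, y, z) -> weight in [0, 1].
    Returns (volume, centroid, covariance), with covariance as a 3x3 list.
    """
    volume = sum(voxels.values())
    if volume == 0:
        raise ValueError("empty mask")
    centroid = [
        sum(w * p[i] for p, w in voxels.items()) / volume for i in range(3)
    ]
    cov = [
        [
            sum(w * (p[i] - centroid[i]) * (p[j] - centroid[j])
                for p, w in voxels.items()) / volume
            for j in range(3)
        ]
        for i in range(3)
    ]
    return volume, centroid, cov

# A 2x2x2 cube of fully occupied voxels.
cube = {(x, y, z): 1.0 for x in range(2) for y in range(2) for z in range(2)}
volume, centroid, cov = geometric_moments(cube)
```

A guidance loss could then penalize the gap between these measurements and the values implied by a target ellipsoid, one attribute at a time.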
Segmentation-Aware Generative Reinforcement Network (GRN) for Tissue Layer Segmentation in 3-D Ultrasound Images for Chronic Low-back Pain (cLBP) Assessment
We introduce a novel segmentation-aware joint training framework called generative reinforcement network (GRN) that integrates segmentation loss feedback to optimize both image generation and segmentation performance in a single stage. An image enhancement technique called segmentation-guided enhancement (SGE) is also developed, where the generator produces images tailored specifically for the segmentation model. Two variants of GRN were also developed, including GRN for sample-efficient learning (GRN-SEL) and GRN for semi-supervised learning (GRN-SSL). GRN's performance was evaluated using a dataset of 69 fully annotated 3D ultrasound scans from 29 subjects. The annotations included six anatomical structures: dermis, superficial fat, superficial fascial membrane (SFM), deep fat, deep fascial membrane (DFM), and muscle. Our results show that GRN-SEL with SGE reduces labeling efforts by up to 70% while achieving a 1.98% improvement in the Dice Similarity Coefficient (DSC) compared to models trained on fully labeled datasets. GRN-SEL alone reduces labeling efforts by 60%, GRN-SSL with SGE decreases labeling requirements by 70%, and GRN-SSL alone by 60%, all while maintaining performance comparable to fully supervised models. These findings suggest the effectiveness of the GRN framework in optimizing segmentation performance with significantly less labeled data, offering a scalable and efficient solution for ultrasound image analysis and reducing the burdens associated with data annotation.
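The Dice Similarity Coefficient reported above follows the standard definition, twice the overlap divided by the total size of the two masks. A minimal sketch on flat binary masks is shown below; the function name is illustrative, and returning 1.0 when both masks are empty is a common convention rather than something stated in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks given as flat lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks are treated as a perfect match (assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total

dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

Here one voxel overlaps out of two foreground voxels in each mask, giving a DSC of 0.5.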
OceanGym: A Benchmark Environment for Underwater Embodied Agents
We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
comment: Work in progress
Target-aware Image Editing via Cycle-consistent Constraints
Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an "intermediate state" and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
CLIP-IT: CLIP-based Pairing for Histology Images Classification
Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, available free unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, keeping overhead low while still benefiting from multimodal training without requiring paired data in the downstream dataset. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of per-dataset paired annotation or inference-time complexity.
LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
Harnessing Vision-Language Models for Time Series Anomaly Detection AAAI 2026
Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and sensor-based condition monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal understanding capacity that human experts have to identify contextual anomalies. To fill this gap, we explore a solution based on vision language models (VLMs). Recent studies have shown the ability of VLMs for visual understanding tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution, with (1) ViT4TS, a vision-screening stage built on a relatively lightweight pre-trained vision encoder, which leverages 2D time series representations to accurately localize candidate anomalies; (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM's visual understanding capacity to refine the detection upon the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pre-trained and from-scratch baselines in most cases, yielding a 24.6% improvement in F1-max score over the best baseline. Moreover, VLM4TS also consistently outperforms existing language model-based TSAD methods and is on average 36x more efficient in token usage.
comment: Accepted at AAAI 2026 (Oral)
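As a rough illustration of the F1-max score mentioned in the abstract above, the sketch below sweeps a decision threshold over per-point anomaly scores and keeps the best F1. The point-wise formulation, the `>=` thresholding rule, and the function name are assumptions; the paper may use an event-based or otherwise adjusted variant.

```python
def f1_max(scores, labels):
    """Best point-wise F1 over all thresholds taken from the observed scores.

    scores: anomaly scores, higher means more anomalous.
    labels: ground-truth flags, 1 for anomaly and 0 for normal.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    best = 0.0
    for threshold in sorted(set(scores)):
        # A point is flagged as anomalous when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # F1 is zero at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2.0 * precision * recall / (precision + recall))
    return best
```

Because the threshold is chosen with access to the labels, F1-max measures how well the scores separate anomalies from normal points rather than the quality of any particular operating threshold.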
Detecting Cultural Differences in News Video Thumbnails via Computational Aesthetics
We propose a two-step approach for detecting differences in the style of images across sources of differing cultural affinity, where images are first clustered into finer visual themes based on content before their aesthetic features are compared. We test this approach on 2,400 YouTube video thumbnails taken equally from two U.S. and two Chinese YouTube channels, and relating equally to COVID-19 and the Ukraine conflict. Our results suggest that while Chinese thumbnails are less formal and more candid, U.S. channels tend to use more deliberate, proper photographs as thumbnails. In particular, U.S. thumbnails are less colorful, more saturated, darker, more finely detailed, less symmetric, sparser, less varied, and more up close and personal than Chinese thumbnails. We suggest that most of these differences reflect cultural preferences, and that our methods and observations can serve as a baseline against which suspected visual propaganda can be computed and compared.
comment: ICWSM'24 Workshop
RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness NeurIPS 2025
Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging these models into one universal model, which provides multi-task ability without data leakage, has gained popularity. With the expansion in data and model size, parameter-efficient tuning has become the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze merging from the perspective of low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We further find that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to verify the performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness of our method. Code is available at https://github.com/AuroraZengfh/RobustMerge.
comment: NeurIPS 2025 (Spotlight) Fix some typos
Adversarial Robustness for Unified Multi-Modal Encoders via Efficient Calibration
Recent unified multi-modal encoders align a wide range of modalities into a shared representation space, enabling diverse cross-modal tasks. Despite their impressive capabilities, the robustness of these models under adversarial perturbations remains underexplored, which is a critical concern for safety-sensitive applications. In this work, we present the first comprehensive study of adversarial vulnerability in unified multi-modal encoders. We find that even mild adversarial perturbations lead to substantial performance drops across all modalities. Non-visual inputs, such as audio and point clouds, are especially fragile, while visual inputs like images and videos also degrade significantly. To address this, we propose an efficient adversarial calibration framework that improves robustness across modalities without modifying pretrained encoders or semantic centers, ensuring compatibility with existing foundation models. Our method introduces modality-specific projection heads trained solely on adversarial examples, while keeping the backbone and embeddings frozen. We explore three training objectives: fixed-center cross-entropy, clean-to-adversarial L2 alignment, and clean-adversarial InfoNCE, and we introduce a regularization strategy to ensure modality-consistent alignment under attack. Experiments on six modalities and three Bind-style models show that our method improves adversarial robustness by up to 47.3 percent at epsilon = 4/255, while preserving or even improving clean zero-shot and retrieval performance with less than 1 percent trainable parameters.
LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning over four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
comment: 22 pages, 9 figures
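One plausible reading of the Plausibility Preference Error described above is the fraction of valid-invalid pairs on which the model's denoising loss fails to favor the physically valid video. The sketch below follows that reading; the function name, the pair format, and the treatment of ties as errors are assumptions, and the paper's exact definition (for example, how losses are averaged over noise levels) may differ.

```python
def plausibility_preference_error(loss_pairs):
    """Fraction of pairs where the valid video is NOT assigned the lower loss.

    loss_pairs: iterable of (valid_loss, invalid_loss) tuples, where each loss
    is a denoising loss used as a likelihood surrogate (lower = more likely).
    """
    pairs = list(loss_pairs)
    if not pairs:
        raise ValueError("at least one valid-invalid pair is required")
    # Assumed convention: a tie counts as an error, since the model
    # expresses no preference for the physically valid video.
    errors = sum(1 for valid_loss, invalid_loss in pairs if valid_loss >= invalid_loss)
    return errors / len(pairs)
```

Under this reading, a value of 0.0 means the model always prefers the valid video, and a value near 0.5 corresponds to chance-level preference.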
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, HoliSafe-Bench itself reveals critical vulnerabilities in existing VLMs. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
comment: Project page: https://youngwanlee.github.io/holisafe
StrCGAN: A Generative Framework for Stellar Image Restoration
We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity, ground-truth-like representations of stellar objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as those in the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but often distort the morphology of stars and produce outputs that bear little resemblance to the ground truth. To overcome these limitations, we extend the CycleGAN framework with key innovations: multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate visually sharper reconstructions, outperforming standard GAN models in the task of astrophysical image enhancement.
Are Image-to-Video Models Good Zero-Shot Image Editors?
Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
comment: technical report
AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation
Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs that limit effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment rather than relying on strict regression losses. Our key insight is to align both generated and reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, our method, namely AlignCVC, seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.
Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning
The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, there has been a lot of focus on self-supervised pre-training methods through Masked Autoencoders (MAE) (He et al., 2022; Bachmann et al., 2022), usually used as a first step before optimizing for a downstream task, such as classification or regression. This is very useful as it doesn't require any manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs (Leordeanu et al., 2021) with the modern approach of masked autoencoders under a common theoretical framework. Through random masking of entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics. In order to streamline the process of integrating external pre-trained experts for computer vision multi-modal multi-task learning (MTL) scenarios, we developed data-pipeline software. Using this tool, we have created and released a fully-automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.
comment: Submitted to Neurocomputing
Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts, relying on a flawed single-step perception paradigm. This paradigm describes the video and then conducts reasoning, which runs the risk of insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address the insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the video all at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, to reduce the risk of hallucinations, the Factual-Aware Evaluator (FAE) evaluates each perception result and serves as a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on our AnetHallu-117K, a large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves state-of-the-art performance at both the 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released at https://github.com/BoweiPu/VideoPLR.
comment: 32 pages, 36 figures
WeatherDiffusion: Controllable Weather Editing in Intrinsic Space
We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.
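The idea of interpolating between weather prompts in an embedding space can be sketched as a simple blend of two prompt embeddings. The abstract does not say whether the interpolation is linear or spherical, so the linear blend with re-normalization below, along with the function name and the toy two-dimensional vectors, is an assumption for illustration only.

```python
import math


def interpolate_embeddings(embed_a, embed_b, t):
    """Linearly blend two prompt embeddings and re-normalize to unit length.

    t = 0 returns the direction of embed_a, t = 1 the direction of embed_b;
    intermediate values give a graded mix (for example, light to heavy rain).
    """
    if len(embed_a) != len(embed_b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    mixed = [(1.0 - t) * a + t * b for a, b in zip(embed_a, embed_b)]
    norm = math.sqrt(sum(v * v for v in mixed))
    # A zero vector cannot be normalized; return it unchanged in that case.
    return mixed if norm == 0.0 else [v / norm for v in mixed]
```

In practice the two inputs would be text-encoder outputs for prompts describing different weather conditions, and the blended vector would condition the forward renderer.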
Exploring Convolutional Neural Networks for Rice Grain Classification: An Explainable AI Approach
Rice is an essential staple food worldwide that is important in promoting international trade, economic growth, and nutrition. Asian countries such as China, India, Pakistan, Thailand, Vietnam, and Indonesia are notable for their significant contribution to the cultivation and utilization of rice. These nations are also known for cultivating different rice grains, including short and long grains. These sizes are further classified as basmati, jasmine, kainat saila, ipsala, arborio, etc., catering to diverse culinary preferences and cultural traditions. For both local and international trade, inspecting and maintaining the quality of rice grains to satisfy customers and preserve a country's reputation is necessary. Manual quality check and classification is quite a laborious and time-consuming process. It is also highly prone to mistakes. Therefore, an automatic solution must be proposed for the effective and efficient classification of different varieties of rice grains. This research paper presents an automatic framework based on a convolutional neural network (CNN) for classifying different varieties of rice grains. We evaluated the proposed model based on performance metrics such as accuracy, recall, precision, and F1-Score. The CNN model underwent rigorous training and validation, achieving a remarkable accuracy rate and a perfect area under each class's Receiver Operating Characteristic (ROC) curve. The confusion matrix analysis confirmed the model's effectiveness in distinguishing between the different rice varieties, indicating minimal misclassifications. Additionally, the integration of explainability techniques such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) provided valuable insights into the model's decision-making process, revealing how specific features of the rice grains influenced classification outcomes.
Panoptic Captioning: An Equivalence Bridge for Image and Text NeurIPS 2025
This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalent of images, which has broad potential applications. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/
comment: NeurIPS 2025; Project page: https://visual-ai.github.io/pancap/
Panoramic Distortion-Aware Tokenization for Person Detection and Localization in Overhead Fisheye Images
Person detection in overhead fisheye images is challenging due to person rotation and small persons. Prior work has mainly addressed person rotation, leaving the small-person problem underexplored. We remap fisheye images to equirectangular panoramas to handle rotation and exploit panoramic geometry to handle small persons more effectively. Conventional detection methods tend to favor larger persons because they dominate the attention maps, causing smaller persons to be missed. In hemispherical equirectangular panoramas, we find that apparent person height decreases approximately linearly with the vertical angle near the top of the image. Using this finding, we introduce panoramic distortion-aware tokenization to enhance the detection of small persons. This tokenization procedure divides panoramic features using self-similar figures that enable the determination of optimal divisions without gaps, and we leverage the maximum significance values in each tile of the token groups to preserve the significance areas of smaller persons. We propose a transformer-based person detection and localization method that combines panoramic-image remapping and the tokenization procedure. Extensive experiments demonstrated that our method outperforms conventional methods on large-scale datasets.
ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more challenging in Diffusion Transformers (DiTs), where the unsuitability of prior layer-selection heuristics makes effective guidance challenging. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. In detail, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
comment: The project page is at https://yychen233.github.io/ContextFlow-page
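The context-enrichment idea in the ContextFlow abstract above, concatenating Key-Value pairs from two paths instead of replacing features, can be sketched with a single-query, single-head attention in plain Python. All names and the list-based tensor layout are hypothetical; the actual method operates inside DiT blocks and selects layers with its Guidance Responsiveness Metric, which is not modeled here.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (softmax over keys)."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def enriched_attention(query, edit_kv, recon_kv):
    """Attend over the editing path's keys/values plus the reconstruction path's.

    edit_kv and recon_kv are (keys, values) tuples. Concatenation lets the
    softmax decide how much to draw from each path, rather than overwriting
    the editing features with reconstruction features.
    """
    keys = edit_kv[0] + recon_kv[0]
    values = edit_kv[1] + recon_kv[1]
    return attention(query, keys, values)
```

With an empty reconstruction context the result reduces to ordinary self-attention on the editing path, which makes the enrichment easy to switch on only for selected layers.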
OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation
The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). A VQVAE framework is introduced to learn latent features of LensLib's PSFs, assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.
Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
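A direction separating "real" from "lookalike" in an embedding space can be estimated in several ways; the abstract does not specify which. The sketch below uses the simplest option, a difference of class means, and then shifts an embedding along that direction. The function names, the difference-of-means estimator, and the toy vectors in the usage note are all assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def estimate_direction(real_embeddings, lookalike_embeddings):
    """Difference of class means, pointing from 'real' toward 'lookalike'."""
    real_mean = mean_vector(real_embeddings)
    lookalike_mean = mean_vector(lookalike_embeddings)
    return [l - r for l, r in zip(lookalike_mean, real_mean)]


def shift_embedding(embedding, direction, alpha):
    """Move an embedding along the direction; negative alpha moves toward 'real'."""
    return [e + alpha * d for e, d in zip(embedding, direction)]
```

For example, with real embeddings `[[1.0, 0.0], [3.0, 0.0]]` and lookalike embeddings `[[2.0, 2.0], [2.0, 4.0]]`, the estimated direction is `[0.0, 3.0]`, and shifting `[1.0, 1.0]` with `alpha = -1.0` yields `[1.0, -2.0]`.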
Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain
The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior work has attempted to transfer semantic knowledge from RGB datasets to DVS, they often overlook the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model's ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.
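A rough sketch of mixing at the time-step level follows. The abstract only states that RGB and DVS inputs are interpolated at various time-steps, so the details here are assumptions: whole time-steps are swapped using a random binary mask, both sequences are assumed to share the shape `(T, C, H, W)`, and the soft label weight equals the fraction of RGB time-steps. The function name `time_step_mixup` is hypothetical, and the modality-aware auxiliary objectives are not modeled.

```python
import numpy as np

def time_step_mixup(rgb_seq, dvs_seq, rgb_label, dvs_label, num_classes, rng):
    """Replace a random subset of time-steps in the DVS sequence with RGB frames.

    rgb_seq, dvs_seq: arrays of shape (T, C, H, W), assumed already aligned.
    Returns the mixed sequence, a soft label, and the boolean time-step mask.
    """
    T = rgb_seq.shape[0]
    k = int(rng.integers(0, T + 1))        # number of time-steps taken from RGB
    mask = np.zeros(T, dtype=bool)
    mask[rng.permutation(T)[:k]] = True    # True -> use the RGB frame at this step
    mixed = np.where(mask[:, None, None, None], rgb_seq, dvs_seq)

    lam = k / T                            # assumed label weight: share of RGB steps
    eye = np.eye(num_classes)
    label = lam * eye[rgb_label] + (1.0 - lam) * eye[dvs_label]
    return mixed, label, mask

rng = np.random.default_rng(0)
T, C, H, W = 6, 2, 4, 4
rgb_seq = np.ones((T, C, H, W))    # toy RGB frames repeated over T time-steps
dvs_seq = np.zeros((T, C, H, W))   # toy event frames
mixed, label, mask = time_step_mixup(
    rgb_seq, dvs_seq, rgb_label=0, dvs_label=2, num_classes=3, rng=rng
)
```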
Multi-view Surface Reconstruction Using Normal and Reflectance Cues
Achieving high-fidelity 3D surface reconstruction while preserving fine details remains challenging, especially in the presence of materials with complex reflectance properties and without a dense-view setup. In this paper, we introduce a versatile framework that incorporates multi-view normal and optionally reflectance maps into radiance-based surface reconstruction. Our approach employs a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated, varying illumination. This formulation enables seamless incorporation into standard surface reconstruction pipelines, such as traditional multi-view stereo (MVS) frameworks or modern neural volume rendering (NVR) ones. Combined with the latter, our approach achieves state-of-the-art performance on multi-view photometric stereo (MVPS) benchmark datasets, including DiLiGenT-MV, LUCES-MV and Skoltech3D. In particular, our method excels in reconstructing fine-grained details and handling challenging visibility conditions. The present paper is an extended version of the earlier conference paper by Brument et al. (in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024), featuring an accelerated and more robust algorithm as well as a broader empirical evaluation. The code and data for this article are available at https://github.com/RobinBruneau/RNb-NeuS2.
comment: 22 pages, 15 figures, 11 tables. A thorough qualitative and quantitative study is available in the supplementary material at https://drive.google.com/file/d/1KDfCKediXNP5Os954TL_QldaUWS0nKcD/view?usp=drive_link. Accepted to IJCV
M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion 3DV 2026
We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6x more often than the second-place method in a user study, while being 6x faster.
comment: To be published at 3DV 2026, project webpage https://m2svid.github.io/
SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation
Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75% reduction in memory footprint, 124% faster inference, and a 5.9% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be made available at https://github.com/seamie6/SuperQuadricOcc.
Prompt Guiding Multi-Scale Adaptive Sparse Representation-driven Network for Low-Dose CT MAR
Low-dose CT (LDCT) reduces X-ray radiation exposure but can degrade image quality and even yield metal artifacts in the presence of metallic implants. For simultaneous LDCT reconstruction and metal artifact reduction (LDMAR), existing deep learning-based efforts face two main limitations: i) the network design neglects multi-scale and within-scale information; ii) training a distinct model for each dose level requires significant storage space. To fill these gaps, we propose a prompt guiding multi-scale adaptive sparse representation-driven network, abbreviated as PMSRNet, for the LDMAR task. Specifically, we construct PMSRNet inspired by multi-scale sparsifying frames, and it can simultaneously exploit within-scale characteristics and cross-scale complementarity owing to an elaborate prompt guiding scale-adaptive threshold generator (PSATG) and a multi-scale coefficient fusion module (MSFuM). The PSATG adaptively captures multiple kinds of contextual information to generate more faithful thresholds by fusing features from local, regional, and global levels. Furthermore, we develop a model-interpretable dual-domain LDMAR framework called PDuMSRNet and train a single model with a prompt guiding strategy for multiple dose levels. We build a prompt guiding module, whose input contains the dose level, metal mask, and input instance, to provide various guiding information, allowing a single model to accommodate various CT dose settings. Extensive experiments at various dose levels demonstrate that the proposed methods outperform state-of-the-art LDMAR methods.
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping. Thanks to these designs, which greatly preserve the pre-trained prior of the video generation model, our approach is able to outperform other full-parameter training methods in video quality and identity preservation, even with just $\sim$1% additional parameters and only 2000 training pairs. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
Zero-Shot Anomaly Detection with Dual-Branch Prompt Selection BMVC 2025
Zero-shot anomaly detection (ZSAD) enables identifying and localizing defects in unseen categories by relying solely on generalizable features rather than requiring any labeled examples of anomalies. However, existing ZSAD methods, whether using fixed or learned prompts, struggle under domain shifts because their training data are derived from limited training domains and fail to generalize to new distributions. In this paper, we introduce PILOT, a framework designed to overcome these challenges through two key innovations: (1) a novel dual-branch prompt learning mechanism that dynamically integrates a pool of learnable prompts with structured semantic attributes, enabling the model to adaptively weight the most relevant anomaly cues for each input image; and (2) a label-free test-time adaptation strategy that updates the learnable prompt parameters using high-confidence pseudo-labels from unlabeled test data. Extensive experiments on 13 industrial and medical benchmarks demonstrate that PILOT achieves state-of-the-art performance in both anomaly detection and localization under domain shift.
comment: Accepted at BMVC 2025
Orientation Matters: Making 3D Generative Models Orientation-Aligned NeurIPS 2025
Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.
comment: Accepted by NeurIPS 2025. Project Page: https://xdimlab.github.io/Orientation_Matters
Comparison of Generative Learning Methods for Turbulence Surrogates
Numerical simulations of turbulent flows present significant challenges in fluid dynamics due to their complexity and high computational cost. High resolution techniques such as Direct Numerical Simulation (DNS) and Large Eddy Simulation (LES) are generally not computationally affordable, particularly for technologically relevant problems. Recent advances in machine learning, specifically in generative probabilistic models, offer promising alternatives as surrogates for turbulence. This paper investigates the application of three generative models - Variational Autoencoders (VAE), Deep Convolutional Generative Adversarial Networks (DCGAN), and Denoising Diffusion Probabilistic Models (DDPM) - in simulating a von Kármán vortex street around a fixed cylinder projected into 2D, as well as a real-world experimental dataset of the wake flow of a cylinder array. Training data was obtained by means of LES in the simulated case and Particle Image Velocimetry (PIV) in the experimental case. We evaluate each model's ability to capture the statistical properties and spatial structures of the turbulent flow. Our results demonstrate that DDPM and DCGAN effectively replicate all flow distributions, highlighting their potential as efficient and accurate tools for turbulence surrogacy. We find a strong argument for DCGAN: although they are more difficult to train (due to problems such as mode collapse), they show the fastest inference and training time, require less data to train than VAE and DDPM, and provide the results most closely aligned with the input stream. In contrast, VAE train quickly (and can generate samples quickly) but do not produce adequate results, and DDPM, whilst effective, are significantly slower at both inference and training time.
Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image. Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity. Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection. However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors. Therefore, neither parallel nor sequential prediction is optimal. In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs. A Chain-of-Prediction (CoP) explores inter-attribute correlations through feature-level learning, propagation, and aggregation, while an Uncertainty-Guided Selector (GS) dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty. By combining their strengths, MonoCoP achieves state-of-the-art (SOTA) performance on KITTI, nuScenes, and Waymo, significantly improving depth accuracy, particularly for distant objects.
A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding
While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of "Where" anomalies occur and "Why" they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model's latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel "seeking" mechanism that simulates UAV flight behavior by directing the model's attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code are released at https://2-mo.github.io/A2Seek/.
Leveraging Unlabeled Data from Unknown Sources via Dual-Path Guidance for Deepfake Face Detection
Existing deepfake detection methods heavily rely on static labeled datasets. However, with the proliferation of generative models, real-world scenarios are flooded with massive amounts of unlabeled fake face data from unknown sources. This presents a critical dilemma: detectors relying solely on existing data face generalization failure, while manual labeling for this new stream is infeasible due to the high realism of fakes. A more fundamental challenge is that, unlike typical unsupervised learning tasks where categories are clearly defined, real and fake faces share the same semantics, which leads to a decline in the performance of traditional unsupervised strategies. Therefore, there is an urgent need for a new paradigm designed specifically for this scenario to effectively utilize these unlabeled data. Accordingly, this paper proposes a dual-path guided network (DPGNet) to address two key challenges: (1) bridging the domain differences between faces generated by different generative models; and (2) utilizing unlabeled image samples. The method comprises two core modules: text-guided cross-domain alignment, which uses learnable cues to unify visual and textual embeddings into a domain-invariant feature space; and curriculum-driven pseudo-label generation, which dynamically utilizes unlabeled samples. Extensive experiments on multiple mainstream datasets show that DPGNet significantly outperforms existing techniques, highlighting its effectiveness in using unlabeled data to address the challenges posed by deepfakes.
comment: 11 pages, 4 figures
Image and Video Processing
Time-Domain Linear Model-based Framework for Passive Acoustic Mapping of Cavitation Activity
Passive acoustic mapping enables the spatial mapping and temporal monitoring of cavitation activity, playing a crucial role in therapeutic ultrasound applications. Most conventional beamforming methods, whether implemented in the time or frequency domains, suffer from limited axial resolution due to the absence of a reference emission onset time. While frequency-domain methods, the most efficient of which are based on the cross-spectral matrix, require long signals for accurate estimation, time-domain methods typically achieve lower spatial resolution. To address these limitations, we propose a linear model-based beamforming framework fully formulated in the time domain. The linear forward model relates a discretized spatiotemporal distribution of cavitation activity to the temporal signals recorded by a probe, explicitly accounting for time-of-flight delays dictated by the acquisition geometry. This model is then inverted using regularization techniques that exploit prior knowledge of cavitation activity in both spatial and temporal domains. Experimental results show that the proposed framework achieves enhanced or competitive cavitation map quality while using only 20% of the data typically required by frequency-domain methods. This highlights the substantial gain in data efficiency and the flexibility of our spatiotemporal regularization to adapt to diverse passive cavitation scenarios, outperforming state-of-the-art techniques.
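The forward model and its regularized inversion can be sketched on a toy problem. Everything specific below is an assumption for illustration: the 2D geometry, the speed of sound and sampling rate, unit amplitudes with no attenuation, and plain Tikhonov (ridge) regularization in place of the spatial and temporal priors the abstract mentions but does not specify. The names `build_forward_matrix` and `ridge_inverse` are hypothetical.

```python
import numpy as np

def build_forward_matrix(delays, n_emit, n_rec):
    """Linear model y = A x for delayed arrivals.

    delays: (M, K) integer time-of-flight delays in samples (sensor m, pixel k).
    x is indexed by (pixel k, emission sample n); y by (sensor m, recorded sample s).
    An emission at sample n from pixel k reaches sensor m at sample n + delays[m, k].
    """
    M, K = delays.shape
    A = np.zeros((M * n_rec, K * n_emit))
    for m in range(M):
        for k in range(K):
            for n in range(n_emit):
                s = n + delays[m, k]
                A[m * n_rec + s, k * n_emit + n] = 1.0
    return A

def ridge_inverse(A, y, lam):
    """Tikhonov-regularized least squares: argmin ||A x - y||^2 + lam ||x||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Toy geometry (illustrative values): 8 sensors on a line, 5 candidate pixels.
c, fs = 1500.0, 5e6                       # speed of sound [m/s], sampling rate [Hz]
sensors = np.stack([np.linspace(-0.01, 0.01, 8), np.zeros(8)], axis=1)
pixels = np.stack([np.linspace(-0.004, 0.004, 5), np.full(5, 0.03)], axis=1)
dist = np.linalg.norm(sensors[:, None, :] - pixels[None, :, :], axis=2)
delays = np.round(dist / c * fs).astype(int)
delays -= delays.min()                    # no absolute emission time reference

n_emit = 20
n_rec = n_emit + int(delays.max())        # long enough to hold every arrival
A = build_forward_matrix(delays, n_emit, n_rec)

# Ground truth: the middle pixel emits during samples 5..8.
x_true = np.zeros((5, n_emit))
x_true[2, 5:9] = 1.0
y = A @ x_true.ravel()

lam = 1e-2
x_hat = ridge_inverse(A, y, lam).reshape(5, n_emit)
```

Each row of `x_hat` is an estimated emission time course for one candidate pixel. A spatiotemporal prior would replace the identity matrix in `ridge_inverse` with a structured penalty.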
DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models
With the rapid advancement of diffusion models, a variety of fine-tuning methods have been developed, enabling high-fidelity image generation with high similarity to the target content using only 3 to 5 training images. More recently, zero-shot generation methods have emerged, capable of producing highly realistic outputs from a single reference image without altering model weights. However, technological advancements have also introduced significant risks to facial privacy. Malicious actors can exploit diffusion model customization with just a few or even one image of a person to create synthetic identities nearly identical to the original identity. Although research has begun to focus on defending against diffusion model customization, most existing defense methods target fine-tuning approaches and neglect zero-shot generation defenses. To address this issue, this paper proposes Dual-Layer Anti-Diffusion (DLADiff) to defend against both fine-tuning and zero-shot methods. DLADiff contains a dual-layer protective mechanism. The first layer provides effective protection against unauthorized fine-tuning by leveraging the proposed Dual-Surrogate Models (DSUR) mechanism and Alternating Dynamic Fine-Tuning (ADFT), which integrates adversarial training with the prior knowledge derived from pre-fine-tuned models. The second layer, though simple in design, demonstrates strong effectiveness in preventing image generation through zero-shot methods. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in defending against fine-tuning of diffusion models and achieves unprecedented performance in protecting against zero-shot generation.
Field Test of 5G New Radio (NR) UL-MIMO and UL-256QAM for HD Live-Streaming
The exponential growth of User-Generated Content (UGC), especially High-Definition (HD) live video streaming, places a significant demand on the uplink capabilities of mobile networks. To address this, the 5G New Radio (NR) standard introduced key uplink enhancements, including Uplink Multi-Input Multi-Output (UL-MIMO) and Uplink 256QAM, to improve throughput and spectral efficiency. However, while the benefits of these features for raw data rates are well-documented, their practical impact on real-time applications like live-streaming is not yet well understood. This paper investigates the performance of UL-MIMO and UL-256QAM for HD live-streaming over a commercial 5G network using the Real-Time Messaging Protocol (RTMP). To ensure a fair assessment, we conduct a comparative analysis by modifying the modem firmware of commercial User Equipment (UE), allowing these features to be selectively enabled and disabled on the same device. Performance is evaluated based on key metrics, including dropped video frames and connection stability. Furthermore, this study analyzes 5G Radio Frequency (RF) parameters to quantify the spectral efficiency impact, specifically examining metrics derived from the Channel State Information (CSI) framework, including Reference Signal Received Power (CSI-RSRP), Reference Signal Received Quality (CSI-RSRQ), and Signal-to-Interference-plus-Noise Ratio (CSI-SINR).
comment: 2025 IEEE International Conference on Visual Communications and Image Processing (VCIP 2025), 1-4 December 2025, Klagenfurt, Austria
MODEST: Multi-Optics Depth-of-Field Stereo Dataset
Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.
Adversarial Multi-Task Learning for Liver Tumor Segmentation, Dynamic Enhancement Regression, and Classification
Liver tumor segmentation, dynamic enhancement regression, and classification are critical for clinical assessment and diagnosis. However, no prior work has attempted to achieve these tasks simultaneously in an end-to-end framework, primarily due to the lack of an effective framework that captures inter-task relevance for mutual improvement and the absence of a mechanism to extract dynamic MRI information effectively. To address these challenges, we propose the Multi-Task Interaction adversarial learning Network (MTI-Net), a novel integrated framework designed to tackle these tasks simultaneously. MTI-Net incorporates Multi-domain Information Entropy Fusion (MdIEF), which utilizes entropy-aware, high-frequency spectral information to effectively integrate features from both frequency and spectral domains, enhancing the extraction and utilization of dynamic MRI data. The network also introduces a task interaction module that establishes higher-order consistency between segmentation and regression, thus fostering inter-task synergy and improving overall performance. Additionally, we designed a novel task-driven discriminator (TDD) to capture internal high-order relationships between tasks. For dynamic MRI information extraction, we employ a shallow Transformer network to perform positional encoding, which captures the relationships within dynamic MRI sequences. In experiments on a dataset of 238 subjects, MTI-Net demonstrates high performance across multiple tasks, indicating its strong potential for assisting in the clinical assessment of liver tumors. The code is available at: https://github.com/xiaojiao929/MTI-Net.
Automated Histopathologic Assessment of Hirschsprung Disease Using a Multi-Stage Vision Transformer Framework
Hirschsprung Disease is characterized by the absence of ganglion cells in the myenteric plexus. Therefore, their correct identification is crucial for diagnosing Hirschsprung disease. We introduce a three-stage segmentation framework based on a Vision Transformer (ViT-B/16) that mimics the pathologist's diagnostic approach. The framework sequentially segments the muscularis propria, delineates the myenteric plexus, and identifies ganglion cells within anatomically valid regions. 30 whole-slide images of colon tissue were used, each containing expert manual annotations of muscularis, plexus, and ganglion cells at varying levels of certainty. A 5-fold cross-validation scheme was applied to each stage, along with resolution-specific tiling strategies and tailored postprocessing to ensure anatomical consistency. The proposed method achieved a Dice coefficient of 89.9% and a Plexus Inclusion Rate of 100% for muscularis segmentation. Plexus segmentation reached a recall of 94.8%, a precision of 84.2% and a Ganglia Inclusion Rate of 99.7%. For high-certainty ganglion cells, the model achieved 62.1% precision and 89.1% recall, while joint certainty scores yielded 67.0% precision. These results indicate that ViT-based models are effective at leveraging global tissue context and capturing cellular morphology at small scales, even within complex histological tissue structures. This multi-stage methodology has great potential to support digital pathology workflows by reducing inter-observer variability and assisting in the evaluation of Hirschsprung disease. The clinical impact will be evaluated in future work with larger multi-center datasets and additional expert annotations.
comment: 16 pages, 8 figures, 6 tables
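For reference, the Dice coefficient, precision, and recall reported above can be computed from binary masks as in the sketch below. This is a pixel-level illustration under the standard definitions. The paper's ganglion-cell scores may be computed per object rather than per pixel, and the Plexus and Ganglia Inclusion Rates are not reproduced because their definitions are not given in the abstract. The function name `segmentation_metrics` is hypothetical.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Return (dice, precision, recall) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))      # predicted positive, truly positive
    fp = int(np.sum(pred & ~target))     # predicted positive, truly negative
    fn = int(np.sum(~pred & target))     # missed positives

    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0   # two empty masks agree perfectly
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall

# Toy example: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
dice, precision, recall = segmentation_metrics(pred, target)
```

In this toy case all three values equal 2/3, since Dice is the harmonic mean of precision and recall.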
Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?
Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose LTED-Ada for the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection according to the frame rate as well as the recognition accuracy and delay requirements. In the multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.
LAYER: A Quantitative Explainable AI Framework for Decoding Tissue-Layer Drivers of Myofascial Low Back Pain
Myofascial pain (MP) is a leading cause of chronic low back pain, yet its tissue-level drivers remain poorly defined and lack reliable image biomarkers. Existing studies focus predominantly on muscle while neglecting fascia, fat, and other soft tissues that play integral biomechanical roles. We developed an anatomically grounded explainable artificial intelligence (AI) framework, LAYER (Layer-wise Analysis for Yielding Explainable Relevance Tissue), that analyses six tissue layers in three-dimensional (3D) ultrasound and quantifies their contribution to MP prediction. By utilizing the largest multi-modal 3D ultrasound cohort consisting of over 4,000 scans, LAYER reveals that non-muscle tissues contribute substantially to pain prediction. In B-mode imaging, the deep fascial membrane (DFM) showed the highest saliency (0.420), while in combined B-mode and shear-wave images, the collective saliency of non-muscle layers (0.316) nearly matches that of muscle (0.317), challenging the conventional muscle-centric paradigm in MP research and potentially affecting the therapy methods. LAYER establishes a quantitative, interpretable framework for linking layer-specific anatomy to pain physiology, uncovering new tissue targets and providing a generalizable approach for explainable analysis of soft-tissue imaging.
Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
With the rapid progress in diffusion models, image synthesis has advanced to the stage of zero-shot image-to-image generation, where high-fidelity replication of facial identities or artistic styles can be achieved using just one portrait or artwork, without modifying any model weights. Although these techniques significantly enhance creative possibilities, they also pose substantial risks related to intellectual property violations, including unauthorized identity cloning and stylistic imitation. To counter such threats, this work presents Adapter Shield, the first universal and authentication-integrated solution aimed at defending personal images from misuse in zero-shot generation scenarios. We first investigate how current zero-shot methods employ image encoders to extract embeddings from input images, which are subsequently fed into the UNet of diffusion models through cross-attention layers. Inspired by this mechanism, we construct a reversible encryption system that maps original embeddings into distinct encrypted representations according to different secret keys. The authorized users can restore the authentic embeddings via a decryption module and the correct key, enabling normal usage for authorized generation tasks. For protection purposes, we design a multi-target adversarial perturbation method that actively shifts the original embeddings toward designated encrypted patterns. Consequently, protected images are embedded with a defensive layer that ensures unauthorized users can only produce distorted or encrypted outputs. Extensive evaluations demonstrate that our method surpasses existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis, while supporting flexible and secure access control for verified users.
LoRA-based methods on Unet for transfer learning in Subarachnoid Hematoma Segmentation
Aneurysmal subarachnoid hemorrhage (SAH) is a life-threatening neurological emergency with mortality rates exceeding 30%. Transfer learning from related hematoma types represents a potentially valuable but underexplored approach. Although Unet architectures remain the gold standard for medical image segmentation due to their effectiveness on limited datasets, Low-Rank Adaptation (LoRA) methods for parameter-efficient transfer learning have been rarely applied to convolutional neural networks in medical imaging contexts. We implemented a Unet architecture pre-trained on computed tomography scans from 124 traumatic brain injury patients across multiple institutions, then fine-tuned on 30 aneurysmal SAH patients from the University of Michigan Health System using 3-fold cross-validation. We developed a novel CP-LoRA method based on tensor CP-decomposition and introduced DoRA variants (DoRA-C, convDoRA, CP-DoRA) that decompose weight matrices into magnitude and directional components. We compared these approaches against existing LoRA methods (LoRA-C, convLoRA) and standard fine-tuning strategies across different modules on a multi-view Unet model. LoRA-based methods consistently outperformed standard Unet fine-tuning. Performance varied by hemorrhage volume, with all methods showing improved accuracy for larger volumes. CP-LoRA achieved comparable performance to existing methods while using significantly fewer parameters. Over-parameterization with higher ranks consistently yielded better performance than strictly low-rank adaptations. This study demonstrates that transfer learning between hematoma types is feasible and that LoRA-based methods significantly outperform conventional Unet fine-tuning for aneurysmal SAH segmentation.
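The low-rank update and the magnitude/direction split mentioned in the abstract can be illustrated on a plain 2D weight matrix. This is only a minimal sketch under stated assumptions: the paper works with convolutional kernels and a CP-decomposition variant that are not reproduced here, and the function names, the `alpha / r` scaling and the column-wise normalization are illustrative choices rather than the authors' implementation. Column norms are assumed to be non-zero.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w0, b, a, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the shared inner rank."""
    r = len(a)  # A has shape (r, in_features)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

def dora_weight(w0, b, a, alpha, magnitude):
    """Rescale each column of the LoRA-updated weight to a learned magnitude.

    Assumes every column of the updated weight has a non-zero norm.
    """
    v = lora_weight(w0, b, a, alpha)
    n_cols = len(v[0])
    col_norms = [math.sqrt(sum(row[j] ** 2 for row in v)) for j in range(n_cols)]
    return [[magnitude[j] * row[j] / col_norms[j] for j in range(n_cols)] for row in v]
```

Only `b`, `a` and (for the DoRA-style variant) `magnitude` would be trained in such a setup, while `w0` stays frozen.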
Scalable FPGA Framework for Real-Time Denoising in High-Throughput Imaging: A DRAM-Optimized Pipeline using High-Level Synthesis
High-throughput imaging workflows, such as Parallel Rapid Imaging with Spectroscopic Mapping (PRISM), generate data at rates that exceed conventional real-time processing capabilities. We present a scalable FPGA-based preprocessing pipeline for real-time denoising, implemented via High-Level Synthesis (HLS) and optimized for DRAM-backed buffering. Our architecture performs frame subtraction and averaging directly on streamed image data, minimizing latency through burst-mode AXI4 interfaces. The resulting kernel operates below the inter-frame interval, enabling inline denoising and reducing dataset size for downstream CPU/GPU analysis. Validated under PRISM-scale acquisition, this modular FPGA framework offers a practical solution for latency-sensitive imaging workflows in spectroscopy and microscopy.
comment: FPGA-based denoising pipeline for PRISM-scale imaging. Real-time frame subtraction and averaging via burst-mode AXI4 and DRAM buffering. Benchmarked against CPU/GPU workflows; scalable across multi-bank FPGA setups. Acknowledgements revised for consistency with journal submission; scientific content remains unchanged
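As a software reference for the frame subtraction and averaging step described above, the following sketch processes frames as they arrive. It is not the HLS kernel: the choice of a fixed `background` frame as the subtrahend and the `group_size` parameter are assumptions made for illustration, and a trailing incomplete group is simply dropped.

```python
def denoise_stream(frames, background, group_size):
    """Yield one averaged, background-subtracted frame per `group_size` inputs.

    `frames` is any iterable of 2D frames (lists of rows); `background` is a
    hypothetical reference frame of the same shape.
    """
    acc = None
    count = 0
    for frame in frames:
        # Pixel-wise subtraction of the reference frame.
        diff = [[p - q for p, q in zip(f_row, b_row)]
                for f_row, b_row in zip(frame, background)]
        if acc is None:
            acc = diff
        else:
            acc = [[s + d for s, d in zip(s_row, d_row)]
                   for s_row, d_row in zip(acc, diff)]
        count += 1
        if count == group_size:
            # Emit the group average and reset the accumulator.
            yield [[s / group_size for s in row] for row in acc]
            acc, count = None, 0
```

Because each output frame replaces `group_size` inputs, the data volume handed to downstream analysis shrinks by that factor.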
CardioComposer: Leveraging Differentiable Geometry for Compositional Control of Anatomical Diffusion Models
Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllability and realism. We propose CardioComposer: a programmable, inference-time framework for generating multi-class anatomical label maps based on interpretable ellipsoidal primitives. These primitives represent geometric attributes such as the size, shape, and position of discrete substructures. We specifically develop differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during diffusion model sampling. We demonstrate that these losses can constrain individual geometric attributes in a disentangled manner and provide compositional control over multiple substructures. Finally, we show that our method is compatible with a wide array of anatomical systems containing non-convex substructures, spanning cardiac, vascular, and skeletal organs.
comment: 10 pages, 16 figures
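The idea of measuring size, position and shape through voxel-wise geometric moments can be sketched as follows. This is an illustrative plain-Python version with a hypothetical function name; it assumes a soft occupancy grid indexed as `mask[z][y][x]` with positive total mass. Since it uses only sums and products of voxel weights, the same computation would be differentiable in an autodiff framework, but no gradients are computed here.

```python
def geometric_moments(mask):
    """Return (volume, centroid, covariance) of a soft 3D occupancy grid."""
    volume = 0.0
    first = [0.0, 0.0, 0.0]
    second = [[0.0] * 3 for _ in range(3)]
    for z, plane in enumerate(mask):
        for y, row in enumerate(plane):
            for x, w in enumerate(row):
                p = (z, y, x)
                volume += w                      # zeroth moment: size
                for i in range(3):
                    first[i] += w * p[i]         # first moments: position
                    for j in range(3):
                        second[i][j] += w * p[i] * p[j]  # raw second moments
    centroid = [m / volume for m in first]
    # Central second moments describe the extent and orientation of the shape.
    covariance = [[second[i][j] / volume - centroid[i] * centroid[j]
                   for j in range(3)] for i in range(3)]
    return volume, centroid, covariance
```

A guidance loss could then compare these values with the corresponding attributes of a target ellipsoid, for example a squared difference between centroids.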
Improving Lesion Segmentation in Medical Images by Global and Regional Feature Compensation
Automated lesion segmentation of medical images has made tremendous improvements in recent years due to deep learning advancements. However, accurately capturing fine-grained global and regional feature representations remains a challenge. Many existing methods obtain suboptimal performance on complex lesion segmentation due to information loss during typical downsampling operations and the insufficient capture of either regional or global features. To address these issues, we propose the Global and Regional Compensation Segmentation Framework (GRCSF), which introduces two key innovations: the Global Compensation Unit (GCU) and the Region Compensation Unit (RCU). The proposed GCU addresses resolution loss in the U-shaped backbone by preserving global contextual features and fine-grained details during multiscale downsampling. Meanwhile, the RCU introduces a self-supervised learning (SSL) residual map generated by Masked Autoencoders (MAE), obtained as pixel-wise differences between reconstructed and original images, to highlight regions with potential lesions. These SSL residual maps guide precise lesion localization and segmentation through a patch-based cross-attention mechanism that integrates regional spatial and pixel-level features. Additionally, the RCU incorporates patch-level importance scoring to enhance feature fusion by leveraging global spatial information from the backbone. Experiments on two publicly available medical image segmentation datasets, including brain stroke lesion and coronary artery calcification datasets, demonstrate that our GRCSF outperforms state-of-the-art methods, confirming its effectiveness across diverse lesion types and its potential as a generalizable lesion segmentation solution.
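The residual map described above, a pixel-wise difference between a reconstructed image and the original, can be sketched in a few lines. Taking the absolute value is an assumption made here so that the map is non-negative; the abstract does not say how the difference is signed or normalized, and the function name is hypothetical.

```python
def residual_map(original, reconstructed):
    """Return the pixel-wise absolute difference between two 2D images."""
    return [[abs(o - r) for o, r in zip(o_row, r_row)]
            for o_row, r_row in zip(original, reconstructed)]
```

Large values mark pixels that the reconstruction model failed to reproduce, which is why such a map can hint at regions with potential lesions.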
A Visual Perception-Based Tunable Framework and Evaluation Benchmark for H.265/HEVC ROI Encryption
ROI selective encryption, as an efficient privacy protection technique, encrypts only the key regions in the video, thereby ensuring security while minimizing the impact on coding efficiency. However, existing ROI-based video encryption methods suffer from insufficient flexibility and the lack of a unified evaluation system. To address these issues, we propose a visual perception-based tunable framework and evaluation benchmark for H.265/HEVC ROI encryption. Our scheme makes three key contributions: 1) An ROI recognition module based on a visual perception network is proposed to accurately identify the ROI in videos. 2) A three-level tunable encryption strategy is implemented to balance security and real-time performance. 3) A unified ROI encryption evaluation benchmark is developed to provide a standardized quantitative platform for subsequent research. Together, these three components provide a new solution and a unified performance evaluation method for the ROI selective encryption field. Experimental results indicate that the proposed benchmark can comprehensively measure the performance of ROI selective encryption. Compared to existing ROI encryption algorithms, our proposed enhanced-level and advanced-level encryption exhibit superior performance on multiple metrics. In general, the proposed framework effectively meets the privacy protection requirements in H.265/HEVC and provides a reliable solution for secure and efficient processing of sensitive video content.
OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation
The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib's PSFs, which is assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.
comment: The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.
comment: Preprint
Time-causal and time-recursive wavelets
When applying wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissible class of kernels, as well as their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 28 pages, 11 figures
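A cascade of truncated exponential kernels can be approximated in discrete time by a chain of first-order recursive filters, each of which only looks at the current input and its own previous output. The sketch below illustrates that time-causal and time-recursive structure. It is a simplified illustration, not the paper's construction: the time constants passed in are hypothetical, the signal is assumed to be zero before its first sample, and a backward difference stands in for the temporal derivative.

```python
def recursive_smooth(signal, mu):
    """First-order recursive smoothing with time constant `mu` (in samples)."""
    out = []
    prev = 0.0  # assumes the signal is zero before the first sample
    for x in signal:
        prev = prev + (x - prev) / (1.0 + mu)
        out.append(prev)
    return out

def cascade_smooth(signal, time_constants):
    """Apply several first-order recursive filters in cascade."""
    for mu in time_constants:
        signal = recursive_smooth(signal, mu)
    return signal

def backward_difference(signal):
    """Time-causal discrete approximation of the first temporal derivative."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out
```

Each output sample depends only on present and past inputs, and the memory cost is one state value per filter stage, which is what makes the scheme suitable for streaming use.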
Shape-Adapting Gated Experts: Dynamic Expert Routing for Colonoscopic Lesion Segmentation
The substantial diversity in cell scale and form remains a primary challenge in computer-aided cancer detection on gigapixel Whole Slide Images (WSIs), attributable to cellular heterogeneity. Existing CNN-Transformer hybrids rely on static computation graphs with fixed routing, which consequently causes redundant computation and limits their adaptability to input variability. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures. SAGE's dual-path design features a backbone stream that preserves representation and selectively activates an expert path through hierarchical gating. This gating mechanism operates at multiple hierarchical levels, performing a two-level, hierarchical selection between shared and specialized experts to modulate model logits for Top-K activation. Our Shape-Adapting Hub (SA-Hub) harmonizes structural and semantic representations across the CNN and the Transformer module, effectively bridging diverse modules. Embodied as SAGE-UNet, our model achieves superior segmentation on three medical benchmarks: EBHI, DigestPath, and GlaS, yielding state-of-the-art Dice Scores of 95.57%, 95.16%, and 94.17%, respectively, and robustly generalizes across domains by adaptively balancing local refinement and global context. SAGE provides a scalable foundation for dynamic expert routing, enabling flexible visual reasoning.
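The Top-K activation mentioned above can be illustrated with a flat, single-level gate. This is a simplified sketch with a hypothetical function name: it does not model the two-level selection between shared and specialized experts, and the softmax over the selected logits is an assumed normalization.

```python
import math

def top_k_gate(logits, k):
    """Return routing weights that are non-zero only for the k largest logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]
```

Experts with zero weight need not be evaluated at all, which is where the saving over a static computation graph would come from.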
Deep learning-derived arterial input function for dynamic brain PET
Dynamic positron emission tomography (PET) imaging combined with radiotracer kinetic modeling is a powerful technique for visualizing biological processes in the brain, offering valuable insights into brain functions and neurological disorders such as Alzheimer's and Parkinson's diseases. Accurate kinetic modeling relies heavily on the use of a metabolite-corrected arterial input function (AIF), which typically requires invasive and labor-intensive arterial blood sampling. While alternative non-invasive approaches have been proposed, they often compromise accuracy or still necessitate at least one invasive blood sampling. In this study, we present the deep learning-derived arterial input function (DLIF), a deep learning framework capable of estimating a metabolite-corrected AIF directly from dynamic PET image sequences without any blood sampling. We validated DLIF using existing dynamic PET patient data. We compared DLIF and resulting parametric maps against ground truth measurements. Our evaluation shows that DLIF achieves accurate and robust AIF estimation. By leveraging deep learning's ability to capture complex temporal dynamics and incorporating prior knowledge of typical AIF shapes through basis functions, DLIF provides a rapid, accurate, and entirely non-invasive alternative to traditional AIF measurement methods.
comment: Accepted to NeuroImage
Learning Hierarchical Sparse Transform Coding of 3DGS
3D Gaussian Splatting (3DGS) supports fast, high quality, novel view synthesis but has a heavy memory footprint, making the compression of its model crucial. Current state-of-the-art (SOTA) 3DGS compression methods adopt an anchor-based architecture that pairs the Scaffold-GS representation with conditional entropy coding. However, these methods forego the analysis-synthesis transform, a vital mechanism in visual data compression. As a result, redundancy remains intact in the signal and its removal is left to the entropy coder, which computationally overburdens the entropy coding module, increasing coding latency. Even with added complexity, thorough redundancy removal is a task unsuited to an entropy coder. To fix this critical omission, we introduce a Sparsity-guided Hierarchical Transform Coding (SHTC) method, the first study on the end-to-end learned neural transform coding of 3DGS. SHTC applies KLT to decorrelate intra-anchor attributes, followed by quantization and entropy coding, and then compresses KLT residuals with a low-complexity, scene-adaptive neural transform. Aided by the sparsity prior and deep unfolding technique, the learned transform uses only a few trainable parameters, reducing the memory usage. Overall, SHTC achieves an appreciably improved R-D performance and at the same time higher decoding speed over SOTA. Its prior-guided, parameter-efficient design may also inspire low-complexity neural image and video codecs. Our code will be released at https://github.com/hxu160/SHTC_for_3DGS_compression.
comment: Our code will be released at https://github.com/hxu160/SHTC_for_3DGS_compression
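The decorrelating effect of a KLT can be seen on a toy case with two attributes per sample, where the transform reduces to a rotation onto the principal axes of the covariance matrix. This sketch is illustrative only: real 3DGS anchors carry many more attributes, which would call for a general eigen-decomposition, and the function names are hypothetical.

```python
import math

def klt_2d(points):
    """Rotate mean-centred 2D samples onto the eigenvectors of their covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # Rotation angle that zeroes the off-diagonal covariance term.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    coeffs = [(c * x + s * y, -s * x + c * y) for x, y in centered]
    return coeffs, (mx, my), theta

def inverse_klt_2d(coeffs, mean, theta):
    """Undo the rotation and restore the mean."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * u - s * v + mean[0], s * u + c * v + mean[1]) for u, v in coeffs]
```

After the transform the two coefficient channels are uncorrelated, so they can be quantized and entropy coded separately without leaving that redundancy for the entropy coder to model.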
Multimedia
FINE: Factorized multimodal sentiment analysis via mutual INformation Estimation
Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from task-irrelevant noise, hindering the learning of robust and accurate sentiment representations. To address these issues, we propose a factorized multimodal fusion framework that first disentangles each modality into shared and unique representations, and then suppresses task-irrelevant noise within both to retain only sentiment-critical representations. This fine-grained decomposition improves representation quality by reducing redundancy, promoting cross-modal complementarity, and isolating task-relevant sentiment cues. Rather than manipulating the feature space directly, we adopt a mutual information-based optimization strategy to guide the factorization process in a more stable and principled manner. To further support feature extraction and long-term temporal modeling, we introduce two auxiliary modules: a Mixture of Q-Formers, placed before factorization, which uses learnable queries to extract fine-grained affective features from multiple modalities, and a Dynamic Contrastive Queue, placed after factorization, which stores the latest high-level representations for contrastive learning, enabling the model to capture long-range discriminative patterns and improve class-level separability. Extensive experiments on multiple public datasets demonstrate that our method consistently outperforms existing approaches, validating the effectiveness and robustness of the proposed framework.
comment: 15 pages, 9 figures, conference
It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models
Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.
Field Test of 5G New Radio (NR) UL-MIMO and UL-256QAM for HD Live-Streaming
The exponential growth of User-Generated Content (UGC), especially High-Definition (HD) live video streaming, places a significant demand on the uplink capabilities of mobile networks. To address this, the 5G New Radio (NR) standard introduced key uplink enhancements, including Uplink Multi-Input Multi-Output (UL-MIMO) and Uplink 256QAM, to improve throughput and spectral efficiency. However, while the benefits of these features for raw data rates are well-documented, their practical impact on real-time applications like live-streaming is not yet well understood. This paper investigates the performance of UL-MIMO and UL-256QAM for HD live-streaming over a commercial 5G network using the Real-Time Messaging Protocol (RTMP). To ensure a fair assessment, we conduct a comparative analysis by modifying the modem firmware of commercial User Equipment (UE), allowing these features to be selectively enabled and disabled on the same device. Performance is evaluated based on key metrics, including dropped video frames and connection stability. Furthermore, this study analyzes 5G Radio Frequency (RF) parameters to quantify the spectral efficiency impact, specifically examining metrics derived from the Channel State Information (CSI) framework, including Reference Signal Received Power (CSI-RSRP), Reference Signal Received Quality (CSI-RSRQ), and Signal-to-Interference-plus-Noise Ratio (CSI-SINR).
comment: 2025 IEEE International Conference on Visual Communications and Image Processing (VCIP 2025), 1-4 December 2025, Klagenfurt, Austria
Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models
Medical AI systems face catastrophic forgetting when deployed in clinical settings, where models must learn new imaging protocols while retaining prior diagnostic capabilities. This challenge is particularly acute for medical vision-language models that must preserve complex cross-modal alignments between medical images and clinical terminology across diverse imaging modalities. We introduce Prompt-Aware Adaptive Elastic Weight Consolidation (PA-EWC), a novel continual learning approach that addresses catastrophic forgetting through prompt-guided parameter specialization. Our method systematically categorizes model parameters based on their functional roles in processing visual-descriptive, spatial-guided, and medical-semantic information, enabling targeted protection of critical knowledge while allowing adaptation to new clinical requirements. PA-EWC incorporates adaptive Fisher Information computation with gradient stability analysis and develops weighted complexity metrics based on medical terminology density. We evaluate our approach across five medical imaging datasets (Kvasir-SEG, ISIC 2018, CheXlocalize, BUSI, CAMUS) representing diverse modalities including endoscopy, dermoscopy, radiography, and ultrasound. Experimental results demonstrate that PA-EWC reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation.
comment: Accepted by 32nd International Conference on MultiMedia Modeling (MMM 2026)
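For orientation, the standard Elastic Weight Consolidation penalty that PA-EWC builds on is a Fisher-weighted quadratic distance from the previously learned parameters. The sketch below shows that baseline penalty with an optional per-parameter weight. The `group_weights` argument is a hypothetical stand-in for the prompt-aware parameter categories, whose actual form is not given in the abstract.

```python
def ewc_penalty(theta, theta_star, fisher, lam, group_weights=None):
    """Return 0.5 * lam * sum_i g_i * F_i * (theta_i - theta_star_i) ** 2.

    theta:         current parameters (flat list)
    theta_star:    parameters learned on earlier tasks
    fisher:        diagonal Fisher information estimates
    group_weights: hypothetical per-parameter protection weights (default 1.0)
    """
    if group_weights is None:
        group_weights = [1.0] * len(theta)
    return 0.5 * lam * sum(
        g * f * (t - s) ** 2
        for g, f, t, s in zip(group_weights, fisher, theta, theta_star)
    )
```

The penalty would be added to the loss of the new task, so parameters with high Fisher information are pulled back toward their earlier values more strongly than the rest.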
Emotion-Driven Personalized Recommendation for AI-Generated Content Using Multi-Modal Sentiment and Intent Analysis
With the rapid growth of AI-generated content (AIGC) across domains such as music, video, and literature, the demand for emotionally aware recommendation systems has become increasingly important. Traditional recommender systems primarily rely on user behavioral data such as clicks, views, or ratings, while neglecting users' real-time emotional and intentional states during content interaction. To address this limitation, this study proposes a Multi-Modal Emotion and Intent Recognition Model (MMEI) based on a BERT-based Cross-Modal Transformer with Attention-Based Fusion, integrated into a cloud-native personalized AIGC recommendation framework. The proposed system jointly processes visual (facial expression), auditory (speech tone), and textual (comments or utterances) modalities through pretrained encoders ViT, Wav2Vec2, and BERT, followed by an attention-based fusion module to learn emotion-intent representations. These embeddings are then used to drive personalized content recommendations through a contextual matching layer. Experiments conducted on benchmark emotion datasets (AIGC-INT, MELD, and CMU-MOSEI) and an AIGC interaction dataset demonstrate that the proposed MMEI model achieves a 4.3% improvement in F1-score and a 12.3% reduction in cross-entropy loss compared to the best fusion-based transformer baseline. Furthermore, user-level online evaluations reveal that emotion-driven recommendations increase engagement time by 15.2% and enhance satisfaction scores by 11.8%, confirming the model's effectiveness in aligning AI-generated content with users' affective and intentional states. This work highlights the potential of cross-modal emotional intelligence for next-generation AIGC ecosystems, enabling adaptive, empathetic, and context-aware recommendation experiences.
ExDDV: A New Dataset for Explainable Deepfake Detection in Video WACV 2026
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
comment: Accepted at WACV 2026
Signal Processing for Haptic Surface Modeling: a Review
Haptic feedback has been integrated into Virtual and Augmented Reality, complementing acoustic and visual information and contributing to an all-round immersive experience in multiple fields, spanning from the medical domain to entertainment and gaming. Haptic technologies involve complex cross-disciplinary research that encompasses sensing, data representation, interactive rendering, perception, and quality of experience. The standard processing pipeline consists of (I) sensing physical features in the real world using a transducer, (II) modeling and storing the collected information in some digital format, (III) communicating the information, and finally, (IV) rendering the haptic information through appropriate devices, thus producing a user experience (V) perceptually close to the original physical world. Among these areas, sensing, rendering and perception have been deeply investigated and are the subject of different comprehensive surveys available in the literature. In contrast, research dealing with haptic surface modeling and data representation still lacks a comprehensive dissection. In this work, we aim at providing an overview of modeling and representation of haptic surfaces from a signal processing perspective, covering the aspects that lie in between haptic information acquisition on one side and rendering and perception on the other side. We analyze, categorize, and compare research papers that address haptic surface modeling and data representation, pointing out existing gaps and possible research directions.
comment: 19 pages, 6 figures
A Visual Perception-Based Tunable Framework and Evaluation Benchmark for H.265/HEVC ROI Encryption
ROI selective encryption, as an efficient privacy protection technique, encrypts only the key regions in the video, thereby ensuring security while minimizing the impact on coding efficiency. However, existing ROI-based video encryption methods suffer from insufficient flexibility and the lack of a unified evaluation system. To address these issues, we propose a visual perception-based tunable framework and evaluation benchmark for H.265/HEVC ROI encryption. Our scheme makes three key contributions: 1) An ROI recognition module based on a visual perception network is proposed to accurately identify the ROI in videos. 2) A three-level tunable encryption strategy is implemented to balance security and real-time performance. 3) A unified ROI encryption evaluation benchmark is developed to provide a standardized quantitative platform for subsequent research. Together, these three components provide a new solution and a unified performance evaluation method for the ROI selective encryption field. Experimental results indicate that the proposed benchmark can comprehensively measure the performance of ROI selective encryption. Compared to existing ROI encryption algorithms, our proposed enhanced-level and advanced-level encryption exhibit superior performance on multiple metrics. In general, the proposed framework effectively meets the privacy protection requirements in H.265/HEVC and provides a reliable solution for secure and efficient processing of sensitive video content.
To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance AAAI 2026
Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.
comment: Accepted by AAAI 2026. This arXiv version includes additional details and extended appendix
Computation and Language
Latent Collaboration in Multi-Agent Systems
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first generates latent thoughts auto-regressively through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
comment: Project: https://github.com/Gen-Verse/LatentMAS
On Evaluating LLM Alignment by Evaluating LLMs as Judges NeurIPS 2025
Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
comment: NeurIPS 2025 Camera Ready
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox
From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding
Identifying discourse features in student conversations helps educational researchers recognize the curricular and pedagogical variables that cause students to engage in constructing knowledge rather than merely completing tasks. The manual analysis of student conversations to identify these discourse features is time-consuming and labor-intensive, which limits the scale and scope of studies. Leveraging natural language processing (NLP) techniques can facilitate the automatic detection of these discourse features, offering educational researchers scalable and data-driven insights. However, existing studies in NLP that focus on discourse in dialogue rarely address educational data. In this work, we address this gap by introducing an annotated educational dialogue dataset of student conversations featuring knowledge construction and task production discourse. We also establish baseline models for automatically predicting these discourse properties for each turn of talk within conversations, using the pre-trained large language models GPT-3.5 and Llama-3.1. Experimental results show that these state-of-the-art models perform suboptimally on this task, leaving room for future research.
Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition ICML 2025
Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
comment: Accepted at ICML 2025 Workshop on Machine Learning for Audio
DesignPref: Capturing Personal Preferences in Visual Design Generation
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preferences vary widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of the importance of various design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models
Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model-consensus agreement vs. 0.286 for human-human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
Adversarial Confusion Attack: Disrupting Multimodal Large Language Models
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
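As a rough illustration of the quantity this attack is described as maximizing, the sketch below computes the Shannon entropy of a next-token distribution from raw logits. It is only a minimal example: the function name `next_token_entropy` and the toy logit vectors are assumptions made here, and nothing below reproduces the paper's attack, ensemble, or PGD loop.

```python
import math

def next_token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits). Hypothetical helper."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy 4-token vocabulary (illustrative values, not model outputs).
confident = next_token_entropy([10.0, 0.0, 0.0, 0.0])  # one token dominates
confused = next_token_entropy([1.0, 1.0, 1.0, 1.0])    # uniform distribution
print(confident, confused)
```

A uniform distribution attains the maximum entropy, log of the vocabulary size, which is the regime an entropy-maximizing perturbation would push the model toward.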
Generation, Evaluation, and Explanation of Novelists' Styles with Single-Token Prompts
Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generative models when no paired data exist, and evaluating stylistic text without relying only on human judgment. In this work, we present a framework for both generating and evaluating sentences in the style of 19th-century novelists. Large language models are fine-tuned with minimal, single-token prompts to produce text in the voices of authors such as Dickens, Austen, Twain, Alcott, and Melville. To assess these generative models, we employ a transformer-based detector trained on authentic sentences, using it both as a classifier and as a tool for stylistic explanation. We complement this with syntactic comparisons and explainable AI methods, including attention-based and gradient-based analyses, to identify the linguistic cues that drive stylistic imitation. Our findings show that the generated text reflects the authors' distinctive patterns and that AI-based evaluation offers a reliable alternative to human assessment. All artifacts of this work are published online.
A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines
Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study proposes a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive evaluation framework. We apply our evaluation framework to compare two stemmers for Bangla (BNLTK) and English (Snowball), and our results reveal a significant issue, prompting us to analyze their performance in detail. While the Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), SES alone is insufficient because our proposed safety measure, ANLD, reveals that this high SES is due to harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream performance. In contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; therefore, it is a more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
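To make the ANLD measure above more concrete, here is a minimal sketch of an average normalized Levenshtein distance over (original, stemmed) word pairs. The abstract does not state the normalization, so dividing by the length of the longer word is an assumption made here; the function names and the example word pairs are likewise illustrative and not taken from the paper.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def anld(pairs):
    """Mean of edit distance / max(len) over (original, stemmed) pairs.

    The max-length normalization is an assumption, not the paper's definition.
    """
    total = 0.0
    for original, stemmed in pairs:
        longest = max(len(original), len(stemmed))
        total += levenshtein(original, stemmed) / longest if longest else 0.0
    return total / len(pairs)

# Hypothetical stemmer output, for illustration only.
pairs = [("running", "run"), ("cats", "cat"), ("play", "play")]
score = anld(pairs)
print(round(score, 4))
```

Under this normalization the score lies in [0, 1]: a value near 0 means stems stay close to their source words, while larger values signal more aggressive truncation.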
BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali
Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
Soft Adaptive Policy Optimization
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models AAAI 2026
Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. First, LLMs effectively encode the underlying relationships between analogous entities; both attributive and relational information propagate through mid-upper layers in correct cases, whereas reasoning failures reflect missing relational information within these layers. Second, unlike humans, LLMs often struggle not only when relational information is missing, but also when attempting to apply it to new entities. In such cases, strategically patching hidden representations at critical token positions can facilitate information transfer to a certain extent. Lastly, successful analogical reasoning in LLMs is marked by strong structural alignment between analogous situations, whereas failures often reflect degraded or misplaced alignment. Overall, our findings reveal that LLMs exhibit emerging but limited capabilities in encoding and applying high-level relational concepts, highlighting both parallels and gaps with human cognition.
comment: AAAI 2026
Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios AAAI-2026
Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
comment: accepted by AAAI-2026
Geometry of Decision Making in Language Models NeurIPS 2025
Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of intrinsic dimension (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.
comment: Accepted at NeurIPS 2025
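The abstract above does not name the ID estimators it uses. As a hedged illustration of what estimating intrinsic dimension from a point cloud can look like, the sketch below implements a maximum-likelihood variant of the TwoNN idea, which uses the ratio of each point's second to first nearest-neighbor distance. The choice of estimator, the function name, and the synthetic data (a 2D plane linearly embedded in 5D, standing in for hidden states) are all assumptions made here.

```python
import math
import random

def twonn_intrinsic_dimension(points):
    """MLE-style TwoNN estimate: N / sum(log(r2 / r1)) over all points."""
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf                # nearest and second-nearest distances
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:                        # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)

# Synthetic data: 400 points on a 2D plane embedded in 5 ambient dimensions.
rng = random.Random(0)
plane = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    plane.append((u, v, u + v, u - v, 0.5 * u))

estimate = twonn_intrinsic_dimension(plane)
print(estimate)
```

The estimate should land near 2 rather than 5, which is the sense in which representations can occupy a low-dimensional manifold inside a much larger ambient space. The brute-force neighbor search is quadratic and only suitable for small toy inputs.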
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits NeurIPS 2025
Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
comment: Accepted at NeurIPS 2025
REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance
The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, all of which are crucial for real-time use. To address these challenges, we propose the REason-guided Fact-checking with Latent EXplanations (REFLEX) paradigm, a plug-and-play, self-refining paradigm that leverages the internal knowledge in the backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, REFLEX achieves state-of-the-art performance. Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.
KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP
Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.
comment: 3 pages, 1 figure, 2 tables. Preprint
SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Grid-based Discontinuous NER Models
Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
comment: 9 pages, 5 figures
"When Data is Scarce, Prompt Smarter"... Approaches to Grammatical Error Correction in Low-Resource Settings AACL 2025
Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally to other languages. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with a few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task, ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.
comment: 10 pages, 5 figures, 5 tables; Accept-demonstration at BHASHA Workshop, IJCNLP-AACL 2025
Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach
Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.
EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning
This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline codes are publicly available at https://github.com/xingfengli/EM2LDL.
comment: Submitted to IEEE Transactions on Affective Computing
The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs
Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment, a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically, the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper, we evaluate whether current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%. We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
comment: 28 pages
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation AAAI 2026
Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate the entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation codes. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLM-based high-performance GPU kernel generation. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves nearly 100% and 70% accuracy at Levels 1-2 and Level 3, respectively, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.
comment: 9 pages, 2 figures, accepted by AAAI 2026
More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering
With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of all possible answers, ultimately degrading the models' reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two components: first, a reasoning generation stage, where the model is prompted to produce supportive reasonings for each answer option, and then, a reasoning-guided agreement stage, where the generated reasonings are synthesized to select the most plausible answer. Through comprehensive evaluations, BiasPrompting demonstrates significant improvements in five widely used multiple-choice question answering benchmarks. Our experiments showcase that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.
comment: Accepted at the 41st ACM/SIGAPP Symposium On Applied Computing (SAC 2026), Main Conference
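The two stages described in the BiasPrompting abstract (per-option reasoning generation, then reasoning-guided agreement) can be sketched as follows. This is only an illustration under stated assumptions: the `llm` callable, the function name, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List


def bias_prompting(question: str, options: List[str], llm: Callable[[str], str]) -> str:
    """Two-stage sketch: argue for every option, then pick one given all arguments.

    `llm` is a hypothetical callable mapping a prompt string to a completion string.
    """
    # Stage 1 (reasoning generation): request supportive reasoning for each option.
    reasonings: Dict[str, str] = {}
    for option in options:
        prompt = (
            f"Question: {question}\n"
            f"Assume the answer is: {option}\n"
            "Explain why this answer could be correct."
        )
        reasonings[option] = llm(prompt)

    # Stage 2 (reasoning-guided agreement): show all reasonings and ask for a final choice.
    arguments = "\n".join(f"- {option}: {why}" for option, why in reasonings.items())
    final_prompt = (
        f"Question: {question}\n"
        f"Arguments for each option:\n{arguments}\n"
        "Considering all arguments, reply with the single most plausible option."
    )
    return llm(final_prompt).strip()
```

With N answer options this sketch issues N + 1 model calls, so the extra reasoning comes at a cost that grows linearly with the number of options.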
MTA: A Merge-then-Adapt Framework for Personalized Large Language Model
Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
Online-PVLM: Advancing Personalized VLMs with Online Concept Learning
Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in interactions aligned with user-specific concepts (e.g., identifying a user's bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-scale scenarios, where efficient retrieval of concept embeddings is not achievable. To close this gap, we propose Online-PVLM, a framework for online concept learning by leveraging hyperbolic representations. Our approach adopts a training-free paradigm for generating concept embeddings at test time, making the use of personalized VLMs both scalable and efficient. In addition, we develop OP-Eval, a comprehensive and large-scale benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types, designed to rigorously assess online concept learning in realistic scenarios. Extensive experiments demonstrate the state-of-the-art performance of our proposed framework. Our source code and dataset will be made available.
comment: Work in Progress
A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media AAAI-26
Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous "split-then-balance" pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard ("Social Media Screener") designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.
comment: Accepted for Oral Presentation at the AAAI-26 Bridge Program on AI for Medicine and Healthcare (AIMedHealth). To appear in Proceedings of Machine Learning Research (PMLR)
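The gap between the accuracy (0.92) and Macro F1 (0.76) reported for the mental health and cyberbullying classifier is typical of evaluation on an imbalanced test set, because Macro F1 weights every class equally. The sketch below uses the standard definitions of both metrics; the toy labels are invented for illustration and are not from the paper's data.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example (hypothetical labels): the rare class is always missed.
y_true = ["neutral"] * 9 + ["rare"]
y_pred = ["neutral"] * 10

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
print(toy_accuracy, round(toy_macro_f1, 3))
```

In this toy case the classifier scores 0.9 accuracy while its Macro F1 is below 0.5, because the missed minority class contributes an F1 of zero to the average.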
Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test
Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a "reversal curse," and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training, one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
comment: 19 pages, 4 figures. Code available at https://github.com/mihirs-0/synass
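The notion of an entropy floor for the inverse task in the directional asymmetry benchmark can be illustrated with a short calculation. This sketch assumes that each output string has exactly K equally likely preimages, so the best achievable inverse cross-entropy is ln K nats; that uniformity assumption and the function names are illustrative, and the paper's exact construction may differ.

```python
import math


def inverse_entropy_floor_nats(k: int) -> float:
    """Entropy floor in nats when each target has k equally likely sources (assumed)."""
    return math.log(k)


def excess_loss(observed_loss_nats: float, k: int) -> float:
    """Loss remaining after subtracting the irreducible floor for branching factor k."""
    return observed_loss_nats - inverse_entropy_floor_nats(k)


floor_at_k5 = inverse_entropy_floor_nats(5)
print(round(floor_at_k5, 3))
```

Under this assumption the floor at K=5 is about 1.609 nats, and any loss a model shows above that value counts as excess loss in the sense used by the abstract.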
$\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers
Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.
comment: 13 pages, including 3 figures and 3 tables
AppSelectBench: Application-Level Tool Selection Benchmark
Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://github.com/microsoft/appselectbench.
EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning
The rapid advancement of large language models (LLMs) has increased the demand for domain-specialized variants in areas such as law, healthcare, and finance. However, their large size remains a barrier to deployment in resource-constrained environments, and existing compression methods either generalize poorly across domains or incur high overhead. In this work, we propose EfficientXpert, a lightweight domain-pruning framework that combines a propagation-aware pruning criterion (Foresight Mask) with an efficient adapter-update algorithm (Partial Brain Surgeon). Integrated into the LoRA fine-tuning process, EfficientXpert enables a one-step transformation of general pretrained models into sparse, domain-adapted experts. Across health and legal tasks, it retains up to 98% of dense-model performance at 40% sparsity, outperforming state-of-the-art methods. Further analysis reveals substantial domain-dependent structural shifts that degrade the effectiveness of general pruning masks, underscoring the need for adaptive, domain-aware pruning strategies.
CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding
Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.
MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes, such as freezing modules or applying uniform regularization, either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction
Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
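Recall and false-positive rate, the two detection metrics this abstract trades off across prompting strategies, follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the counts are made up for illustration and are not results from the MEDEC evaluation.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of actual errors that were flagged: TP / (TP + FN)."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of error-free items that were wrongly flagged: FP / (FP + TN)."""
    total = false_positives + true_negatives
    return false_positives / total if total else 0.0


# Hypothetical counts for one model on one detection subtask.
toy_recall = recall(true_positives=40, false_negatives=10)
toy_fpr = false_positive_rate(false_positives=15, true_negatives=85)
print(toy_recall, toy_fpr)
```

A prompting strategy that flags more items tends to raise both numbers at once, which is consistent with the abstract's observation that SPR improved recall but increased FPR.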
Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs
Personalized Large Language Models (LLMs) have been shown to be an effective way to create more engaging and enjoyable user-AI interactions. While previous studies have explored using prompts to elicit specific personality traits in LLMs, they have not optimized these prompts to maximize personality expression. To address this limitation, we propose PersonaPulse: Dynamic Profile Optimization for Realistic Personality Expression in LLMs, a framework that leverages LLMs' inherent knowledge of personality traits to iteratively enhance role-play prompts while integrating a situational response benchmark as a scoring tool, ensuring a more realistic and contextually grounded evaluation to guide the optimization process. Quantitative evaluations demonstrate that the prompts generated by PersonaPulse outperform those of prior work, which were designed based on personality descriptions from psychological studies. Additionally, we explore the relationship between model size and personality modeling through extensive experiments. Finally, we find that, for certain personality traits, the extent of personality evocation can be partially controlled by pausing the optimization process. These findings underscore the importance of prompt optimization in shaping personality expression within LLMs, offering valuable insights for future research on adaptive AI interactions.
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.
Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana
Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to minimise manual effort and make the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
comment: Published in the Fourth Workshop on Processing Emotions, Decisions and Opinions (EDO 2023) at 10th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2023), Poznań, Poland, 21-23 April 2023. ISBN: 978-83-232-4176-8
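The 34% figure in the sentiment labelling abstract is consistent with an unweighted average of the three per-language accuracies. The sketch below reproduces that arithmetic; equal weighting of the languages is an assumption, since the abstract does not say how the average was taken.

```python
# Per-language labelling accuracies (in percent) as quoted in the abstract.
accuracies = {"English": 66, "Sepedi": 69, "Setswana": 63}

# Assumption: a simple unweighted mean across the three languages.
mean_accuracy = sum(accuracies.values()) / len(accuracies)
remaining_to_correct = 100 - mean_accuracy

print(mean_accuracy, remaining_to_correct)  # 66.0 34.0
```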
Breaking Bad: Norms for Valence, Arousal, and Dominance for over 10k English Multiword Expressions
Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 2018, include VAD association ratings for words. Here, we present a complement to it, which has human ratings of valence, arousal, and dominance for 10k English Multiword Expressions (MWEs) and their constituent words. We also increase the coverage of unigrams, especially words that have become more common since 2018. In all, the new NRC VAD Lexicon v2 now has entries for 10k MWEs and 25k words, in addition to the entries in v1. We show that the associations are highly reliable. We use the lexicon to examine emotional characteristics of MWEs, including: 1. The degree to which MWEs (idioms, noun compounds, and verb particle constructions) exhibit strong emotionality; 2. The degree of emotional compositionality in MWEs. The lexicon enables a wide variety of research in NLP, Psychology, Public Health, Digital Humanities, and Social Sciences. The NRC VAD Lexicon v2 is freely available through the project webpage: http://saifmohammad.com/WebPages/nrc-vad.html
Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization
Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.
comment: under review
Emergence and Localisation of Semantic Role Circuits in LLMs
Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.
Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation
Argument mining is a subfield of natural language processing that identifies and extracts argument components, such as premises and conclusions, within a text and recognizes the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus (Peldszus and Stede, 2015) and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves the performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant by achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and that it offers a practical pathway for the argument mining task to overcome the data shortage in low-resource languages.
comment: Preprint. Under review
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet they often fail to learn from accumulated interactions and lose valuable contextual insights. This limitation calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries WACV 2026
Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.
comment: Accepted at WACV 2026
Length-MAX Tokenizer for Language Models
We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14-18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing, and often improving, downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.
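To make the "average tokens per character" objective in the Length-MAX abstract concrete, the sketch below measures it for two toy vocabularies. It is only an illustration: the greedy longest-match segmenter, the function names, and both vocabularies are assumptions made here, not the paper's graph-partitioning algorithm or its learned vocabulary.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation; falls back to single characters.

    Hypothetical helper for illustration, not the Length-MAX algorithm.
    """
    max_len = max(len(tok) for tok in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


def tokens_per_char(text, vocab):
    """Average number of tokens emitted per input character (lower is better)."""
    return len(greedy_tokenize(text, vocab)) / len(text)


text = "the cat sat"
short_vocab = {"th", "he", "in", "er"}   # toy vocabulary of short pieces
long_vocab = {"the ", "cat ", "sat"}     # toy vocabulary of longer pieces

ratio_short = tokens_per_char(text, short_vocab)
ratio_long = tokens_per_char(text, long_vocab)
```

A vocabulary whose entries cover longer spans of the text gives a lower ratio, which is the quantity the abstract says the tokenizer minimizes.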
Structured Prompting Enables More Robust, Holistic Evaluation of Language Models
As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, in place of a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
comment: 11 pages, 7 figures, technical report (preprint)
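The core step described in the OVI abstract, iteratively updating a latent to maximize cosine similarity with a text embedding, can be sketched with plain gradient ascent. This is a minimal sketch under stated assumptions: the 4-dimensional vectors, the step size, and the function names are made up for illustration, and the Mahalanobis and Nearest-Neighbor regularizers mentioned in the abstract are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_grad(z, t):
    """Gradient of cosine(z, t) with respect to z."""
    nz = math.sqrt(sum(x * x for x in z))
    nt = math.sqrt(sum(y * y for y in t))
    dot = sum(x * y for x, y in zip(z, t))
    return [ti / (nz * nt) - dot * zi / (nz ** 3 * nt) for zi, ti in zip(z, t)]


def invert(text_emb, z0, steps=200, lr=0.5):
    """Gradient ascent on cosine similarity (hypothetical, unregularized)."""
    z = list(z0)
    for _ in range(steps):
        g = cosine_grad(z, text_emb)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z


text_emb = [0.2, -1.0, 0.5, 0.7]   # stand-in for a prompt embedding
z0 = [1.0, 0.3, -0.4, 0.1]         # stand-in for the pseudo-token initialization
z = invert(text_emb, z0)

initial_sim = cosine(z0, text_emb)
final_sim = cosine(z, text_emb)
```

In this unconstrained toy the latent simply rotates toward the text embedding. The abstract notes that using the text embedding directly as a prior scores well on benchmarks despite lower perceptual quality, and that the added constraints are what improve visual fidelity.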
SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models
Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.
Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models
Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.
comment: 11 pages, 2 tables, 8 figures
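The multi-prefix definition in the abstract above (a sequence counts as memorized when a target number of distinct prefixes elicit it) can be written as a small predicate. The sketch below is an assumption-laden illustration: it checks a fixed pool of candidate prefixes instead of running the adversarial search the abstract describes, and `toy_generate`, the prefix pool, and the threshold are hypothetical stand-ins for a real model and search procedure.

```python
def eliciting_prefixes(target, candidate_prefixes, generate):
    """Return the distinct prefixes whose continuation contains `target`."""
    hits = set()
    for prefix in candidate_prefixes:
        if target in generate(prefix):
            hits.add(prefix)
    return hits


def is_memorized(target, candidate_prefixes, generate, min_prefixes):
    """Multi-prefix test: enough distinct prefixes must elicit the target."""
    return len(eliciting_prefixes(target, candidate_prefixes, generate)) >= min_prefixes


def toy_generate(prefix):
    """Hypothetical stand-in for a language model's continuation."""
    table = {
        "Call me": " Ishmael. Some years ago",
        "The narrator says: call me": " Ishmael. Some years ago",
        "Moby-Dick opens with": " 'Call me Ishmael. Some years ago'",
        "My phone number is": " not something I can share",
    }
    return table.get(prefix, "")


prefixes = [
    "Call me",
    "The narrator says: call me",
    "Moby-Dick opens with",
    "My phone number is",
]

famous = is_memorized("Ishmael. Some years ago", prefixes, toy_generate, min_prefixes=3)
private = is_memorized("555-0100", prefixes, toy_generate, min_prefixes=3)
```

Counting retrieval paths rather than testing a single prompt is what the abstract calls quantifying the robustness of a memory.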
CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design
User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.
Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. However, the risk of these models assisting unlawful activities remains underexplored. In this study, we define this high-risk behavior as complicit facilitation - the provision of guidance or support that enables illicit user instructions - and present four empirical studies that assess its prevalence in widely deployed LLMs. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents to assess LLMs' complicit facilitation behavior. Our findings reveal widespread LLM susceptibility to complicit facilitation, with GPT-4o providing illicit assistance in nearly half of tested cases. Moreover, LLMs exhibit deficient performance in delivering credible legal warnings and positive guidance. Further analysis uncovers substantial safety variation across socio-legal contexts. On the legal side, we observe heightened complicity for crimes against societal interests, non-extreme but frequently occurring violations, and malicious intents driven by subjective motives or deceptive justifications. On the social side, we identify demographic disparities that reveal concerning complicit patterns towards marginalized and disadvantaged groups, with older adults, racial minorities, and individuals in lower-prestige occupations disproportionately more likely to receive unlawful guidance. Analysis of model reasoning traces suggests that model-perceived stereotypes, characterized along warmth and competence, are associated with the model's complicit behavior. Finally, we demonstrate that existing safety alignment strategies are insufficient and may even exacerbate complicit behavior.
InvisibleBench: A Deployment Gate for Caregiving Relationship AI
InvisibleBench is a deployment gate for caregiving-relationship AI, evaluating 3-20+ turn interactions across five dimensions: Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory. The benchmark includes autofail conditions for missed crises, medical advice (WOPR Act), harmful information, and attachment engineering. We evaluate four frontier models across 17 scenarios (N=68) spanning three complexity tiers. All models show significant safety gaps (11.8-44.8 percent crisis detection), indicating the necessity of deterministic crisis routing in production systems. DeepSeek Chat v3 achieves the highest overall score (75.9 percent), while strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent). We release all scenarios, judge prompts, and scoring configurations with code. InvisibleBench extends single-turn safety tests by evaluating longitudinal risk, where real harms emerge. No clinical claims; this is a deployment-readiness evaluation.
comment: 29 pages, 3 figures
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts, where models must integrate both visual and textual inputs, continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions NeurIPS 2025
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
comment: Project homepage: https://flageval-baai.github.io/LRM-Eval/ This work will also be presented at NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models (FoRLM); update with trials on Gemini 3 Pro
ExDDV: A New Dataset for Explainable Deepfake Detection in Video WACV 2026
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
comment: Accepted at WACV 2026
When to Think and When to Look: Uncertainty-Guided Lookback
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
OceanGym: A Benchmark Environment for Underwater Embodied Agents
We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
comment: Work in progress
LightMem: Lightweight and Efficient Memory-Augmented Generation
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.
comment: Work in progress
Counterfactual Simulatability of LLM Explanations for Generation Tasks
LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability: how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.
comment: INLG25
EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge this gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench significantly advance the development of more reliable and clinically relevant EHR analysis.
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models AAAI 2026
Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as 'jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.
comment: Accepted as a workshop paper at AAAI 2026
LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation
The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.
Multi-Modal Data Exploration via Language Agents AACL 2025
International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying both structured databases and unstructured modalities (e.g., texts, images) in natural language remains largely unexplored. In this paper, we propose M$^2$EX, a system that enables multi-modal data exploration via language agents. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) M$^2$EX leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis and to orchestrate modality-specific experts in an efficient query plan. (3) Experimental results on multi-modal datasets, encompassing relational data, text, and images, demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling in both accuracy and various performance metrics, including query latency, API costs, and planning efficiency, thanks to the more effective utilization of the reasoning capabilities of LLMs.
comment: Accepted to the IJCNLP AACL 2025 Findings
Computational Turing Test Reveals Systematic Differences Between Human and AI Language
Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.
Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling
State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
ConfTuner: Training Large Language Models to Express Their Confidence Verbally NeurIPS 2025
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
comment: Accepted by NeurIPS 2025
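To make the idea of a proper scoring rule for verbalized confidence concrete, here is a minimal sketch of an expected Brier score taken over a distribution on confidence tokens. The abstract does not give the exact form of the tokenized Brier score, so the function name `expected_brier` and the dictionary layout are assumptions for illustration only, not the authors' implementation.

```python
def expected_brier(conf_probs, correct):
    """Expected Brier score under a distribution over stated confidence levels.

    conf_probs: dict mapping a stated confidence in [0, 1] (e.g. 0.8 for "80%")
                to the probability the model puts on the token expressing it.
                This layout is an illustrative assumption.
    correct:    1 if the model's answer was right, 0 otherwise.
    """
    total = sum(conf_probs.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("confidence-token probabilities must sum to 1")
    # Each confidence level c is penalized by its squared distance to the outcome.
    return sum(p * (c - correct) ** 2 for c, p in conf_probs.items())


# An overconfident wrong answer is penalized more than a hedged wrong answer.
overconfident = expected_brier({0.9: 1.0}, correct=0)
hedged = expected_brier({0.5: 1.0}, correct=0)
```

The squared penalty is what makes the rule proper: if an answer is correct with probability q, the expected penalty q(c-1)^2 + (1-q)c^2 is minimized at c = q, so reporting the true probability is the best strategy.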
From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation EMNLP 2025
Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat the infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI generation, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset against three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human- and machine-generated content at multimodal levels.
comment: Accepted to Findings of the Association for Computational Linguistics: EMNLP 2025
Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three methods of different architectural complexities: representation combination, representation summation, and attentive representations. Next, building on the limitation of fusion learning observed in empirical comparison, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability of the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction and selection.
TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models EMNLP 2025
Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by the "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps, capabilities that are underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 84% accuracy in Classic mode, but performance drops to 18% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.
comment: Accepted to Findings of the Association for Computational Linguistics: EMNLP 2025
MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.
MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision AAAI26
Medical language models face critical barriers to real-world clinical reasoning applications. Mainstream efforts, which fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, are still far from a versatile, credible, and efficient language model for clinical reasoning. To this end, we propose MedS3, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS3 outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS3 achieves robust and faithful reasoning behavior.
comment: 20 pages; Accepted as a Main paper at AAAI26
LaajMeter: A Framework for LaaJ Evaluation
Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). The analysis of a LaaJ, commonly referred to as meta-evaluation, poses significant challenges in domain-specific contexts. In such domains, in contrast to general domains, annotated data is scarce and expert evaluation is costly. As a result, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. Therefore, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate LaaJs for specific tasks: they can test whether their metrics correctly distinguish between high- and low-quality (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.
Toward Honest Language Models for Deductive Reasoning
Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose ACNCHOR, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training
Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
comment: 7 pages, 5 figures
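Since the abstract relies on rejected-preferred answer pairs for DPO training, a minimal sketch of the standard per-pair DPO loss may help show why both answers are needed. The function and argument names are hypothetical, the inputs are assumed to be summed log-probabilities of each answer under the trained policy and a frozen reference model, and beta = 0.1 is an illustrative default rather than a value taken from the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preferred/rejected pair.

    All four inputs are sequence log-probabilities (assumed precomputed);
    beta scales how strongly the policy may drift from the reference model.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# With no preference margin the loss is log(2); it falls as the policy
# shifts probability toward the preferred answer.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on the difference between the two answers' gains over the reference model, which is why each training example must pair a preferred reasoning trajectory with a rejected one.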
SAS: Simulated Attention Score
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, and grouped-query attention. We further analyze MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to the feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
comment: Tech Report
Steganographic Backdoor Attacks in NLP: Ultra-Low Poisoning and Defense Evasion
Transformer models are foundational to natural language processing (NLP) applications, yet remain vulnerable to backdoor attacks introduced through poisoned data, which implant hidden behaviors during training. To strengthen the ability to prevent such compromises, recent research has focused on designing increasingly stealthy attacks to stress-test existing defenses, pairing backdoor behaviors with stylized artifact or token-level perturbation triggers. However, this trend diverts attention from the harder and more realistic case: making the model respond to semantic triggers such as specific names or entities, where a successful backdoor could manipulate outputs tied to real people or events in deployed systems. Motivated by this growing disconnect, we introduce SteganoBackdoor, bringing stealth techniques back into line with practical threat models. Leveraging innocuous properties from natural-language steganography, SteganoBackdoor applies a gradient-guided data optimization process to transform semantic trigger seeds into steganographic carriers that embed a high backdoor payload, remain fluent, and exhibit no representational resemblance to the trigger. Across diverse experimental settings, SteganoBackdoor achieves over 99% attack success at an order-of-magnitude lower data-poisoning rate than prior approaches while maintaining unparalleled evasion against a comprehensive suite of data-level defenses. By revealing this practical and covert attack, SteganoBackdoor highlights an urgent blind spot in current defenses and demands immediate attention to adversarial data defenses and real-world threat modeling.
Large Language Models in Argument Mining: A Survey
Large Language Models (LLMs) have fundamentally reshaped Argument Mining (AM), shifting it from a pipeline of supervised, task-specific classifiers to a spectrum of prompt-driven, retrieval-augmented, and reasoning-oriented paradigms. Yet existing surveys largely predate this transition, leaving unclear how LLMs alter task formulations, dataset design, evaluation methodology, and the theoretical foundations of computational argumentation. In this survey, we synthesise research and provide the first unified account of AM in the LLM era. We revisit canonical AM subtasks, i.e., claim and evidence detection, relation prediction, stance classification, argument quality assessment, and argumentative summarisation, and show how prompting, chain-of-thought reasoning, and in-context learning blur traditional task boundaries. We catalogue the rapid evolution of resources, including integrated multi-layer corpora and LLM-assisted annotation pipelines that introduce new opportunities as well as risks of bias and evaluation circularity. Building on this mapping, we identify emerging architectural patterns across LLM-based AM systems and consolidate evaluation practices spanning component-level accuracy, soft-label quality assessment, and LLM-judge reliability. Finally, we outline persistent challenges, including long-context reasoning, multimodal and multilingual robustness, interpretability, and cost-efficient deployment, and propose a forward-looking research agenda for LLM-driven computational argumentation.
comment: Work draft
Improved LLM Agents for Financial Document Question Answering
Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with a calculator agent that outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
comment: 13 pages, 5 figures. Unlike the previous version, LLM names are now unmasked
AI-Mediated Communication Reshapes Social Structure in Opinion-Diverse Groups
Group segregation or cohesion can emerge from micro-level communication, and AI-assisted messaging may shape this process. Here, we report a preregistered online experiment (N = 557 across 60 sessions) in which participants discussed controversial political topics over multiple rounds and could freely change groups. Some participants received real-time message suggestions from a large language model (LLM), either personalized to their stance (individual assistance) or incorporating their group members' perspectives (relational assistance). We find that small variations in AI-mediated communication cascade into macro-level differences in group composition. Participants with individual assistance send more messages and show greater stance-based clustering, whereas those with relational assistance use more receptive language and form more heterogeneous ties. Hybrid expressive processes, jointly produced by humans and AI, can reshape collective organization. The patterns of structural division and cohesion depend on how AI incorporates users' interaction context.
comment: Preprint, Under Review
HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations
Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincaré manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
comment: 12 pages
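To illustrate why a Poincaré manifold can encode hierarchical depth where Euclidean embeddings cannot, the following sketch computes the standard geodesic distance in the Poincaré ball. It shows only the textbook distance formula; the function name is hypothetical, and nothing here reproduces the paper's representation learner or fusion mechanism.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball.

    d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))


# Two pairs with the same angular separation: the pair near the boundary is
# far more distant than the pair near the origin.
near_origin = poincare_distance((0.1, 0.0), (0.0, 0.1))
near_boundary = poincare_distance((0.9, 0.0), (0.0, 0.9))
```

Because distances grow rapidly toward the boundary, general concepts can sit near the origin and specific ones near the edge, so the norm of an embedding acts as a rough proxy for depth in the hierarchy.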
Scalable Parameter-Light Spectral Method for Clustering Short Text Embeddings with a Cohesion-Based Evaluation Metric
Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
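The abstract describes the Cohesion Ratio only informally, as a measure of how much intra-cluster similarity exceeds the global similarity background. The sketch below is one plausible reading of that description and is an assumption, not the paper's definition: it divides the mean cosine similarity of same-cluster pairs by the mean cosine similarity of all pairs. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cohesion_style_ratio(vectors, labels):
    """Mean intra-cluster similarity divided by mean global similarity.

    Illustrative assumption only; values above 1 suggest that clusters are
    tighter than the dataset as a whole.
    """
    intra, overall = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        overall.append(sim)
        if labels[i] == labels[j]:
            intra.append(sim)
    if not intra:
        raise ValueError("no two points share a cluster")
    global_mean = sum(overall) / len(overall)
    if global_mean <= 0.0:
        raise ValueError("global mean similarity must be positive for a ratio")
    return (sum(intra) / len(intra)) / global_mean


points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
good = cohesion_style_ratio(points, [0, 0, 1, 1])  # clusters follow the geometry
bad = cohesion_style_ratio(points, [0, 1, 0, 1])   # clusters cut across it
```

A labeling that follows the geometry scores above 1 and a labeling that cuts across it scores below 1, which is the behavior one would want from a label-free quality measure of this kind.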
From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
comment: Accepted by NeurIPS 2025 (Poster)
Video Understanding with Large Language Models: A Survey
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. This survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
comment: Accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
Filtering with Self-Attention and Storing with MLP: One-Layer Transformers Can Provably Acquire and Extract Knowledge
Modern large language models (LLMs) demonstrate exceptional performance on knowledge-intensive tasks, yet the theoretical mechanisms underlying knowledge acquisition (storage and memorization) during pre-training and extraction (retrieval and recall) during inference after fine-tuning remain poorly understood. Although prior theoretical studies have explored these processes through analyses of training dynamics, they overlook critical components essential for a comprehensive theory: (1) the multi-layer perceptron (MLP), empirically identified as the primary module for knowledge storage; (2) out-of-distribution (OOD) adaptivity, which enables LLMs to generalize to unseen scenarios post-pre-training; and (3) next-token prediction, the standard autoregressive objective that encodes knowledge as conditional probabilities. In this work, we introduce, to the best of our knowledge, the first theoretical framework that addresses these limitations by examining the training dynamics of one-layer transformers. Under regularity assumptions, we establish that: (i) transformers attain near-optimal training loss during pre-training, demonstrating effective knowledge acquisition; (ii) given a sufficiently large fine-tuning dataset and appropriate data multiplicity conditions, transformers achieve low generalization error on factual knowledge acquired during pre-training but not revisited in fine-tuning, indicating robust knowledge extraction; and (iii) violation of these conditions leads to elevated generalization error, manifesting as hallucinations. Our analysis encompasses both full fine-tuning and low-rank fine-tuning, yielding insights into the efficacy of practical low-rank adaptation methods. We validate our theoretical findings through experiments on synthetic datasets and the real-world PopQA benchmark, employing GPT-2 and Llama-3.2-1B models.
AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs
We introduce AraFinNews, the largest publicly available Arabic financial news dataset to date, comprising 212,500 article-headline pairs spanning a decade of reporting from 2015 to 2025. Designed as an Arabic counterpart to major English summarisation corpora such as CNN/DailyMail, AraFinNews provides a realistic benchmark for evaluating domain-specific language understanding and generation in financial contexts. Using this resource, we investigate the impact of domain specificity on abstractive summarisation of Arabic financial texts with large language models (LLMs). In particular, we evaluate transformer-based models: mT5, AraT5, and the domain-adapted FinAraT5 to examine how financial-domain pretraining influences accuracy, numerical reliability, and stylistic alignment with professional reporting. Experimental results show that domain-adapted models generate more coherent summaries, especially in their handling of quantitative and entity-centric information. These findings highlight the importance of domain-specific adaptation for improving narrative fluency in Arabic financial summarisation. The dataset is freely available for non-commercial research at https://github.com/ArabicNLP-uk/AraFinNews.
comment: 9 pages
RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows ML4H'25
Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework that couples clinical priors with task-aware multimodal reasoning and encodes a radiologist-style workflow into a modular, auditable pipeline. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.
comment: ML4H'25; Work in progress
ShortageSim: Simulating Drug Shortages under Information Asymmetry AAAI 2026
Drug shortages pose critical risks to patient care and healthcare systems worldwide, yet the effectiveness of regulatory interventions remains poorly understood due to information asymmetries in pharmaceutical supply chains. We propose ShortageSim, which addresses this challenge by providing the first simulation framework that evaluates the impact of regulatory interventions on competition dynamics under information asymmetry. Using Large Language Model (LLM)-based agents, the framework models the strategic decisions of drug manufacturers and institutional buyers in response to shortage alerts given by the regulatory agency. Unlike traditional game theory models that assume perfect rationality and complete information, ShortageSim simulates heterogeneous interpretations of regulatory announcements and the resulting decisions. Experiments on a self-processed dataset of historical shortage events show that ShortageSim reduces the resolution lag for production disruption cases by up to 84%, achieving closer alignment to real-world trajectories than the zero-shot baseline. Our framework confirms the effect of regulatory alerts in addressing shortages and introduces a new method for understanding competition in multi-stage environments under uncertainty. We open-source ShortageSim and a dataset of 2,925 FDA shortage events, providing a novel framework for future research on policy design and testing in supply chains under information asymmetry.
comment: Accepted by AAAI 2026. Oral presentation. 25 pages
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark NeurIPS 2025
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce and focus narrowly on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason about, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69% and 57% respectively, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
comment: Accepted at NeurIPS 2025; Code and data available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU
Bridging Symbolic Control and Neural Reasoning in LLM Agents: The Structured Cognitive Loop
Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences. We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R-CCAM). At the core of SCL is Soft Symbolic Control, an adaptive governance mechanism that applies symbolic constraints to probabilistic inference, preserving neural flexibility while restoring the explainability and controllability of classical symbolic systems. Through empirical validation on multi-step conditional reasoning tasks, we demonstrate that SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability. These results address critical gaps in existing frameworks such as ReAct, AutoGPT, and memory-augmented approaches. Our contributions are threefold: (1) we situate SCL within the taxonomy of hybrid intelligence, differentiating it from prompt-centric and memory-only approaches; (2) we formally define Soft Symbolic Control and contrast it with neuro-symbolic AI; and (3) we derive three design principles for trustworthy agents: modular decomposition, adaptive symbolic governance, and transparent state management. We provide a complete open-source implementation demonstrating the R-CCAM loop architecture, alongside a live GPT-4o-powered travel planning agent. By connecting expert system principles with modern LLM capabilities, this work offers a practical and theoretically grounded path toward reliable, explainable, and governable AI agents.
comment: Polished the abstract and replaced the demonstration screenshots
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent's own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL, where a replay buffer stores good experience for off-policy update, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics upon convergence towards familiarity with the environment. We also combine bag-of-tricks of industrial RL optimizations for a strong baseline Dr.BoT to demonstrate our effectiveness. In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. In AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8% and 6.1%, respectively. Such gains incur only 10%-25% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
comment: 45 pages, 14 figures
CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents NeurIPS 2025
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, we propose Web-Shepherd, the first process reward model (PRM) that can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that Web-Shepherd achieves about 30 points better accuracy than GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite with GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10x less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
comment: NeurIPS 2025 Spotlight
Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs' cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
comment: 10 pages, 1 figure
From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice
Large Language Models (LLMs) have rapidly evolved from text-based systems to multimodal platforms, significantly impacting various sectors including healthcare. This comprehensive review explores the progression of LLMs to Multimodal Large Language Models (MLLMs) and their growing influence in medical practice. We examine the current landscape of MLLMs in healthcare, analyzing their applications across clinical decision support, medical imaging, patient engagement, and research. The review highlights the unique capabilities of MLLMs in integrating diverse data types, such as text, images, and audio, to provide more comprehensive insights into patient health. We also address the challenges facing MLLM implementation, including data limitations, technical hurdles, and ethical considerations. By identifying key research gaps, this paper aims to guide future investigations in areas such as dataset development, modality alignment methods, and the establishment of ethical guidelines. As MLLMs continue to shape the future of healthcare, understanding their potential and limitations is crucial for their responsible and effective integration into medical practice.
comment: 12 pages, 1 figure
A Psychology-based Unified Dynamic Framework for Curriculum Learning
Directly learning from examples of varying difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. Drawing inspiration from psychometrics, this paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF). We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a training strategy, Dynamic Data Selection via Model Ability Estimation (DDS-MAE), to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to aligned training data selection and faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained large language models with PUDF leads to higher accuracy and faster convergence on a suite of benchmark datasets compared to standard fine-tuning and state-of-the-art CL methods. Ablation studies and downstream analyses further validate the impact of PUDF for CL.
comment: Accepted for publication in Computational Linguistics. This is a pre-MIT Press publication version. Code available at https://github.com/nd-ball/cl-irt
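As a rough illustration of the IRT idea in the PUDF abstract above, the sketch below uses the one-parameter (Rasch) response model and keeps only training examples whose difficulty does not exceed the current ability estimate. Both the choice of the Rasch model and the selection rule are assumptions made for illustration; the abstract does not specify them, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the 1PL (Rasch) IRT model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def select_by_ability(difficulties: Sequence[float], ability: float) -> List[int]:
    """Indices of examples whose difficulty is at or below the ability estimate.

    Assumed selection rule for illustration only, not the paper's procedure.
    """
    return [i for i, b in enumerate(difficulties) if b <= ability]


# Toy difficulties on the same scale as the ability estimate.
difficulties = [-1.0, 0.5, 2.0]
selected = select_by_ability(difficulties, ability=0.5)
```

Because difficulty and ability share one scale in IRT, a single comparison is enough to decide which examples enter the next training step in this sketch.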
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input -- a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
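For readers unfamiliar with the Best-of-N strategy described in the abstract above, a minimal sketch of the inference-time selection step follows. The verifier here is a hypothetical toy scorer, and the candidate strings stand in for LLM samples; this shows only the argmax selection, not the BoN-aware fine-tuning the paper proposes.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score (first wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def toy_verifier(response: str) -> float:
    # Hypothetical scorer: rewards responses containing the string "42".
    return 1.0 if "42" in response else 0.0


# Stand-ins for N sampled model responses.
samples = ["the answer is 41", "the answer is 42", "no idea"]
best = best_of_n(samples, toy_verifier)
```

The `max` call is the non-differentiable argmax the abstract refers to: gradients cannot flow through the choice of which sample wins.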
Human-Computer Interaction
A User-customized and Untethered Electro-haptic Device for Immersive Human-Machine Interaction
Haptic feedback is essential for human-machine interaction, as it bridges physical and digital experiences and enables immersive engagement with virtual environments. However, current haptic devices are frequently tethered and lack portability and flexibility. They also have limited ability to deliver fine-grained, multi-dimensional feedback. To address these challenges, we present a flexible, ultra-thin, and user-customized electro-haptic device fabricated with soft materials and printable liquid metal ink. Its highly integrated and lightweight design minimizes interference with natural hand movements while maintaining reliable skin contact. By delivering finely controlled electrical stimulation through 15 electrodes, it can evoke a wide range of tactile sensations that cover diverse interaction scenarios. Our user study demonstrates that the device is comfortable to wear and capable of generating tunable, precise electro-haptic feedback, thereby significantly enhancing immersion and realism in human-machine interactions.
comment: 18 pages, 13 figures
Gated Uncertainty-Aware Runtime Dual Invariants for Neural Signal-Controlled Robotics NeurIPS 2025
Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification for neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system maintains a high safety rate of 94-97% even with lightweight decoder architectures that have low test accuracies (27-46%) and high confidence miscalibration (ECE of 0.22-0.41). We demonstrate 1.7x more correct interventions in simulated noise testing than at baseline. The monitor operates at 100 Hz with sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation and produces auditable traces from intent to plan to action, helping to link neural evidence to verifiable robot action.
comment: Embodied and Safe-Assured Robotic Systems workshop at NeurIPS 2025
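The GUARDIAN abstract above reports miscalibration as ECE (expected calibration error). A minimal sketch of the standard equal-width-bin ECE computation is given below; the bin count and the toy inputs are assumptions for illustration, and nothing here is taken from the paper's own evaluation code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        k = min(int(c * n_bins), n_bins - 1)
        conf_sum[k] += c
        hit_sum[k] += 1.0 if ok else 0.0
        count[k] += 1
    ece = 0.0
    for k in range(n_bins):
        if count[k]:
            gap = abs(hit_sum[k] / count[k] - conf_sum[k] / count[k])
            ece += (count[k] / n) * gap
    return ece


# Toy decoder: always 90% confident but right only 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In this toy case the decoder is overconfident by 0.15, which is the gap between its stated confidence (0.9) and its accuracy (0.75).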
DesignPref: Capturing Personal Preferences in Visual Design Generation
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
GUIDAETA -- A Versatile Interactions Dataset with extensive Context Information and Metadata
Interaction data is widely used in multiple domains such as cognitive science, visualization, human-computer interaction, and cybersecurity, among others. Applications range from cognitive analyses through user/behavior modeling, adaptation, and recommendations to (user/bot) identification/verification. Research on these applications, in particular those relying on learned models, requires copious amounts of structured data for both training and evaluation. Different application domains thereby impose different requirements. For some purposes it is vital that the data is based on a guided interaction process, meaning that monitored subjects pursued a given task, while other purposes require additional context information, such as widget interactions or metadata. Unfortunately, the number of publicly available datasets is small and their respective applicability for specific purposes limited. We present GUIDEd Interaction DATA (GUIDAETA) - a new dataset, collected from a large-scale guided user study with more than 250 users, each working on three pre-defined information retrieval tasks using a custom-built consumer information system. Besides being larger than most comparable datasets - with 716 completed tasks, 2.39 million mouse and keyboard events (2.35 million and 40 thousand, respectively) and a total observation period of almost 50 hours - its interactions exhibit encompassing context information in the form of widget information, triggered (system) events and associated displayed content. Combined with extensive metadata such as sociodemographic user data and answers to explicit feedback questionnaires (regarding perceived usability, experienced cognitive load, pre-knowledge on the information system's topic), GUIDAETA constitutes a versatile dataset, applicable for various research domains and purposes.
comment: 10 pages, 9 figures, 3 tables, presented at STAG 2025
How Robot Kinematics Influence Human Performance in Virtual Robot-to-Human Handover Tasks
Recent advancements in robotics have increased the possibilities for integrating robotic systems into human-involved workplaces, highlighting the need to examine and optimize human-robot coordination in collaborative settings. This study explores human-robot interactions during handover tasks using Virtual Reality (VR) to investigate differences in human motor performance across various task dynamics and robot kinematics. A VR-based robot handover simulation afforded safe and controlled assessments of human-robot interactions. In separate experiments, four potential influences on human performance were examined: (1) control over task initiation and robot movement synchrony (temporal and spatiotemporal); (2) partner appearance (human versus robotic); (3) robot velocity profiles (minimum jerk, constant velocity, constant acceleration, and biphasic); and (4) the timing of rotational object motion. Findings across experiments emphasize that humans benefit from robots providing early and salient visual information about task-relevant object motion, and highlight the advantages of human-like smooth robot trajectories. To varying degrees, these manipulations improved predictive accuracy and synchronization during interaction. This suggests that human-robot interactions should be designed to allow humans to leverage their natural capabilities for detecting biological motion, which conversely may reduce the need for costly robotic computations or added cognitive adaptation on the human side.
Human-computer interactions predict mental health
Scalable assessments of mental illness, the leading driver of disability worldwide, remain a critical roadblock toward accessible and equitable care. Here, we show that human-computer interactions encode multiple dimensions of self-reported mental health and their changes over time. We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA to predict 1.3 million mental-health self-reports from 20,000 cursor and touchscreen recordings collected from 9,000 online participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, generalizes across contexts, and achieves near-ceiling accuracy when predicting group-level mental health. The model translates from general to clinical populations, identifies individuals living with mental illness, and captures signatures of psychological function that are not conveyed by language. Our results demonstrate how everyday human-computer interactions can power passive, reliable, dynamic, and maximally scalable mental health assessments. The ability to decode mental states at zero marginal cost sets new benchmarks for precision medicine and public health, while raising important questions about privacy, agency, and autonomy online.
The Making of Digital Ghosts: Designing Ethical AI Afterlives
Advances in artificial intelligence now make it possible to simulate the dead through chatbots, voice clones, and video avatars trained on a person's digital traces. These "digital ghosts" are moving from fiction to commercial reality, reshaping how people mourn and remember. This paper offers a conceptual and ethical analysis of AI-mediated digital afterlives. We define what counts as a digital ghost, trace their rise across personal, commercial, and institutional contexts, and identify core ethical tensions around grief and well-being, truthfulness and deception, consent and posthumous privacy, dignity and misrepresentation, and the commercialization of mourning. To analyze these challenges, we propose a nine-dimensional taxonomy of digital afterlife technologies and, building on it, outline the features of an ethically acceptable digital ghost: premortem intent, mutual consent, transparent and limited data use, clear disclosure, restricted purposes and access, family or estate stewardship, and minimal behavioral agency. We argue for targeted regulation and professional guidelines to ensure that digital ghosts can aid remembrance without slipping into forms of deception.
Adaptive LLM Agents: Toward Personalized Empathetic Care
Current mental-health conversational systems are usually based on fixed, generic dialogue patterns. This paper proposes an adaptive framework based on large language models that aims to personalize therapeutic interaction according to a user's psychological state, quantified with the Acceptance of Illness Scale (AIS). The framework defines three specialized agents, L, M, and H, each linked to a different level of illness acceptance, and adjusts conversational behavior over time using continuous feedback signals. The AIS-stratified architecture is treated as a diegetic prototype placed in a plausible near-future setting and examined through the method of design fiction. By embedding the architecture in narrative scenarios, the study explores how such agents might influence access to care and therapeutic relationship. The goal is to show how clinically informed personalization, technical feasibility, and speculative scenario analysis can together inform the responsible design of LLM-based companions for mental-health support.
comment: Accepted at workshop Future Wellbeing: Using Design Fiction to Explore Human-Agent Interaction and Mental Health at The 13th International Conference on Human-Agent Interaction (HAI 2025)
"Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents AAAI 2026
Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
comment: This work has been accepted to appear at the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Editing with AI: How Doctors Refine LLM-Generated Answers to Patient Queries
Patients frequently seek information during their medical journeys, but the rising volume of digital patient messages has strained healthcare systems. Large language models (LLMs) offer promise in generating draft responses for clinicians, yet how physicians refine these drafts remains underexplored. We present a mixed-methods study with nine ophthalmologists answering 144 cataract surgery questions across three conditions: writing from scratch, directly editing LLM drafts, and instruction-based indirect editing. Our quantitative and qualitative analyses reveal that while LLM outputs were generally accurate, occasional errors and automation bias revealed the need for human oversight. Contextualization--adapting generic answers to local practices and patient expectations--emerged as a dominant form of editing. Editing workflows revealed trade-offs: indirect editing reduced effort but introduced errors, while direct editing ensured precision but with higher workload. We conclude with design and policy implications for building safe, scalable LLM-assisted clinical communication systems.
comment: 9 pages, 2 figures, 1 table
Can You Keep Calm?: Adaptive Gameplay using Heart Rate as a Controller
Serious games for health are designed with specific health objectives and are increasingly being used in mental health interventions. Leveraging sensor-equipped handheld and wearable devices such as smartphones and smartwatches, these games can provide accessible and engaging therapeutic environments. This study introduces a heart rate (HR) controlled game that helps players manage stress by adjusting gameplay according to their biometric feedback. We aimed to determine how HR-based controls influence players' experience and whether they can be used to reduce stress. Findings from a controlled experiment revealed that HR-controlled gameplay reduced negative emotions and increased positive ones. Players also exhibited relatively less cardiac reactivity in HR-adaptive, target-based gameplay. This highlights the promise of gamified digital environments based on biometric feedback for providing accessible mental health support.
NOIR 2.0: Neural Signal Operated Intelligent Robots for Everyday Activities
The Neural Signal Operated Intelligent Robots (NOIR) system is a versatile brain-robot interface that allows humans to control robots for daily tasks using their brain signals. This interface utilizes electroencephalography (EEG) to translate human intentions regarding specific objects and desired actions directly into commands that robots can execute. We present NOIR 2.0, an enhanced version of NOIR. NOIR 2.0 includes faster and more accurate brain decoding algorithms, which reduce task completion time by 46%. NOIR 2.0 uses few-shot robot learning algorithms to adapt to individual users and predict their intentions. The new learning algorithms leverage foundation models for more sample-efficient learning and adaptation (15 demos vs. a single demo), significantly reducing overall human time by 65%.
comment: Conference on Robot Learning (CoRL 2024), CoRoboLearn
Symbiotic Brain-Machine Drawing via Visual Brain-Computer Interfaces
Brain-computer interfaces (BCIs) are evolving from research prototypes into clinical, assistive, and performance enhancement technologies. Despite the rapid rise and promise of implantable technologies, there is a need for better and more capable wearable and non-invasive approaches whilst also minimising hardware requirements. We present a non-invasive BCI for mind-drawing that iteratively infers a subject's internal visual intent by adaptively presenting visual stimuli (probes) on a screen, encoded at different flicker frequencies, and analysing the resulting steady-state visual evoked potentials (SSVEPs). Gabor-inspired or machine-learned policies dynamically update the spatial placement of the visual probes on the screen to explore the image space and reconstruct simple imagined shapes within approximately two minutes or less using just single-channel EEG data. Additionally, by leveraging stable diffusion models, reconstructed mental images can be transformed into realistic and detailed visual representations. Whilst we expect that similar results might be achievable with e.g. eye-tracking techniques, our work shows that symbiotic human-AI interaction can significantly increase BCI bit-rates by more than a factor of 5, providing a platform for future development of AI-augmented BCI.
Beyond the Legal Lens: A Sociotechnical Taxonomy of Lived Privacy Incidents and Harms
To understand how privacy incidents lead to harms, HCI researchers have historically leveraged legal frameworks. However, these frameworks expect acute, tangible harms and thus may not cover the full range of human experience relevant to modern-day digital privacy. To address this gap, our research builds upon these existing frameworks to develop a more comprehensive representation of people's lived experiences with privacy harms. We analyzed 369 privacy incidents reported by individuals from the general public. We found a broader range of privacy incidents and harms than accounted for in existing legal frameworks. The majority of reported privacy harms were not based on tangible harm, but on fear and loss of psychological safety. We also characterize the actors, motives, and information associated with various incidents. This work contributes a new framework for understanding digital privacy harms that can be utilized both in research and practice.
CHiQPM: Calibrated Hierarchical Interpretable Image Classification NeurIPS 2025
Globally interpretable models are a promising approach for trustworthy AI in safety-critical domains. Alongside global explanations, detailed local explanations are a crucial complement to effectively support human experts during inference. This work proposes the Calibrated Hierarchical QPM (CHiQPM), which offers uniquely comprehensive global and local interpretability, paving the way for human-AI complementarity. CHiQPM achieves superior global interpretability by contrastively explaining the majority of classes and offers novel hierarchical explanations that are more similar to how humans reason and can be traversed to offer a built-in interpretable conformal prediction (CP) method. Our comprehensive evaluation shows that CHiQPM achieves state-of-the-art accuracy as a point predictor, retaining 99% of the accuracy of non-interpretable models. This demonstrates a substantial improvement, where interpretability is incorporated without sacrificing overall accuracy. Furthermore, its calibrated set prediction is competitive in efficiency with other CP methods, while providing interpretable predictions of coherent sets along its hierarchical explanation.
comment: Accepted to NeurIPS 2025
InvisibleBench: A Deployment Gate for Caregiving Relationship AI
InvisibleBench is a deployment gate for caregiving-relationship AI, evaluating 3-20+ turn interactions across five dimensions: Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory. The benchmark includes autofail conditions for missed crises, medical advice (WOPR Act), harmful information, and attachment engineering. We evaluate four frontier models across 17 scenarios (N=68) spanning three complexity tiers. All models show significant safety gaps (11.8-44.8 percent crisis detection), indicating the necessity of deterministic crisis routing in production systems. DeepSeek Chat v3 achieves the highest overall score (75.9 percent), while strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent). We release all scenarios, judge prompts, and scoring configurations with code. InvisibleBench extends single-turn safety tests by evaluating longitudinal risk, where real harms emerge. No clinical claims; this is a deployment-readiness evaluation.
comment: 29 pages, 3 figures
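The gating idea in the InvisibleBench abstract, per-dimension scores combined with autofail conditions that override them, could be sketched roughly as follows. This is an illustration only, not the benchmark's released scoring code: the function name `gate`, the equal weighting of the five dimensions, and the 0.7 pass mark are all assumptions.

```python
from statistics import mean

# The five dimensions named in the abstract (identifiers are hypothetical).
DIMENSIONS = ("safety", "compliance", "trauma_informed", "belonging", "memory")

def gate(scores, autofail_flags, threshold=0.7):
    """Return (passed, overall) for one scenario.

    scores: dict mapping each dimension to a value in [0, 1].
    autofail_flags: iterable of triggered autofail conditions
        (e.g. "missed_crisis"); any entry fails the scenario outright.
    threshold: hypothetical pass mark, not taken from the paper.
    """
    flags = list(autofail_flags)
    if flags:
        # An autofail overrides every dimension score.
        return False, 0.0
    overall = mean(scores[d] for d in DIMENSIONS)
    return overall >= threshold, overall
```

Checking autofail conditions before any averaging means a high score on other dimensions cannot mask a missed crisis.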
Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk of misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.
comment: 16 pages, 11 figures
Understanding Human-Chatbot Romance: A Qualitative and Quantitative Study on Romantic Fantasy and Other Interpersonal Characteristics
LLM-based chatbots are now being specifically designed to facilitate social companionship, even romantic relationships, incorporating features that parallel human relationship dynamics. This has led a subset of users to form romantic relationships with chatbots. Understanding which interpersonal characteristics drive individuals to form intense, emotional bonds with chatbots is crucial for comprehending the potential psychological and societal impacts of romantic human-chatbot relationships. This mixed-methods study investigates psychological predictors of relationship intensity among individuals currently in romantic relationships with chatbots. Romantic and sexual fantasy, promising constructs not previously investigated in this context, are examined alongside previously discussed factors (loneliness, anthropomorphism, attachment orientation, and sexual sensation seeking). In Study 1, quantitative data from individuals with chatbot partners (N=92) showed that romantic fantasy explained the most variance in relationship intensity, with additional contributions from anthropomorphism and avoidant attachment. Contrary to expectations, the other predictors, including loneliness, did not significantly predict intensity. In Study 2, 15 qualitative interviews illuminated how users employ romantic fantasy to enhance their relationships, describing active fantasy use to shape interactions and a desire for their chatbot to feel as human as possible. This study provides the first quantitative sample of this under-researched population, explaining who might form more intense romantic relationships with chatbots.
comment: 44 pages, 1 figure, presented at the International Communication Association 2025, Draft from the 25-11-2025
Generative AI for Cel-Animation: A Survey ICCV 2025
The traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation
comment: Accepted by ICCV 2025 AISTORY Workshop
MobileViews: A Million-scale and Diverse Mobile GUI Dataset
Visual language models (VLMs) empower mobile GUI agents to interpret complex mobile screens and respond to user requests. Training such capable agents requires large-scale, high-quality mobile GUI data. However, existing mobile GUI datasets are limited in scale, data comprehensiveness, and fidelity. To overcome this, we utilize two mobile SoC clusters to provide over 200 native, high-fidelity mobile environments, along with a VLM-enhanced automatic application traversal framework for highly parallel, automated dataset collection with minimal human intervention. With this system, we propose MobileViews, a million-scale mobile GUI dataset comprising over 1.2 million unique screenshot-view hierarchy pairs from more than 30K modern Android applications. We assess the effectiveness of MobileViews by training four VLMs using the reinforcement learning-based GUI grounding task and evaluating them on two representative GUI grounding benchmarks. Results show that MobileViews significantly enhances grounding accuracy by up to 6.1%. Further analysis of data scale and quality underscores the critical role of large, high-quality datasets as reliable sources for training mobile GUI agents. The MobileViews dataset is publicly available at https://huggingface.co/datasets/mllmTeam/MobileViews.
comment: Dataset: https://huggingface.co/datasets/mllmTeam/MobileViews
Computer Vision and Pattern Recognition
LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context
Physically-based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent advances in generating PBR textures, existing methods fail to address two fundamental challenges: 1) materials decomposition from image prompts under limited illumination cues, and 2) seamless and view-consistent texture completion. To this end, we propose LumiTex, an end-to-end framework that comprises three key components: (1) a multi-branch generation scheme that disentangles albedo and metallic-roughness under shared illumination priors for robust material understanding, (2) a lighting-aware material attention mechanism that injects illumination context into the decoding process for physically grounded generation of albedo, metallic, and roughness maps, and (3) a geometry-guided inpainting module based on a large view synthesis model that enriches texture coverage and ensures seamless, view-consistent UV completion. Extensive experiments demonstrate that LumiTex achieves state-of-the-art performance in texture quality, surpassing both existing open-source and commercial methods.
comment: Project page: https://lumitex.vercel.app
VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
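The step of converting (caption, score) trajectories into preference tuples might look something like the sketch below. The abstract does not state the pairing rule, so the one used here (best caption against every strictly lower-scoring caption) is an assumption, as is the name `to_preference_pairs` and the use of the score gap as a difficulty signal.

```python
def to_preference_pairs(trajectory):
    """Turn one video's [(caption, score), ...] into (chosen, rejected, gap) tuples.

    Assumed pairing rule: the highest-scoring caption is "chosen" and every
    strictly lower-scoring caption is "rejected". The gap is the score
    difference between the two.
    """
    if len(trajectory) < 2:
        return []
    best_caption, best_score = max(trajectory, key=lambda cs: cs[1])
    return [
        (best_caption, caption, best_score - score)
        for caption, score in trajectory
        if score < best_score
    ]
```

Under the further assumption that a larger gap marks an easier pair, sorting the tuples by descending gap would give one possible easy-to-hard ordering for curriculum training.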
Are Image-to-Video Models Good Zero-Shot Image Editors?
Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
comment: technical report
Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts ICLR 2025
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
comment: ICLR 2025 DeLTa workshop
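The expert-switching sampler described in this abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions: `quality_step` and `likelihood_step` are hypothetical callables standing in for one denoising update of each pretrained expert, and `switch_t` is the single intermediate switching point the abstract mentions.

```python
def sample_with_switch(x, timesteps, quality_step, likelihood_step, switch_t):
    """Run one denoising trajectory, switching experts at switch_t.

    timesteps: noise levels ordered from high noise to low noise.
    quality_step / likelihood_step: callables (x, t) -> x, placeholders
        for one denoising update of each pretrained expert.
    switch_t: noise level at or below which the likelihood expert takes over.
    """
    for t in timesteps:
        # High noise: image-quality expert shapes global structure.
        # Low noise: likelihood expert refines pixel statistics.
        step = quality_step if t > switch_t else likelihood_step
        x = step(x, t)
    return x
```

Because only the choice of callable changes per step, neither expert needs retraining, which matches the plug-and-play claim in the abstract.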
Mixture of Horizons in Action Chunking
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and the one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains in both simulation and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
comment: 15 pages, 14 figures
Cloud4D NeurIPS 2025
There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\%$) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.
comment: NeurIPS 2025 Spotlight, project page: https://cloud4d.jacob-lin.com/
Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution AAAI 2026
Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
comment: Accepted to AAAI 2026 (Oral). The code is available at https://github.com/H-EmbodVis/GRANT
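The efficiency gain from parallelizable subtasks, as in the sink-and-microwave example from the abstract, can be shown with a small worked example. The model below is a deliberate simplification and not the ORS3D formulation: each subtask is assumed to have an `active` phase that occupies the agent and an unattended `wait` phase, and all durations are made up.

```python
def makespan(order):
    """Completion time when the agent handles subtasks in the given order.

    Each subtask is (name, active, wait): the agent is busy for `active`
    minutes, then the subtask runs unattended for `wait` minutes.
    """
    clock = 0   # time at which the agent becomes free again
    finish = 0  # latest finishing time over all subtasks so far
    for _name, active, wait in order:
        clock += active
        finish = max(finish, clock + wait)
    return finish

def schedule(subtasks):
    """Start the subtasks with the longest unattended phase first."""
    return sorted(subtasks, key=lambda s: s[2], reverse=True)

# Hypothetical durations in minutes.
tasks = [("clean sink", 4, 0), ("microwave", 1, 5)]
print(makespan(tasks))            # sink first, then microwave: 10
print(makespan(schedule(tasks)))  # microwave first, sink during the wait: 6
```

Starting the microwave first lets the sink cleaning overlap with the unattended heating time, cutting the total from 10 to 6 minutes in this toy case.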
Flow Map Distillation Without Data
State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.
comment: Code: https://github.com/FudanCVL/Ref-SAM3D
SAM3-Adapter: Efficient Adaptation of Segment Anything 3 for Camouflage Object Segmentation, Shadow Detection, and Medical Image Segmentation
The rapid rise of large-scale foundation models has reshaped the landscape of image segmentation, with models such as Segment Anything achieving unprecedented versatility across diverse vision tasks. However, previous generations, including SAM and its successor, still struggle with fine-grained, low-level segmentation challenges such as camouflaged object detection, medical image segmentation, cell image segmentation, and shadow detection. To address these limitations, we originally proposed SAM-Adapter in 2023, demonstrating substantial gains on these difficult scenarios. With the emergence of Segment Anything 3 (SAM3), a more efficient and higher-performing evolution with a redesigned architecture and improved training pipeline, we revisit these long-standing challenges. In this work, we present SAM3-Adapter, the first adapter framework tailored for SAM3 that unlocks its full segmentation capability. SAM3-Adapter not only reduces computational overhead but also consistently surpasses both SAM- and SAM2-based solutions, establishing new state-of-the-art results across multiple downstream tasks, including medical imaging, camouflaged (concealed) object segmentation, and shadow detection. Built upon the modular and composable design philosophy of the original SAM-Adapter, SAM3-Adapter provides stronger generalizability, richer task adaptability, and significantly improved segmentation precision. Extensive experiments confirm that integrating SAM3 with our adapter yields superior accuracy, robustness, and efficiency compared to all prior SAM-based adaptations. We hope SAM3-Adapter can serve as a foundation for future research and practical segmentation applications. Code, pre-trained models, and data processing pipelines are available.
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens, compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
comment: Project page: https://wakalsprojectpage.github.io/comt-website/
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
In-Video Instructions: Visual Signals as Generative Control
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.
BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation
Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by designing better architectures, loss functions, or data augmentation schemes, or by collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single "background" class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous, composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models. In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, underscoring BackSplit's robustness, simplicity, and broad applicability.
UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval
Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance checking. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. A comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art vision encoders, representing a fundamental advance in the expressiveness of the UI representation. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5 ms median latency (P95: 124 ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinction impossible with vision-only approaches.
comment: 12 pages, 2 figures, 3 algorithms, 4 tables
An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification
Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that is central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e., we measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures through a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. To decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining an FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.
comment: Project Page: https://zehong-ma.github.io/DeCo. Code Repository: https://github.com/Zehong-Ma/DeCo
Growing with the Generator: Self-paced GRPO for Video Generation
Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
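As a rough illustration of the two ingredients named in the abstract above, the sketch below computes group-relative advantages in the usual GRPO way (each reward normalised by the mean and standard deviation of its sampling group) and blends three reward terms with a competence-dependent weight. The blending rule, the `competence` scalar, and both function names are assumptions made for illustration; the abstract does not specify the actual schedule.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally common choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def self_paced_reward(fidelity, coherence, alignment, competence):
    """Hypothetical blend: as competence goes 0 -> 1, weight moves from
    coarse visual fidelity to temporal coherence and text-video alignment."""
    w_fidelity = 1.0 - competence
    w_other = competence / 2.0
    return w_fidelity * fidelity + w_other * coherence + w_other * alignment


# One prompt, three sampled videos with scalar rewards.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
early = self_paced_reward(fidelity=1.0, coherence=0.0, alignment=0.0, competence=0.0)
late = self_paced_reward(fidelity=0.0, coherence=1.0, alignment=1.0, competence=1.0)
```

With a static reward, the weights would stay fixed for the whole run; the self-paced variant changes only how the scalar reward is formed, leaving the group normalisation untouched.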
CellFMCount: A Fluorescence Microscopy Dataset, Benchmark, and Methods for Cell Counting ICDM
Accurate cell counting is essential in various biomedical research and clinical applications, including cancer diagnosis, stem cell research, and immunology. Manual counting is labor-intensive and error-prone, motivating automation through deep learning techniques. However, training reliable deep learning models requires large amounts of high-quality annotated data, which is difficult and time-consuming to produce manually. Consequently, existing cell-counting datasets are often limited, frequently containing fewer than $500$ images. In this work, we introduce a large-scale annotated dataset comprising $3{,}023$ images from immunocytochemistry experiments related to cellular differentiation, containing over $430{,}000$ manually annotated cell locations. The dataset presents significant challenges: high cell density, overlapping and morphologically diverse cells, a long-tailed distribution of cell count per image, and variation in staining protocols. We benchmark three categories of existing methods: regression-based, crowd-counting, and cell-counting techniques on a test set with cell counts ranging from $10$ to $2{,}126$ cells per image. We also evaluate how the Segment Anything Model (SAM) can be adapted for microscopy cell counting using only dot-annotated datasets. As a case study, we implement a density-map-based adaptation of SAM (SAM-Counter) and report a mean absolute error (MAE) of $22.12$, which outperforms existing approaches (second-best MAE of $27.46$). Our results underscore the value of the dataset and the benchmarking framework for driving progress in automated cell counting and provide a robust foundation for future research and development.
comment: The IEEE International Conference on Data Mining (ICDM) 2025
Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning
Reinforcement learning (RL) methods (e.g., GRPO) for improving the perception ability of multimodal LLMs (MLLMs) have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. To tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) a data server and (2) a GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experimental results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin and achieves significantly better performance than existing MLLM perception methods. Syn-GRPO also shows promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.
POUR: A Provably Optimal Method for Unlearning Representations via Neural Collapse
In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower-dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with a closed-form variant (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics.
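The geometric claim in the abstract above (projecting a simplex ETF orthogonally to one of its vectors leaves a simplex ETF on the remaining classes) can be checked numerically. The sketch below is only that check, written with the standard construction of a K-class simplex ETF; it is not the POUR-P operator itself, and all names are hypothetical.

```python
import math


def simplex_etf(k):
    """k unit vectors in R^k with pairwise inner product -1/(k-1)."""
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(vectors, forget):
    """Project each vector onto the orthogonal complement of the unit vector `forget`."""
    return [[a - dot(v, forget) * f for a, f in zip(v, forget)] for v in vectors]


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


k = 5
etf = simplex_etf(k)
# "Forget" the last class by removing its direction from the other class means.
retained = project_out(etf[:-1], etf[-1])
cosines = [
    cosine(retained[i], retained[j])
    for i in range(k - 1)
    for j in range(i + 1, k - 1)
]
norms = [math.sqrt(dot(v, v)) for v in retained]
```

For K classes the pairwise cosine of a simplex ETF is -1/(K-1); after the projection the K-1 retained vectors have equal norms and pairwise cosine -1/(K-2), which is exactly the simplex ETF geometry for K-1 classes.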
MonoMSK: Monocular 3D Musculoskeletal Dynamics Estimation
Reconstructing biomechanically realistic 3D human motion - recovering both kinematics (motion) and kinetics (forces) - is a critical challenge. While marker-based systems are lab-bound and slow, popular monocular methods use oversimplified, anatomically inaccurate models (e.g., SMPL) and ignore physics, fundamentally limiting their biomechanical fidelity. In this work, we introduce MonoMSK, a hybrid framework that bridges data-driven learning and physics-based simulation for biomechanically realistic 3D human motion estimation from monocular video. MonoMSK jointly recovers both kinematics (motions) and kinetics (forces and torques) through an anatomically accurate musculoskeletal model. By integrating transformer-based inverse dynamics with differentiable forward kinematics and dynamics layers governed by ODE-based simulation, MonoMSK establishes a physics-regulated inverse-forward loop that enforces biomechanical causality and physical plausibility. A novel forward-inverse consistency loss further aligns motion reconstruction with the underlying kinetic reasoning. Experiments on BML-MoVi, BEDLAM, and OpenCap show that MonoMSK significantly outperforms state-of-the-art methods in kinematic accuracy, while for the first time enabling precise monocular kinetics estimation.
SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.
comment: 10 pages, with supp
SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis
Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
comment: Project Page: https://droliven.github.io/SyncMV4D
Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach
Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.
Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection
Infrared small target detection remains challenging due to limited feature representation and severe background interference, resulting in sub-optimal performance. While recent CLIP-inspired methods attempt to leverage textual guidance for detection, they are hindered by inaccurate text descriptions and reliance on manual annotations. To overcome these limitations, we propose DGSPNet, an end-to-end language prompt-driven framework. Our approach integrates dual-granularity semantic prompts: coarse-grained textual priors (e.g., 'infrared image', 'small target') and fine-grained personalized semantic descriptions derived through visual-to-textual mapping within the image space. This design not only facilitates learning fine-grained semantic information but can also inherently leverage language prompts during inference without requiring any annotations. By fully leveraging the precision and conciseness of text descriptions, we further introduce a text-guide channel attention (TGCA) mechanism and a text-guide spatial attention (TGSA) mechanism that enhance the model's sensitivity to potential targets across both low- and high-level feature spaces. Extensive experiments demonstrate that our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.
comment: 10 pages, 2 figures
IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection
Monocular 3D detection relies on just a single camera and is therefore easy to deploy. Yet, achieving reliable 3D understanding from monocular images requires substantial annotation, and 3D labels are especially costly. To maximize performance under constrained labeling budgets, it is essential to prioritize annotating samples expected to deliver the largest performance gains. This prioritization is the focus of active learning. We observe two significant limitations in active learning algorithms for monocular 3D object detection. First, previous approaches select entire images, which is inefficient, as non-informative instances contained in the same image also need to be labeled. Second, existing methods rely on uncertainty-based selection, which in monocular 3D object detection creates a bias toward depth ambiguity. Consequently, distant objects are selected, while nearby objects are overlooked. To address these limitations, we propose IDEAL-M3D, the first instance-level pipeline for monocular 3D detection. For the first time, we demonstrate that an explicitly diverse, fast-to-train ensemble improves diversity-driven active learning for monocular 3D detection. We induce diversity with heterogeneous backbones and task-agnostic features, loss weight perturbation, and time-dependent bagging. IDEAL-M3D shows superior performance and significant resource savings: with just 60% of the annotations, we achieve similar or better AP3D on the KITTI validation and test sets compared to training the same detector on the whole dataset.
DensifyBeforehand: LiDAR-assisted Content-aware Densification for Efficient and Quality 3D Gaussian Splatting
This paper addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods, particularly their reliance on adaptive density control, which can lead to floating artifacts and inefficient resource usage. We propose a novel densify beforehand approach that enhances the initialization of 3D scenes by combining sparse LiDAR data with monocular depth estimation from corresponding RGB images. Our ROI-aware sampling scheme prioritizes semantically and geometrically important regions, yielding a dense point cloud that improves visual fidelity and computational efficiency. This densify beforehand approach bypasses the adaptive density control that may introduce redundant Gaussians in the original pipeline, allowing the optimization to focus on the other attributes of 3D Gaussian primitives, reducing overlap while enhancing visual quality. Our method achieves comparable results to state-of-the-art techniques while significantly lowering resource consumption and training time. We validate our approach through extensive comparisons and ablation studies on four newly collected datasets, showcasing its effectiveness in preserving regions of interest in complex scenes.
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
Diffusion Reconstruction-based Data Likelihood Estimation for Core-Set Selection AAAI 2026
Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may hinder the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score reveals informative insights in data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.
comment: Accepted by AAAI 2026
BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and by the lack of disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs (one for the condition and one for the text) to reduce gradient entanglement. The influence of the pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach using the COCO dataset. Project page: https://limuloo.github.io/BideDPO/.
comment: 29 pages
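To make the decoupling idea in the BideDPO abstract above concrete, the sketch below writes out the standard DPO loss for one preference pair and then sums two such losses, one for a text pair and one for a condition pair. The fixed weights `w_text` and `w_cond` are placeholders: the abstract's Adaptive Loss Balancing strategy is not described in enough detail to reproduce, and all function names are hypothetical.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Arguments are log-likelihoods of the two samples under the trained
    policy and under the frozen reference model.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


def decoupled_loss(text_pair, cond_pair, w_text=0.5, w_cond=0.5, beta=0.1):
    """Sum of two independent DPO terms (weights are illustrative constants)."""
    return (w_text * dpo_loss(*text_pair, beta=beta)
            + w_cond * dpo_loss(*cond_pair, beta=beta))


# Each tuple: (logp_win, logp_lose, ref_logp_win, ref_logp_lose)
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
combined = decoupled_loss((-1.0, -2.0, -1.5, -1.5), (0.0, 0.0, 0.0, 0.0))
```

Keeping the text pair and the condition pair in separate loss terms means each term's gradient depends on only one kind of preference, which is the motivation the abstract gives for reducing gradient entanglement.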
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs) do the same? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance for 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, LAST yields a 15.8% gain on EgoSchema with GPT-4o in a zero-shot manner and an 8.3 gain on VSI-Bench compared with Qwen2.5-VL-7B.
Adversarial Patch Attacks on Vision-Based Cargo Occupancy Estimation via Differentiable 3D Simulation
Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.
comment: 9 pages, 5 figures, 1 algorithm
FedPoisonTTP: A Threat Model and Poisoning Attack for Federated Test-Time Personalization
Test-time personalization in federated learning enables models at clients to adjust online to local domain shifts, enhancing robustness and personalization in deployment. Yet, existing federated learning work largely overlooks the security risks that arise when local adaptation occurs at test time. Heterogeneous domain arrivals, diverse adaptation algorithms, and limited cross-client visibility create vulnerabilities where compromised participants can craft poisoned inputs and submit adversarial updates that undermine both global and per-client performance. To address this threat, we introduce FedPoisonTTP, a realistic grey-box attack framework that explores test-time data poisoning in the federated adaptation setting. FedPoisonTTP distills a surrogate model from adversarial queries, synthesizes in-distribution poisons using feature-consistency, and optimizes attack objectives to generate high-entropy or class-confident poisons that evade common adaptation filters. These poisons are injected during local adaptation and spread through collaborative updates, leading to broad degradation. Extensive experiments on corrupted vision benchmarks show that compromised participants can substantially diminish overall test-time performance.
comment: 13 pages, 3 figures, 2 tables
IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes
Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation. Although recent methods achieve high-fidelity reconstructions, they either rely on costly human annotations for object trajectories or use time-varying representations without explicit object-level decomposition, leading to intertwined static and dynamic elements that hinder scene separation. We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without requiring human annotations. Our key insight is to model dynamic objects as coherent instances undergoing rigid transformations, rather than unstructured time-varying primitives. For instance decomposition, we employ zero-shot, language-grounded video tracking anchored to 3D using lidar, and estimate consistent poses via feature correspondences. We introduce a coordinated-turn smoothing scheme to obtain temporally and physically consistent motion trajectories, mitigating pose misalignments and tracking failures, followed by joint optimization of object poses and Gaussian parameters. Experiments on the Waymo Open Dataset demonstrate that our method achieves competitive reconstruction quality while maintaining instance-level decomposition and generalizes across diverse sequences and view densities without retraining, making it practical for large-scale autonomous driving applications. Code will be released.
Learning Plug-and-play Memory for Guiding Video Diffusion Models
Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder, DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and only optimize the memory encoder. This yields an efficient training process with few trainable parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
comment: Accepted at the Workshop on Multimodal Representation Learning for Healthcare (MMRL4H), EurIPS 2025
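The drops quoted above are in percentage points (pp), meaning absolute differences between two accuracies rather than relative changes. A minimal sketch of the arithmetic, using the GPT-4o figures from the abstract; the function names are illustrative and not taken from the paper:

```python
def percentage_point_drop(acc_with_image: float, acc_blank: float) -> float:
    """Absolute accuracy drop in percentage points (pp)."""
    return round(acc_with_image - acc_blank, 1)


def relative_drop(acc_with_image: float, acc_blank: float) -> float:
    """Relative drop, as a percentage of the original accuracy."""
    return round(100.0 * (acc_with_image - acc_blank) / acc_with_image, 1)


# GPT-4o accuracies reported in the abstract: 83.2% with images, 55.3% with blanks.
gpt4o_pp = percentage_point_drop(83.2, 55.3)
gpt4o_rel = relative_drop(83.2, 55.3)
print(gpt4o_pp, gpt4o_rel)
```

The two numbers answer different questions: the pp drop is what the abstract reports, while the relative drop expresses the same loss as a fraction of the with-image accuracy.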
ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment AAAI 2026
Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
comment: Accepted by AAAI 2026
NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging Tensor Cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utilizing a combination of our instanced rasterizer and occlusion culling MLP, and exhibits complementary properties to existing LoD techniques.
comment: 15 pages, 13 figures
Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curate a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
CLASH: A Benchmark for Cross-Modal Contradiction Detection
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
comment: First two authors contributed equally
Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks IROS
Surgical planning and training based on machine learning require a large number of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if possible at all, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
comment: 6 pages, 4 figures, 1 table, IEEE International Conference on Intelligent Robots and Systems (IROS)
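For readers unfamiliar with the metric named in the abstract above, intersection over union (IoU) for binary segmentation masks can be sketched as follows. This is a generic illustration, not the authors' evaluation code; the nested-list mask representation and the convention for two empty masks are assumptions.

```python
from typing import Sequence


def iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks with identical shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p or t:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [1, 1, 0]]
print(iou(pred, target))  # 2 shared pixels out of 4 in the union
```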
SpectraNet: FFT-assisted Deep Learning Classifier for Deepfake Face Detection
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
comment: 4 pages, 3 figures
nnActive: A Framework for Evaluation of Active Learning in 3D Biomedical Segmentation
Semantic segmentation is crucial for various biomedical applications, yet its reliance on large annotated datasets presents a bottleneck due to the high cost and specialized expertise required for manual labeling. Active Learning (AL) aims to mitigate this challenge by querying only the most informative samples, thereby reducing annotation effort. However, in the domain of 3D biomedical imaging, there is no consensus on whether AL consistently outperforms Random sampling. Four evaluation pitfalls hinder the current methodological assessment: (1) restriction to too few datasets and annotation budgets, (2) using 2D models on 3D images without partial annotations, (3) a Random baseline that is not adapted to the task, and (4) measuring annotation cost only in voxels. In this work, we introduce nnActive, an open-source AL framework that overcomes these pitfalls by (1) conducting a large-scale study spanning four biomedical imaging datasets and three label regimes, (2) extending nnU-Net to train on partial annotations with 3D patch-based query selection, (3) proposing Foreground Aware Random sampling strategies that tackle the foreground-background class imbalance of medical images, and (4) proposing the foreground efficiency metric, which captures the low annotation cost of background regions. We reveal the following findings: (A) while all AL methods outperform standard Random sampling, none reliably surpasses an improved Foreground Aware Random sampling; (B) the benefits of AL depend on task-specific parameters; (C) Predictive Entropy is overall the best-performing AL method, but likely requires the most annotation effort; (D) AL performance can be improved with more compute-intensive design choices. As a holistic, open-source framework, nnActive can serve as a catalyst for research and application of AL in 3D biomedical imaging. Code is at: https://github.com/MIC-DKFZ/nnActive
comment: Accepted at TMLR
Evaluating Deep Learning and Traditional Approaches Used in Source Camera Identification
One of the most important tasks in computer vision is identifying the device with which an image was taken, which facilitates further comprehensive analysis of the image. This paper presents a comparative analysis of three techniques used in source camera identification (SCI): Photo Response Non-Uniformity (PRNU), JPEG compression artifact analysis, and convolutional neural networks (CNNs). It evaluates each method in terms of device classification accuracy. Furthermore, the research discusses the scientific development needed to implement the methods in real-life scenarios.
comment: 4 figures
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation, which serves as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy and rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction.
comment: Project page: https://m3phist0.github.io/MetroGS
Test-Time Preference Optimization for Image Restoration AAAI26
Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.
comment: Accepted by AAAI26
From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline for comparison. Experimental results show that the YOLO detector obtains a mean Average Precision (mAP@0.5) of 0.71 across nine garment categories. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
comment: Submitted to Expert Systems with Applications
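The dominant color step mentioned in the abstract above can be illustrated with a small k-means sketch. This is a generic, dependency-free version under stated assumptions, not the paper's implementation: the function name, the deterministic initialization from evenly spaced pixels, the fixed iteration count, and the choice of the largest cluster's centroid as the "dominant" color are all illustrative.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]


def _dist2(a: Color, b: Color) -> float:
    """Squared Euclidean distance between two RGB colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _assign(pixels: List[Color], centroids: List[Color]) -> List[List[Color]]:
    """Group each pixel with its nearest centroid."""
    clusters: List[List[Color]] = [[] for _ in centroids]
    for p in pixels:
        nearest = min(range(len(centroids)), key=lambda i: _dist2(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def dominant_color(pixels: List[Color], k: int = 2, iters: int = 10) -> Color:
    """Return the centroid of the largest k-means cluster."""
    if len(pixels) < k:
        raise ValueError("need at least k pixels")
    # Assumed initialization: evenly spaced pixels, so the result is deterministic.
    step = len(pixels) // k
    centroids = [tuple(float(v) for v in pixels[i * step]) for i in range(k)]
    for _ in range(iters):
        clusters = _assign(pixels, centroids)
        # Move each centroid to its cluster mean; keep it in place if the cluster is empty.
        centroids = [
            tuple(sum(ch) / len(c) for ch in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    clusters = _assign(pixels, centroids)
    largest = max(range(k), key=lambda i: len(clusters[i]))
    return centroids[largest]


# Three reddish pixels and one blue pixel: the red cluster should dominate.
garment = [(250, 0, 0), (255, 5, 5), (245, 0, 0), (0, 0, 250)]
print(dominant_color(garment, k=2))
```

In practice the pixels would come from the detected garment crop, and a library implementation with randomized restarts would usually replace this hand-rolled loop.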
Collaborative Learning with Multiple Foundation Models for Source-Free Domain Adaptation
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.
comment: 15 pages, 8 figures
ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation
We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.
comment: 16 pages, 5 figures, under review
FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation
Film set design plays a pivotal role in cinematic storytelling and in shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates the professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with this workflow, guided by prompt strategies that ensure parameter accuracy and coherence. In addition, we propose a procedural generation pipeline that executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing, and mood board creation.
MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba fusion module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a "refine-then-fuse" strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
comment: Submitted to IEEE Geoscience and Remote Sensing Letters
When Semantics Regulate: Rethinking Patch Shuffle and Internal Bias for Generated Image Detection with CLIP
The rapid progress of GANs and Diffusion Models poses new challenges for detecting AI-generated images. Although CLIP-based detectors exhibit promising generalization, they often rely on semantic cues rather than generator artifacts, leading to brittle performance under distribution shifts. In this work, we revisit the nature of semantic bias and find that Patch Shuffle provides an unusually strong benefit for CLIP: it disrupts global semantic continuity while preserving local artifact cues, which reduces semantic entropy and homogenizes feature distributions between natural and synthetic images. Through a detailed layer-wise analysis, we further show that CLIP's deep semantic structure functions as a regulator that stabilizes cross-domain representations once semantic bias is suppressed. Guided by these findings, we propose SemAnti, a semantic-antagonistic fine-tuning paradigm that freezes the semantic subspace and adapts only artifact-sensitive layers under shuffled semantics. Despite its simplicity, SemAnti achieves state-of-the-art cross-domain generalization on AIGCDetectBenchmark and GenImage, demonstrating that regulating semantics is key to unlocking CLIP's full potential for robust AI-generated image detection.
comment: 14 pages, 7 figures and 7 tables
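The Patch Shuffle operation discussed in the abstract above can be sketched generically: cut the image into non-overlapping square patches, permute the patches, and reassemble. The version below is an illustration only, not the authors' code; the nested-list image, the `patch` and `seed` parameters, and the requirement that the image sides be divisible by the patch size are assumptions.

```python
import random
from typing import List

Image = List[List[int]]


def patch_shuffle(image: Image, patch: int, seed: int = 0) -> Image:
    """Randomly permute non-overlapping patch x patch blocks of a 2D image."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    # Cut the image into patches, listed in row-major order.
    patches = [
        [row[c * patch:(c + 1) * patch] for row in image[r * patch:(r + 1) * patch]]
        for r in range(rows)
        for c in range(cols)
    ]
    # Permute whole patches; pixels inside each patch keep their arrangement.
    random.Random(seed).shuffle(patches)
    # Reassemble the shuffled patches into a full image, one patch row at a time.
    out: Image = []
    for r in range(rows):
        band = patches[r * cols:(r + 1) * cols]
        for y in range(patch):
            out.append([px for p in band for px in p[y]])
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
for row in patch_shuffle(image, patch=2, seed=1):
    print(row)
```

Because only whole patches move, the global layout is scrambled while the content inside each patch is untouched, which is the property the abstract relies on.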
MonoSR: Open-Vocabulary Spatial Reasoning from Monocular Images
Spatial reasoning (SR), the ability to infer 3D spatial information from 2D inputs, is essential for real-world applications such as embodied AI and autonomous driving. However, existing research primarily focuses on indoor environments and typically relies on multi-view observations, which limits its generalizability to outdoor scenarios and constrains its applicability to monocular images, the most common real-world setting. In this work, we propose MonoSR, a large-scale monocular spatial reasoning dataset that spans diverse scenarios including indoor, outdoor, and object-centric settings, and supports multiple question types. MonoSR provides a path toward open-world monocular spatial reasoning. Beyond introducing the dataset, we evaluate advanced vision-language models to reveal their limitations on this challenging task. We further analyze whether auxiliary information is crucial for monocular spatial reasoning and offer practical guidance for designing future models. These contributions collectively establish a foundation for advancing monocular spatial reasoning in real-world, open-world environments.
3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-submit/3MTI.
comment: 11 pages, 7 figures
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images--we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models--local edits using eight SOTA diffusion models; 3) Multi-turn editing--each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios--a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
comment: 16 pages, 10 figures
HABIT: Human Action Benchmark for Interactive Traffic in CARLA WACV 2026
Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high-fidelity simulation benchmark. HABIT integrates real-world human motion, sourced from mocap and videos, into CARLA (Car Learning to Act, a full autonomous driving simulator) via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic-compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA's Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state-of-the-art AD agents missed by prior evaluations. Evaluating three state-of-the-art autonomous driving agents, InterFuser, TransFuser, and BEVDriver, demonstrates how HABIT exposes planner weaknesses that remain hidden in scripted simulations. Despite achieving zero or near-zero collisions per kilometer on the CARLA Leaderboard, the autonomous agents perform notably worse on HABIT, with up to 7.43 collisions/km and a 12.94% AIS 3+ injury risk, and they brake unnecessarily in up to 33% of cases. All components are publicly released to support reproducible, pedestrian-aware AI research.
comment: Accepted to WACV 2026. This is the pre-camera-ready version
Graph-based 3D Human Pose Estimation using WiFi Signals
WiFi-based human pose estimation (HPE) has attracted increasing attention due to its resilience to occlusion and its privacy-preserving nature compared to camera-based methods. However, existing WiFi-based HPE approaches often employ regression networks that directly map WiFi channel state information (CSI) to 3D joint coordinates, ignoring the inherent topological relationships among human joints. In this paper, we present GraphPose-Fi, a graph-based framework that explicitly models skeletal topology for WiFi-based 3D HPE. Our framework comprises a CNN encoder shared across antennas for subcarrier-time feature extraction, a lightweight attention module that adaptively reweights features over time and across antennas, and a graph-based regression head that combines GCN layers with self-attention to capture local topology and global dependencies. Our proposed method significantly outperforms existing methods on the MM-Fi dataset in various settings. The source code is available at: https://github.com/Cirrick/GraphPose-Fi.
Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach
The widespread use of AIGC content has brought not only unprecedented opportunities but also potential security concerns, e.g., audio-visual deepfakes. Therefore, it is of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues in deepfake detection. In this paper, we reformulate correlation learning with variational Bayesian estimation, where audio-visual correlation is approximated as a Gaussian-distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces from both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. Then, we factorize the variable into modality-specific and correlation-specific ones with an orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods on various benchmarks.
comment: TIFS AQE
DEAP-3DSAM: Decoder Enhanced and Auto Prompt SAM for 3D Medical Image Segmentation
The Segment Anything Model (SAM) has recently demonstrated significant potential in medical image segmentation. Although SAM is primarily trained on 2D images, attempts have been made to apply it to 3D medical image segmentation. However, the pseudo-3D processing used to adapt SAM results in spatial feature loss, limiting its performance. Additionally, most SAM-based methods still rely on manual prompts, which are challenging to implement in real-world scenarios and require extensive external expert knowledge. To address these limitations, we introduce the Decoder Enhanced and Auto Prompt SAM (DEAP-3DSAM). Specifically, we propose a Feature Enhanced Decoder that fuses the original image features with rich and detailed spatial information to enhance spatial features. We also design a Dual Attention Prompter to automatically obtain prompt information through Spatial Attention and Channel Attention. We conduct comprehensive experiments on four public abdominal tumor segmentation datasets. The results indicate that our DEAP-3DSAM achieves state-of-the-art performance in 3D image segmentation, outperforming or matching existing manual prompt methods. Furthermore, both quantitative and qualitative ablation studies confirm the effectiveness of our proposed modules.
comment: Accepted by BIBM 2024
DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
Understanding, Accelerating, and Improving MeanFlow Training
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation
Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably the Segment Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability, lacking mechanisms for autonomous region localization; (2) Scalability, limited fine-grained modeling at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, Grc-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate that Grc-SAM outperforms baseline methods in both accuracy and scalability. It offers a unique granular computational perspective for prompt-free segmentation.
comment: 19 pages, 7 figures
LAA3D: A Benchmark of Detecting and Tracking Low-Altitude Aircraft in 3D Space
Perception of Low-Altitude Aircraft (LAA) in 3D space enables precise 3D object localization and behavior understanding. However, datasets tailored for 3D LAA perception remain scarce. To address this gap, we present LAA3D, a large-scale dataset designed to advance 3D detection and tracking of low-altitude aerial vehicles. LAA3D contains 15,000 real images and 600,000 synthetic frames, captured across diverse scenarios, including urban and suburban environments. It covers multiple aerial object categories, including electric Vertical Take-Off and Landing (eVTOL) aircraft, Micro Aerial Vehicles (MAVs), and Helicopters. Each instance is annotated with a 3D bounding box, class label, and instance identity, supporting tasks such as 3D object detection, 3D multi-object tracking (MOT), and 6-DoF pose estimation. In addition, we establish the LAA3D Benchmark, integrating multiple tasks and methods with unified evaluation protocols for comparison. Furthermore, we propose MonoLAA, a monocular 3D detection baseline, achieving robust 3D localization from zoom cameras with varying focal lengths. Models pretrained on synthetic images transfer effectively to real-world data with fine-tuning, demonstrating strong sim-to-real generalization. Our LAA3D provides a comprehensive foundation for future research in low-altitude 3D object perception.
comment: 25 pages
Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation
Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model's predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises from small reward margins between chosen and rejected samples, and (2) Suboptimal Maximization, caused by large reward margins. Informed by these insights, we introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to effectively mitigate likelihood displacement. Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations, offering a robust solution for improving preference alignment in video generation tasks.
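For context, likelihood displacement is usually stated against the standard DPO objective from the autoregressive setting. The form below is that generic objective, not the diffusion-specific analysis or the PG-DPO loss from the paper; the symbols ($\pi_\theta$, $\pi_{\mathrm{ref}}$, $\beta$, $y_w$ for the chosen sample, $y_l$ for the rejected sample) are the conventional ones and are not taken from the abstract.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The loss depends only on the difference of the two log-ratios, so it can decrease while $\log \pi_\theta(y_w \mid x)$ itself falls, provided $\log \pi_\theta(y_l \mid x)$ falls faster. That is the displacement effect the abstract refers to.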
MedSAM3: Delving into Segment Anything with Medical Concepts
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
CSD: Change Semantic Detection with only Semantic Change Masks for Damage Assessment in Conflict Zones
Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. We also release a new Gaza-change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. Unlike conventional semantic change detection (SCD), our approach eliminates the need for large-scale semantic annotations of bi-temporal images, instead focusing directly on the changed regions. We term this new task change semantic detection (CSD). The CSD task represents a direct extension of binary change detection (BCD). Due to the limited spatial extent of semantic regions, it presents greater challenges than traditional SCD tasks. We evaluated our method under the CSD framework on both the Gaza-Change and SECOND datasets. Experimental results demonstrate that our proposed approach effectively addresses the CSD task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.
ReEXplore: Improving MLLMs for Embodied Exploration with Contextualized Retrospective Experience Replay
Embodied exploration is a target-driven process that requires embodied agents to possess fine-grained perception and knowledge-enhanced decision making. While recent attempts leverage MLLMs for exploration due to their strong perceptual and reasoning abilities, we find that MLLM-based embodied agents remain suboptimal in exploring new environments: (i) they rely on profound but stale pre-trained knowledge, (ii) training-based approaches such as imitation learning or reinforcement learning are expensive for long-horizon tasks with sparse outcome rewards, and (iii) frontier-based exploration yields a large, visually nuanced action space in which it is difficult for MLLMs to make reliable decisions. We address these challenges with ReEXplore, a training-free framework that performs retrospective experience replay to inject distilled, abstract experience at inference time, and hierarchical frontier selection to decompose frontier ranking into coarse-to-fine decisions. Our approach enables robust, traceable, and efficient exploration. Across multiple embodied exploration benchmarks, ReEXplore yields substantial improvements over strong MLLM baselines, up to 3x higher performance in both success rate and navigation efficiency under open-source backbones.
comment: 8 main pages plus 13 pages Appendix
Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric
Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider the prediction inconsistency under corruption and the semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, the distinct failure and recovery patterns across models can be revealed.
comment: 15 pages
Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking their unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced layer interaction and MoE-based feature decoupling, termed Life-IQA. Specifically, the GCN-enhanced layer interaction module uses the GCN-enhanced deepest-layer features as the query and the penultimate-layer features as the key and value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows a more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code is available at: https://github.com/TANGLONG2/Life-IQA/tree/main.
Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting
Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; (2) Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
comment: 10 pages, 7 figures
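The three complexity cues named in the Grc-ViT abstract (edge density, entropy, frequency-domain content) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration: the function name `complexity_cues`, the gradient threshold, the histogram-based entropy, and the radial cutoff for "high" frequencies are not taken from the paper, and the mapping from these cues to patch and window sizes is not shown because the abstract does not describe it.

```python
import numpy as np

def complexity_cues(img, edge_thresh=0.1, bins=32):
    """Return (edge_density, entropy_bits, high_freq_ratio) for a 2D image in [0, 1].

    Hypothetical helper: the definitions below are illustrative choices,
    not the ones used by Grc-ViT.
    """
    # Edge density: fraction of pixels whose gradient magnitude exceeds a threshold.
    gy, gx = np.gradient(img)
    edge_density = float((np.hypot(gx, gy) > edge_thresh).mean())

    # Entropy (in bits) of the intensity histogram.
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    entropy_bits = float(-(p * np.log2(p)).sum())

    # Frequency cue: share of spectral power beyond a radial cutoff.
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    total = power.sum()
    high_freq_ratio = float(power[radius > min(h, w) / 4].sum() / total) if total > 0 else 0.0

    return edge_density, entropy_bits, high_freq_ratio
```

On a constant image all three cues are (numerically) zero, while a noisy image scores higher on each, which is the kind of signal a granularity selector could act on.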
A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.
AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization WACV 2026
With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
comment: WACV 2026
View-Consistent Diffusion Representations for 3D-Consistent Video Generation
Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and film making. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.
Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning
Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection. Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.
comment: 24-page manuscript accepted to IJCV
Zero-shot segmentation of skin tumors in whole-slide images with vision-language foundation models
Accurate annotation of cutaneous neoplasm biopsies represents a major challenge due to their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Vision-language foundation models (VLMs), pre-trained on paired image-text corpora, learn joint representations that bridge visual features and diagnostic terminology, enabling zero-shot localization and classification of tissue regions without pixel-level labels. However, most existing VLM applications in histopathology remain limited to slide-level tasks or rely on coarse interactive prompts, and they struggle to produce fine-grained segmentations across gigapixel whole-slide images (WSIs). In this work, we introduce a zero-shot visual-language segmentation pipeline for whole-slide images (ZEUS), a fully automated, zero-shot segmentation framework that leverages class-specific textual prompt ensembles and frozen VLM encoders to generate high-resolution tumor masks in WSIs. By partitioning each WSI into overlapping patches, extracting visual embeddings, and computing cosine similarities against text prompts, we generate a final segmentation mask. We demonstrate competitive performance on two in-house datasets, primary spindle cell neoplasms and cutaneous metastases, highlighting the influence of prompt design, domain shifts, and institutional variability in VLMs for histopathology. ZEUS markedly reduces annotation burden while offering scalable, explainable tumor delineation for downstream diagnostic workflows.
comment: Conference manuscript accepted for oral presentation at CASEIB 2025
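The core scoring step described in the ZEUS abstract (patch embeddings compared by cosine similarity against class-specific prompt ensembles) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the embeddings are taken as given NumPy arrays rather than produced by a real VLM encoder, averaging the normalized prompt embeddings per class is an assumed (though common) way to form the ensemble, and the function names are hypothetical. How overlapping patch predictions are merged into the final mask is not specified in the abstract and is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_patch_labels(patch_emb, class_prompt_embs):
    """Assign each patch to the class whose prompt ensemble it is most similar to.

    patch_emb:         (num_patches, dim) visual embeddings.
    class_prompt_embs: list of (num_prompts_c, dim) text embeddings, one array per class.
    Returns (labels, sims) with sims of shape (num_patches, num_classes).
    """
    patches = l2_normalize(patch_emb)
    # Assumed ensemble rule: mean of unit-norm prompt embeddings, re-normalized.
    class_vecs = np.stack(
        [l2_normalize(l2_normalize(p).mean(axis=0)) for p in class_prompt_embs]
    )
    sims = patches @ class_vecs.T  # cosine similarity, since both sides are unit-norm
    return sims.argmax(axis=1), sims
```

Arranging the per-patch labels back onto the slide grid would then give a coarse segmentation map at patch resolution.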
Peregrine: One-Shot Fine-Tuning for FHE Inference of General Deep CNNs
We address two fundamental challenges in adapting general deep CNNs for FHE-based inference: approximating non-linear activations such as ReLU with low-degree polynomials while minimizing accuracy degradation, and overcoming the ciphertext capacity barrier that constrains high-resolution image processing on FHE inference. Our contributions are twofold: (1) a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into FHE-friendly forms using low-degree polynomials, achieving competitive accuracy with minimal training overhead; and (2) a generalized interleaved packing (GIP) scheme that is compatible with feature maps of virtually arbitrary spatial resolutions, accompanied by a suite of carefully designed homomorphic operators that preserve the GIP-form encryption throughout computation. These advances enable efficient, end-to-end FHE inference across diverse CNN architectures. Experiments on CIFAR-10, ImageNet, and MS COCO demonstrate that the FHE-friendly CNNs obtained via our SFT strategy achieve accuracy comparable to baselines using ReLU or SiLU activations. Moreover, this work presents the first demonstration of FHE-based inference for YOLO architectures in object detection leveraging low-degree polynomial activations.
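To make "approximating ReLU with low-degree polynomials" concrete, the sketch below fits a degree-2 polynomial to ReLU by least squares on a fixed interval. This is only an illustration of the approximation problem: the least-squares fit, the interval `[-4, 4]`, and the function names are assumptions, and the paper's SFT strategy (which fine-tunes the network) is not reproduced here.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def fit_relu_poly(degree=2, bound=4.0, num_points=2001):
    """Least-squares polynomial fit to ReLU on [-bound, bound].

    Returns coefficients ordered from the constant term upward.
    """
    x = np.linspace(-bound, bound, num_points)
    relu = np.maximum(x, 0.0)
    return P.polyfit(x, relu, degree)

def poly_activation(x, coeffs):
    """Evaluate the polynomial activation; uses only additions and multiplications."""
    return P.polyval(x, coeffs)
```

Because ReLU(x) = (x + |x|) / 2, the fitted linear coefficient on a symmetric interval comes out at 0.5, and the constant and quadratic terms approximate |x| / 2. The approximation error grows quickly outside the fitted interval, which is one reason accuracy degradation has to be managed when activations are replaced this way.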
CataractCompDetect: Intraoperative Complication Detection in Cataract Surgery
Cataract surgery is one of the most commonly performed surgeries worldwide, yet intraoperative complications such as iris prolapse, posterior capsule rupture (PCR), and vitreous loss remain major causes of adverse outcomes. Automated detection of such events could enable early warning systems and objective training feedback. In this work, we propose CataractCompDetect, a complication detection framework that combines phase-aware localization, SAM 2-based tracking, complication-specific risk scoring, and vision-language reasoning for final classification. To validate CataractCompDetect, we curate CataComp, the first cataract surgery video dataset annotated for intraoperative complications, comprising 53 surgeries, including 23 with clinical complications. On CataComp, CataractCompDetect achieves an average F1 score of 70.63%, with per-complication performance of 81.8% (Iris Prolapse), 60.87% (PCR), and 69.23% (Vitreous Loss). These results highlight the value of combining structured surgical priors with vision-language reasoning for recognizing rare but high-impact intraoperative events. Our dataset and code will be publicly released upon acceptance.
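The reported average F1 is consistent with an unweighted mean of the three per-complication scores. The short check below assumes that reading (the abstract says "average" without specifying the weighting); the function name is illustrative.

```python
def macro_average(scores):
    """Unweighted mean of per-class scores."""
    values = list(scores.values())
    return sum(values) / len(values)

# Per-complication F1 scores (%) as reported in the abstract.
per_complication_f1 = {
    "Iris Prolapse": 81.8,
    "PCR": 60.87,
    "Vitreous Loss": 69.23,
}

print(f"{macro_average(per_complication_f1):.2f}")  # prints 70.63
```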
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). Such a history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that actively select task-relevant visual tokens based on historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
comment: 18 pages, 10 figures
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, including high-fidelity images with detailed close-ups and textual descriptions. Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric, VGID (Video Garment Inception Distance), that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.
Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge, simultaneously preserving both the holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Experimentally, extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. The real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.
comment: 11 pages, 5 figures
Leveraging Adversarial Learning for Pathological Fidelity in Virtual Staining
In addition to evaluating tumor morphology using H&E staining, immunohistochemistry is used to assess the presence of specific proteins within the tissue. However, this is a costly and labor-intensive technique, for which virtual staining, as an image-to-image translation task, offers a promising alternative. This is a recent and rapidly emerging field of research, with 64% of published studies appearing in 2024 alone. Most studies use publicly available datasets of H&E-IHC pairs from consecutive tissue sections. Recognizing the training challenges, many authors develop complex virtual staining models based on conditional Generative Adversarial Networks, but ignore the impact of adversarial loss on the quality of virtual staining. Furthermore, overlooking the issues of model evaluation, they claim improved performance based on metrics such as SSIM and PSNR, which are not sufficiently robust to evaluate the quality of virtually stained images. In this paper, we develop CSSP2P GAN, which we demonstrate to achieve heightened pathological fidelity through a blind pathological expert evaluation. Furthermore, while iteratively developing our model, we study the impact of the adversarial loss and demonstrate its crucial role in the quality of virtually stained images. Finally, while comparing our model with reference works in the field, we underscore the limitations of the currently used evaluation metrics and demonstrate the superior performance of CSSP2P GAN.
VeCoR - Velocity Contrastive Regularization for Flow Matching
Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." Formally, we propose Velocity Contrastive Regularization (VeCoR), a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256x256, VeCoR yields 22% and 35% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/
Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search
Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that directly assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic? In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.
comment: 10 pages, 9 figures
FineXtrol: Controllable Motion Generation via Fine-Grained Text AAAI 2026
Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.
comment: 20 pages, 14 figures, AAAI 2026
AttenDence: Maximizing Attention Confidence for Test Time Adaptation
Test-time adaptation (TTA) enables models to adapt to distribution shifts at inference time. While entropy minimization over the output distribution has proven effective for TTA, transformers offer an additional unsupervised learning signal through their attention mechanisms. We propose minimizing the entropy of attention distributions from the CLS token to image patches as a novel TTA objective. This approach encourages the model to attend more confidently to relevant image regions under distribution shift and is effective even when only a single test image is available. We demonstrate that attention entropy minimization improves robustness across diverse corruption types while not hurting performance on clean data on a single sample stream of images at test time.
comment: Initial submission. 5 pages, 4 figures
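As a rough illustration of the attention-entropy quantity described in the AttenDence abstract above, the following minimal sketch computes the Shannon entropy of a CLS-to-patch attention distribution. The function names and the toy logits are hypothetical and are not taken from the paper; an actual test-time adaptation loop would minimize this quantity with gradient updates inside a vision transformer, which is not shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(cls_to_patch_scores):
    """Shannon entropy (in nats) of the CLS-to-patch attention distribution."""
    probs = softmax(cls_to_patch_scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical raw attention logits from the CLS token to four image patches.
diffuse = [0.0, 0.0, 0.0, 0.0]  # uniform attention: maximal entropy, log(4)
peaked = [8.0, 0.0, 0.0, 0.0]   # confident attention on one patch: low entropy

print(attention_entropy(diffuse))
print(attention_entropy(peaked))
```

Lower entropy corresponds to attention concentrated on fewer patches, which is the behavior the objective is said to encourage.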
One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control
We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D
comment: Project page: https://mizhenxing.github.io/One4D
BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models
Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce BackdoorVLM, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1% yielding over 90% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM.
EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
comment: 8 pages, 7 figures
Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit to noisy feedback. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
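For context on the baseline that the BPGO abstract above builds on, the following minimal sketch shows the group-relative advantage normalization commonly used in GRPO: each sample's reward is centered and scaled by the statistics of its own group. It does not reproduce BPGO's Bayesian trust allocation or prior-anchored renormalization, whose exact form is not given in the abstract. The function name, the epsilon value, the use of the population standard deviation, and the toy rewards are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of samples for a single prompt.

    Assumption: population standard deviation is used here; implementations
    differ on population versus sample standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward-model scores for four generations of the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When the rewards in a group are nearly identical, the advantages collapse toward zero, which is one way to see why weakly discriminative reward signals provide little learning signal.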
MatMart: Material Reconstruction of 3D Objects via Diffusion
Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that MatMart achieves superior performance in material reconstruction compared to existing methods.
MetaDCSeg: Robust Medical Image Segmentation via Meta Dynamic Center Weighting
Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.
MFmamba: A Multi-function Network for Panchromatic Image Resolution Restoration Based on State-Space Model AAAI-2026
Remote sensing images are becoming increasingly widespread in military applications and earth resource exploration. Because of the limitation of a single sensor, we can obtain high spatial resolution grayscale panchromatic (PAN) images and low spatial resolution color multispectral (MS) images. Therefore, an important issue is to obtain a color image with high spatial resolution when there is only a PAN image at the input. The existing methods improve spatial resolution using super-resolution (SR) technology and spectral recovery using colorization technology. However, the SR technique cannot improve the spectral resolution, and the colorization technique cannot improve the spatial resolution. Moreover, the pansharpening method needs two registered inputs and cannot achieve SR. As a result, an integrated approach is expected. To solve the above problems, we designed a novel multi-function model (MFmamba) to realize the tasks of SR, spectral recovery, and joint SR and spectral recovery through three different inputs. Firstly, MFmamba utilizes UNet++ as the backbone, and a Mamba Upsample Block (MUB) is combined with UNet++. Secondly, a Dual Pool Attention (DPA) is designed to replace the skip connection in UNet++. Finally, a Multi-scale Hybrid Cross Block (MHCB) is proposed for initial feature extraction. Many experiments show that MFmamba is competitive in evaluation metrics and visual results and performs well in the three tasks when only the input PAN image is used.
comment: 9 pages, 9 figures. This paper has been accepted for publication in AAAI-2026
MagicWorld: Interactive Geometry-driven Video World Exploration
Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
Facade Segmentation for Solar Photovoltaic Suitability NeurIPS 2025
Building integrated photovoltaic (BIPV) facades represent a promising pathway towards urban decarbonization, especially where roof areas are insufficient and ground-mounted arrays are infeasible. Although machine learning-based approaches to support photovoltaic (PV) planning on rooftops are well researched, automated approaches for facades still remain scarce and oversimplified. This paper therefore presents a pipeline that integrates detailed information on the architectural composition of the facade to automatically identify suitable surfaces for PV application and estimate the solar energy potential. The pipeline fine-tunes SegFormer-B5 on the CMP Facades dataset and converts semantic predictions into facade-level PV suitability masks and PV panel layouts considering module sizes and clearances. Applied to a dataset of 373 facades with known dimensions from ten cities, the results show that installable BIPV potential is significantly lower than theoretical potential, thus providing valuable insights for reliable urban energy planning. With the growing availability of facade imagery, the proposed pipeline can be scaled to support BIPV planning in cities worldwide.
comment: NeurIPS 2025 Tackling Climate Change with Machine Learning Workshop version. Non-archival
Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and high-resolution images contribute thousands of visual tokens. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving 1.77x speedup and 70% FLOPs reduction.
GContextFormer: A global context-aware hybrid multi-head attention approach with scaled additive aggregation for multimodal trajectory prediction
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation, achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds a scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with a gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from the TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
Neural Texture Splatting: Expressive 3D Gaussian Splatting for View Synthesis, Geometry, and Dynamic Reconstruction SIGGRAPH
3D Gaussian Splatting (3DGS) has emerged as a leading approach for high-quality novel view synthesis, with numerous variants extending its applicability to a broad spectrum of 3D and 4D scene reconstruction tasks. Despite its success, the representational capacity of 3DGS remains limited by the use of 3D Gaussian kernels to model local variations. Recent works have proposed to augment 3DGS with additional per-primitive capacity, such as per-splat textures, to enhance its expressiveness. However, these per-splat texture approaches primarily target dense novel view synthesis with a reduced number of Gaussian primitives, and their effectiveness tends to diminish when applied to more general reconstruction scenarios. In this paper, we aim to achieve concrete performance improvement over state-of-the-art 3DGS variants across a wide range of reconstruction tasks, including novel view synthesis, geometry and dynamic reconstruction, under both sparse and dense input settings. To this end, we introduce Neural Texture Splatting (NTS). At the core of our approach is a global neural field (represented as a hybrid of a tri-plane and a neural decoder) that predicts local appearance and geometric fields for each primitive. By leveraging this shared global representation that models local texture fields across primitives, we significantly reduce model size and facilitate efficient global information exchange, demonstrating strong generalization across tasks. Furthermore, our neural modeling of local texture fields introduces expressive view- and time-dependent effects, a critical aspect that existing methods fail to account for. Extensive experiments show that Neural Texture Splatting consistently improves models and achieves state-of-the-art results across multiple benchmarks.
comment: SIGGRAPH Asia 2025 (conference track), Project page: https://19reborn.github.io/nts/
HunyuanVideo 1.5 Technical Report
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
DualGazeNet: A Biologically Inspired Dual-Gaze Query Network for Salient Object Detection
Recent salient object detection (SOD) methods aim to improve performance in four key directions: semantic enhancement, boundary refinement, auxiliary task supervision, and multi-modal fusion. In pursuit of continuous gains, these approaches have evolved toward increasingly sophisticated architectures with multi-stage pipelines, specialized fusion modules, edge-guided learning, and elaborate attention mechanisms. However, this complexity paradoxically introduces feature redundancy and cross-component interference that obscure salient cues, ultimately reaching performance bottlenecks. In contrast, human vision achieves efficient salient object identification without such architectural complexity. This contrast raises a fundamental question: can we design a biologically grounded yet architecturally simple SOD framework that dispenses with most of this engineering complexity, while achieving state-of-the-art accuracy, computational efficiency, and interpretability? In this work, we answer this question affirmatively by introducing DualGazeNet, a biologically inspired pure Transformer framework that models the dual biological principles of robust representation learning and magnocellular-parvocellular dual-pathway processing with cortical attention modulation in the human visual system. Extensive experiments on five RGB SOD benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods. On average, DualGazeNet achieves about 60% higher inference speed and 53.4% fewer FLOPs than four Transformer-based baselines of similar capacity (VST++, MDSAM, Sam2unet, and BiRefNet). Moreover, DualGazeNet exhibits strong cross-domain generalization, achieving leading or highly competitive performance on camouflaged and underwater SOD benchmarks without relying on additional modalities.
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
We introduce SketchDeco, a training-free approach to sketch colourisation that bridges the gap between professional design needs and intuitive, region-based control. Our method empowers artists to use simple masks and colour palettes for precise spatial and chromatic specification, avoiding both the tediousness of manual assignment and the ambiguity of text-based prompts. We reformulate this task as a novel, training-free composition problem. Our core technical contribution is a guided latent-space blending process: we first leverage diffusion inversion to precisely "paint" user-defined colours into specified regions, and then use a custom self-attention mechanism to harmoniously blend these local edits with a globally consistent base image. This ensures both local colour fidelity and global harmony without requiring any model fine-tuning. Our system produces high-quality results in 15-20 inference steps on consumer GPUs, making professional-quality, controllable colourisation accessible.
comment: Project Page: https://chaitron.github.io/SketchDeco/
The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet NeurIPS 2025
Despite their success, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. These shortcomings can be traced to a lack of inductive biases that reflect the inherent geometric structure of the visual world. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural and computational principles, which evolved to internalize these structures, may offer a blueprint for more capable artificial vision. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet is framed as a geometric framework that emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation for learning disentangled representations, and top-down predictive feedback for representation refinement. We interpret these mechanisms through the lens of geometry and dynamical systems, positing that they guide the learning of structured, low-dimensional neural manifolds. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset, which probes sensitivity to natural textures, and a light field image classification task, which requires processing higher-dimensional visual data. Our results show that VCNet achieves state-of-the-art accuracy of 92.1% on Spots-10 and 74.4% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating high-level neuroscientific principles, viewed through a geometric lens, can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.
comment: Published in the proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Symmetry and Geometry in Neural Representations (NeurReps). Additionally accepted for presentation in NeurIPS 2025 Workshop: Interpreting Cognition in Deep Learning Models (CogInterp)
A Target-based Multi-LiDAR Multi-Camera Extrinsic Calibration System
Extrinsic calibration is a cornerstone of autonomous driving. Its accuracy plays a crucial role in the perception pipeline, as any errors can have implications for the safety of the vehicle. Modern sensor systems collect different types of data from the environment, making it harder to align the data. To this end, we propose a target-based extrinsic calibration system tailored for a multi-LiDAR and multi-camera sensor suite. This system enables cross-calibration between LiDARs and cameras with limited prior knowledge using a custom ChArUco board and a tailored nonlinear optimization method. We test the system with real-world data gathered in a warehouse. The results demonstrate the effectiveness of the proposed method, highlighting the feasibility of a unique pipeline tailored for various types of sensors.
comment: RiTA 2025 Accepted, 13 Pages, 6 Figures and 2 Tables
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
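The two-stage procedure described in this abstract (explore clips as bandit arms, then pick top frames inside the best clips) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the precomputed `clip_scores` (standing in for a vision-language relevance scorer with outputs in [0, 1]), and the particular empirical Bernstein radius (one commonly used form) are all assumptions made for illustration.

```python
import math
import random


def bernstein_radius(values, delta=0.05, value_range=1.0):
    """One common empirical Bernstein confidence radius (assumed form)."""
    n = len(values)
    if n == 0:
        return float("inf")  # an unexplored arm is maximally uncertain
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def select_keyframes(clip_scores, pulls, top_clips, budget, seed=0):
    """clip_scores[a][f] is a hypothetical relevance score of frame f in clip a."""
    rng = random.Random(seed)
    unseen = [list(range(len(clip))) for clip in clip_scores]
    for frames in unseen:
        rng.shuffle(frames)
    observed = [[] for _ in clip_scores]

    def upper_bound(arm):
        obs = observed[arm]
        if not obs:
            return float("inf")
        return sum(obs) / len(obs) + bernstein_radius(obs)

    # Stage 1: optimistic exploration; each pull scores one unseen frame of a clip.
    for _ in range(pulls):
        candidates = [a for a in range(len(clip_scores)) if unseen[a]]
        if not candidates:
            break
        arm = max(candidates, key=upper_bound)
        frame = unseen[arm].pop()
        observed[arm].append(clip_scores[arm][frame])

    def empirical_mean(arm):
        return sum(observed[arm]) / len(observed[arm]) if observed[arm] else 0.0

    # Stage 2: keep the best clips, then the top-scoring frames inside them.
    # A real pipeline would score these frames now; here the scores are precomputed.
    best = sorted(range(len(clip_scores)), key=empirical_mean, reverse=True)[:top_clips]
    ranked = sorted(
        ((clip_scores[a][f], a, f) for a in best for f in range(len(clip_scores[a]))),
        reverse=True,
    )
    return sorted((a, f) for _, a, f in ranked[:budget])
```

For example, with three clips of eight frames where only the middle clip is relevant, `select_keyframes(clips, pulls=12, top_clips=1, budget=2)` returns two `(clip, frame)` pairs from that clip after scoring only half of the frames during exploration.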
InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information
Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires information conversion procedures to be varied for generating variable-scaled images. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust the information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with variable-scaled images. To solve the above problems, we propose InfoScale, an information-centric framework for variable-scaled image generation that effectively utilizes information from these three aspects. For information loss in 1), we introduce a Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For information aggregation inflexibility in 2), we introduce an Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For information distribution misalignment in 3), we design a Noise Adaptation module to re-distribute information in the initial noise for variable-scaled generation. Our method is plug-and-play for DMs, and extensive experiments demonstrate its effectiveness in variable-scaled image generation.
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature integration and intelligent pre-training with synthetic data. Comprehensive evaluations on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that VideoLights significantly surpasses existing baselines, establishing new state-of-the-art performances. Codes and model checkpoints are available at https://github.com/dpaul06/VideoLights .
Optimization-Free Style Transfer for 3D Gaussian Splats
The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats, allowing for direct stylization on a .ply or .splat file without requiring the original camera views. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This also allows for fast stylization of splats with no additional training, achieving speeds under 2 minutes even on CPU-based consumer hardware. We demonstrate the quality results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.
Zero-Shot Coreset Selection via Iterative Subspace Sampling WACV 2026
Deep learning increasingly relies on massive data with substantial storage, annotation, and training costs. To reduce costs, coreset selection finds a representative subset of data to train models while ideally performing on par with the full data training. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance but without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously-trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.
comment: WACV 2026
Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
comment: 10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning
Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems
In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data. Code is available at https://github.com/jwen307/multi_target_minimax.
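For orientation, joint marginal coverage over several targets can already be obtained with a simple split-conformal baseline that takes the maximum scaled residual across targets as the nonconformity score. The sketch below shows only that generic baseline, not the minimax method proposed in the paper; the function names and the fixed per-target `scales` are illustrative assumptions.

```python
import math


def joint_conformal_radius(cal_preds, cal_truths, scales, alpha=0.1):
    """Calibrate one shared radius so all targets are covered jointly.

    cal_preds and cal_truths are lists of per-sample target vectors;
    scales holds one positive normalizer per target (assumed given).
    """
    scores = sorted(
        max(abs(t - p) / s for p, t, s in zip(pred, truth, scales))
        for pred, truth in zip(cal_preds, cal_truths)
    )
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration samples for this alpha
    return scores[rank - 1]


def predict_intervals(pred, scales, radius):
    """Per-target intervals that jointly contain the truth with prob. >= 1 - alpha."""
    return [(p - radius * s, p + radius * s) for p, s in zip(pred, scales)]
```

Because a single radius is shared across targets, the tightness of each interval depends entirely on how the scales are chosen, which is the kind of design question a minimax formulation is meant to address.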
Multiview point cloud registration with anisotropic and space-varying localization noise
In this paper, we address the problem of registering multiple point clouds corrupted with high anisotropic localization noise. Our approach follows the widely used framework of Gaussian mixture model (GMM) reconstruction with an expectation-maximization (EM) algorithm. Existing methods are based on an implicit assumption of space-invariant isotropic Gaussian noise. However, this assumption is violated in practice in applications such as single molecule localization microscopy (SMLM). To address this issue, we propose to introduce an explicit localization noise model that decouples shape modeling with the GMM from noise handling. We design a stochastic EM algorithm that considers noise-free data as a latent variable, with closed-form solutions at each EM step. The first advantage of our approach is to handle space-variant and anisotropic Gaussian noise with arbitrary covariances. The second advantage is to leverage the explicit noise model to impose prior knowledge about the noise that may be available from physical sensors. We show on various simulated data that our noise handling strategy improves significantly the robustness to high levels of anisotropic noise. We also demonstrate the performance of our method on real SMLM data.
Automatic Multi-View X-Ray/CT Registration Using Bone Substructure Contours
Purpose: Accurate intraoperative X-ray/CT registration is essential for surgical navigation in orthopedic procedures. However, existing methods struggle to consistently achieve sub-millimeter accuracy, lack robustness under broad initial pose estimates, or need manual key-point annotations. This work aims to address these challenges by proposing a novel multi-view X-ray/CT registration method for intraoperative bone registration. Methods: The proposed registration method consists of a multi-view, contour-based iterative closest point (ICP) optimization. Unlike previous methods, which attempt to match bone contours across the entire silhouette in both imaging modalities, we focus on matching specific subcategories of contours corresponding to bone substructures. This leads to reduced ambiguity in the ICP matches, resulting in a more robust and accurate registration solution. This approach requires only two X-ray images and operates fully automatically. Additionally, we contribute a dataset of 5 cadaveric specimens, including real X-ray images, X-ray image poses and the corresponding CT scans. Results: The proposed registration method is evaluated on real X-ray images using mean reprojection error (mRPD). The method consistently achieves sub-millimeter accuracy with an mRPD of 0.67 mm, compared to 5.35 mm for a commercial solution requiring manual intervention. Furthermore, the method offers improved practical applicability, being fully automatic. Conclusion: Our method offers a practical, accurate, and efficient solution for multi-view X-ray/CT registration in orthopedic surgeries, which can be easily combined with tracking systems. By improving registration accuracy and minimizing manual intervention, it enhances intraoperative navigation, contributing to more accurate and effective surgical outcomes in computer-assisted surgery (CAS).
comment: This paper was accepted to IPCAI 2025. The Project Webpage is: https://rflepp.github.io/BoneSubstructureContours2D3DRegistration/
MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis in Chest X-Ray
Recent vision-language foundation models deliver state-of-the-art results in natural image classification, but falter in medical images due to pronounced domain shifts. Training a medical foundation model also requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that flexibly re-purposes arbitrary pre-trained foundation VLMs for medical image diagnosis. MedBridge comprises three novel core components. First, a Focal Sampling module that subsamples and extracts high-resolution local regions to capture subtle pathological features, compensating for the limited input resolution of foundation VLMs. Second, a Query-Encoder model with a small set of learnable queries to align the feature maps of frozen VLMs with medical semantics, without requiring retraining of the backbone layers. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of various VLMs to maximize diagnostic performance. We evaluate MedBridge on five chest radiograph benchmarks in three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings under varying levels of training data availability. MedBridge achieved an improvement of 6-15% in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging diverse foundation models for accurate and data-efficient medical diagnosis. Our project and code are available at https://github.com/ai-med/MedBridge.
CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling
We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.
comment: project page at https://cupid3d.github.io
Don't Reach for the Stars: Rethinking Topology for Resilient Federated Learning
Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy by keeping data local. Traditional FL approaches rely on a centralized, star-shaped topology, where a central server aggregates model updates from clients. However, this architecture introduces several limitations, including a single point of failure, limited personalization, and poor robustness to distribution shifts or vulnerability to malfunctioning clients. Moreover, update selection in centralized FL often relies on low-level parameter differences, which can be unreliable when client data is not independent and identically distributed, and offers clients little control. In this work, we propose a decentralized, peer-to-peer (P2P) FL framework. It leverages the flexibility of the P2P topology to enable each client to identify and aggregate a personalized set of trustworthy and beneficial updates. This framework is the Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization (LIGHTYEAR). Central to our method is an agreement score, computed on a local validation set, which quantifies the semantic alignment of incoming updates in the function space with respect to the client's reference model. Each client uses this score to select a tailored subset of updates and performs aggregation with a regularization term that further stabilizes the training. Our empirical evaluation across five datasets shows that the proposed approach consistently outperforms both centralized baselines and existing P2P methods in terms of client-level performance, particularly under adversarial and heterogeneous conditions.
Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions
The robustness of deep neural networks is a crucial factor in safety-critical applications, particularly in complex and dynamic environments (e.g., medical or driving scenarios) where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole-image natural or adversarial corruptions, a comprehensive investigation into the spatial robustness of dense vision models under localized corruptions remains underexplored. This paper fills this gap by introducing novel, region-aware metrics for benchmarking the spatial robustness of segmentation models, along with an evaluation framework to assess the impact of natural localized corruptions. Furthermore, it uncovers the inherent complexity of evaluating worst-case spatial robustness using only a single localized adversarial attack. To address this, the work proposes a region-aware multi-attack adversarial analysis to systematically assess model robustness across specific image regions. The proposed metrics and analysis were exploited to evaluate 14 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer-based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones, and vice versa for CNN-based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.
comment: Accepted for publication in Pattern Recognition
In-Situ Tweedie Discrete Diffusion Models
While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
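The denoising loop described in this abstract (noise a one-hot vector, predict class probabilities, take the argmax, re-encode as one-hot, and re-noise at a lower level) can be sketched as follows. This is a toy illustration under stated assumptions: `toy_denoiser` is a hypothetical stand-in for the trained, timestep-conditioned network, the noise schedule `sigmas` is arbitrary, and a single categorical variable is used instead of a token sequence or image.

```python
import math
import random


def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def add_gaussian_noise(vec, sigma, rng):
    """Corrupt a one-hot vector directly with Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in vec]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_denoiser(noisy, sigma):
    """Hypothetical placeholder for the trained network: sharper when noise is low."""
    scale = 1.0 / max(sigma * sigma, 1e-6)
    return softmax([scale * x for x in noisy])


def cross_entropy(probs, target_index):
    """Training loss on predicted class probabilities (instead of an MSE loss)."""
    return -math.log(max(probs[target_index], 1e-12))


def sample(num_classes, sigmas, denoiser=toy_denoiser, seed=0):
    """Iterative denoising: predict, argmax, one-hot, re-noise with less noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(num_classes)]  # start from noise
    prediction = None
    for step, sigma in enumerate(sigmas):
        probs = denoiser(x, sigma)
        prediction = max(range(num_classes), key=probs.__getitem__)
        next_sigma = sigmas[step + 1] if step + 1 < len(sigmas) else 0.0
        x = add_gaussian_noise(one_hot(prediction, num_classes), next_sigma, rng)
    return prediction, x
```

Calling `sample(5, [1.0, 0.5, 0.1])` returns a class index together with its clean one-hot vector, since the last step adds zero noise.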
ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification
Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
U-REPA: Aligning Diffusion U-Nets to ViTs
Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which shows faster convergence than DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To counter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows. First, we observe that, owing to skip connections, the middle stage of U-Net is the best alignment option. Second, we propose upsampling U-Net features after passing them through MLPs. Third, we observe difficulty when performing tokenwise similarity alignment, and therefore introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA can achieve excellent generation quality and greatly accelerates convergence. With CFG guidance interval, U-REPA can reach FID < 1.5 in 200 epochs or 1M iterations on ImageNet 256 × 256, and needs only half the total epochs to perform better than REPA under sd-vae-ft-ema. Code: https://github.com/YuchuanTian/U-REPA
comment: 22 pages, 7 figures
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
Splats in Splats: Robust and Effective 3D Steganography towards Gaussian Splatting AAAI 2026
3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical deployment. Here we describe splats in splats, the first 3DGS steganography framework that embeds 3D content in 3DGS itself without modifying any attributes. To achieve this, we take a deep insight into spherical harmonics (SH) and devise an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. Furthermore, we employ a convolutional autoencoder to establish a mapping between the original Gaussian primitives' opacity and the hidden Gaussian primitives' opacity. Extensive experiments indicate that our method significantly outperforms existing 3D steganography techniques, with 5.31% higher scene fidelity and 3x faster rendering speed, while ensuring security, robustness, and user experience.
comment: Accepted by AAAI 2026
HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models
Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.
TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection
Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
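The assignment step described in this abstract (an optimal transport plan from visual tokens to textual subspaces, followed by top-k sparsification) can be sketched with entropic OT solved by Sinkhorn iterations. This is an illustrative sketch, not the paper's implementation: uniform marginals, the entropic regularizer `epsilon`, the row renormalization after masking, and all function names are assumptions.

```python
import math


def sinkhorn_plan(similarity, epsilon=0.5, iterations=200):
    """Entropic OT plan between tokens (rows) and subspaces (columns).

    similarity[i][j] is a token-to-subspace affinity (e.g. cosine similarity).
    Uniform marginals are assumed, so every subspace receives equal total mass.
    """
    n, m = len(similarity), len(similarity[0])
    kernel = [[math.exp(s / epsilon) for s in row] for row in similarity]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def top_k_mask(plan, k):
    """Keep each token's k largest assignments and renormalize the row (assumed)."""
    sparse = []
    for row in plan:
        keep = set(sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k])
        kept = [value if j in keep else 0.0 for j, value in enumerate(row)]
        total = sum(kept)
        sparse.append([value / total for value in kept])
    return sparse
```

The column constraint is what prevents all tokens from collapsing onto one subspace: each subspace must absorb the same share of mass, so every subspace receives optimization signal, while the top-k mask then restricts each token to its few most relevant subspaces.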
Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only about 0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. Project page: https://seominseok0429.github.io/Upsample-Anything/
comment: 15 pages, 12 figures
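For readers unfamiliar with Joint Bilateral Upsampling, the classical operator that the Upsample Anything abstract builds on can be sketched in 1D as follows. This sketch uses a fixed isotropic kernel with hand-picked `sigma_s` and `sigma_r`; the learned anisotropic kernel and per-image optimization from the paper are not reproduced, and the guide signal is assumed to lie in [0, 1].

```python
import math

def joint_bilateral_upsample_1d(low_feat, guide, scale,
                                sigma_s=1.0, sigma_r=0.1, radius=2):
    """Upsample a 1D low-res signal using a high-res guide signal.

    len(guide) must equal len(low_feat) * scale.
    """
    out = []
    for p, g_p in enumerate(guide):
        p_low = p / scale          # position of p in low-res coordinates
        center = int(p_low)
        num = den = 0.0
        lo = max(0, center - radius)
        hi = min(len(low_feat), center + radius + 1)
        for q in range(lo, hi):
            g_q = guide[min(q * scale, len(guide) - 1)]  # guide value at the low-res site
            spatial = math.exp(-((q - p_low) ** 2) / (2 * sigma_s ** 2))
            rng = math.exp(-((g_q - g_p) ** 2) / (2 * sigma_r ** 2))
            num += spatial * rng * low_feat[q]
            den += spatial * rng
        out.append(num / den)
    return out
```

The range term suppresses contributions from low-resolution sites whose guide value differs from the target pixel, which is what keeps edges sharp in the upsampled output.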
Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding
Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding due to large-scale image and text pretraining. However, existing VLMs typically lack precise spatial understanding capabilities, as they are primarily tuned on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches have incorporated explicit 3D inputs such as point clouds or depth maps, but this necessitates additional depth sensors or pre-trained depth estimation models, which may yield defective results. In contrast, our work introduces a plug-and-play module that implicitly incorporates 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model. This integration provides the model with depth-aware visual representations, improving its ability to understand the geometric structure of the scene and the spatial relationships among objects from RGB images alone. We evaluate our method on a set of spatially challenging tasks in both simulation and the real world. Extensive evaluations show that our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
ConMamba: Contrastive Vision Mamba for Plant Disease Detection
Plant Disease Detection (PDD) is a key aspect of precision agriculture. However, existing deep learning methods often rely on extensively annotated datasets, which are time-consuming and costly to generate. Self-supervised Learning (SSL) offers a promising alternative by exploiting the abundance of unlabeled data. However, most existing SSL approaches suffer from high computational costs due to convolutional neural networks or transformer-based architectures. Additionally, they struggle to capture long-range dependencies in visual representation and rely on static loss functions that fail to align local and global features effectively. To address these challenges, we propose ConMamba, a novel SSL framework specially designed for PDD. ConMamba integrates the Vision Mamba Encoder (VME), which employs a bidirectional State Space Model (SSM) to capture long-range dependencies efficiently. Furthermore, we introduce a dual-level contrastive loss with dynamic weight adjustment to optimize local-global feature alignment. Experimental results on three benchmark datasets demonstrate that ConMamba significantly outperforms state-of-the-art methods across multiple evaluation metrics. This provides an efficient and robust solution for PDD.
Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection
Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network's ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Rigorous evaluations on the widely-used MVTec AD dataset demonstrate that PFADSeg exhibits excellent performance, achieving an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.
Q-SAM2: Accurate Quantization for Segment Anything Model 2
The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource-constrained devices. In this paper, we present Q-SAM2, an accurate low-bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, Q-SAM2 introduces two novel contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that Q-SAM2 achieves highly accurate inference with substantial efficiency gains, significantly surpassing state-of-the-art general QAT schemes, particularly in the ultra-low 2-bit regime. Specifically, Q-SAM2 achieves an accuracy gain of up to 9.7 ppt in J&F on the video segmentation benchmark and 7.3 ppt in mIoU for instance segmentation over the best competing QAT model, all while achieving an 8x reduction in model size compared to the BF16 baseline.
comment: 22 pages
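To make the role of a clipping factor in low-bit quantization concrete, here is a generic symmetric fake-quantization sketch. It assumes the clipping threshold is `clip_factor` times the standard deviation of the values; that rule and the function name are assumptions for illustration, and the momentum stabilization and learning of the factor described for Q-SAM2 are not modeled.

```python
def fake_quantize(values, bits=2, clip_factor=1.0):
    """Symmetric uniform fake-quantization with a statistics-based clipping threshold.

    Requires bits >= 2. In a QAT setup clip_factor would be learned;
    here it is a plain argument.
    """
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    threshold = clip_factor * std
    if threshold == 0.0:
        return list(values)
    qmax = 2 ** (bits - 1) - 1          # 1 for 2-bit, giving levels {-1, 0, 1}
    scale = threshold / qmax
    out = []
    for x in values:
        clipped = max(-threshold, min(threshold, x))  # outliers saturate at the threshold
        out.append(round(clipped / scale) * scale)
    return out
```

A smaller `clip_factor` sacrifices outliers to give the remaining values finer resolution, which is the trade-off a learned clipping factor is meant to tune.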
Beyond Complete Shapes: A Benchmark for Quantitative Evaluation of 3D Shape Surface Matching Algorithms
Finding correspondences between 3D deformable shapes is an important and long-standing problem in geometry processing, computer vision, graphics, and beyond. While various shape matching datasets exist, they are mostly static or limited in size, restricting their adaptation to different problem settings, including both full and partial shape matching. In particular, the existing partial shape matching datasets are small (fewer than 100 shapes) and thus unsuitable for data-hungry machine learning approaches. Moreover, the type of partiality present in existing datasets is often artificial and far from realistic. To address these limitations, we introduce a generic and flexible framework for the procedural generation of challenging full and partial shape matching datasets. Our framework allows the propagation of custom annotations across shapes, making it useful for various applications. By utilising our framework and manually creating cross-dataset correspondences between seven existing (complete geometry) shape matching datasets, we propose a new large benchmark BeCoS with a total of 2543 shapes. Based on this, we offer several challenging benchmark settings, covering both full and partial matching, for which we evaluate respective state-of-the-art methods as baselines.
From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code will be publicly available.
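The aggregation step described above, reading a spot-level value out of a dense per-pixel map, can be sketched as below. The circular spot shape, the use of a mean rather than a sum, and the function name are assumptions for illustration only.

```python
def aggregate_spot(dense_map, center, radius):
    """Mean of a dense per-pixel expression map inside a circular spot.

    dense_map[y][x] is the predicted value of one gene at pixel (x, y).
    """
    cx, cy = center
    total, count = 0.0, 0
    for y, row in enumerate(dense_map):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += value
                count += 1
    return total / count if count else 0.0
```

Because the spot is defined only at read-out time, the same dense map can be queried with any radius, which is how a dense predictor supports spots of varying sizes and scales.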
Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers
Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:020
Unsupervised and Source-Free Ranking of Biomedical Segmentation Models
Model transfer presents a solution to the challenges of segmentation in the biomedical community, where the immense cost of data annotation is a major bottleneck in the use of deep learning. At the same time, hundreds of models get trained on biomedical data, submitted to challenges, and posted in model zoos and repositories. A major hurdle to wider adoption of pre-trained models lies in the lack of methods for best model selection. While such methods have been proposed for classification models, semantic and instance segmentation model ranking remain largely unaddressed, especially in a practically important setting where no labels are available on the target dataset. Similarly, if unsupervised domain adaptation is used, practitioners are faced with the task of selecting the best adapted model without target domain labels. Building on previous work linking model generalisation and consistency under perturbation, we propose the first unsupervised and source-free transferability estimator for semantic and instance segmentation tasks. We evaluate on multiple segmentation problems across biomedical imaging, finding a strong correlation between the rankings based on our estimator and rankings based on target dataset performance.
comment: 24 pages, 6 figures
OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution
Denoising Diffusion Probabilistic Models (DDPMs) show promising potential in one-step Real-World Image Super-Resolution (Real-ISR). Current one-step Real-ISR methods typically inject the low-quality (LQ) image latent representation at the start or end timestep of the DDPM scheduler. Recent studies have begun to note that the LQ image latent and the pre-trained noisy latent representations are intuitively closer at a mid-timestep. However, a quantitative analysis of these latent representations remains lacking. Considering these latent representations can be decomposed into signal and noise, we propose a method based on the Signal-to-Noise Ratio (SNR) to pre-compute an average optimal mid-timestep for injection. To better approximate the pre-trained noisy latent representation, we further introduce the Latent Representation Refinement (LRR) loss via a LoRA-enhanced VAE encoder. We also fine-tune the backbone of the DDPM-based generative model using LoRA to perform one-step denoising at the average optimal mid-timestep. Based on these components, we present OMGSR, a GAN-based Real-ISR framework that employs a DDPM-based generative model as the generator and a DINOv3-ConvNeXt model with multi-level discriminator heads as the discriminator. We also propose the DINOv3-ConvNeXt DISTS (Dv3CD) loss, which is enhanced for structural perception at varying resolutions. Within the OMGSR framework, we develop OMGSR-S based on SD2.1-base. An ablation study confirms that our pre-computation strategy and LRR loss significantly improve the baseline. Comparative studies demonstrate that OMGSR-S achieves state-of-the-art performance across multiple metrics. Code is available at https://github.com/wuer5/OMGSR.
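The SNR-based timestep selection can be illustrated with the standard DDPM identity SNR(t) = ᾱ_t / (1 − ᾱ_t). The sketch below assumes a plain linear beta schedule and a given target SNR for the LQ latent; the schedule, the way the target SNR would be estimated, and the function names are assumptions and may differ from what OMGSR actually uses.

```python
import math

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def closest_timestep(target_snr, abar):
    """Index t whose SNR(t) = abar[t] / (1 - abar[t]) is closest to target_snr in log space."""
    log_target = math.log(target_snr)
    return min(range(len(abar)),
               key=lambda t: abs(math.log(abar[t] / (1.0 - abar[t])) - log_target))
```

Averaging the selected timestep over a set of LQ latents would give a single pre-computed mid-timestep of the kind the abstract describes.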
Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance
Current image super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and uses them to guide full-image super-resolution. This ensures high fidelity and readability. To support comprehensive training and evaluation, we present the UZ-ST (UltraZoom-Scene Text) dataset, the first Chinese scene text dataset with extreme zoom. Extensive experiments show TIGER achieves state-of-the-art performance, enhancing readability and image quality.
Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding
Text-to-Motion generation has become a fundamental task in human-machine interaction, enabling the synthesis of realistic human motions from natural language descriptions. Although recent advances in large language models and reinforcement learning have contributed to high-quality motion generation, two major challenges remain. Existing approaches often fail to capture the temporal and causal complexities inherent in natural language, leading to oversimplified or incoherent motions. Additionally, RL-based methods are frequently overly complex, hindering their scalability and adaptability across various motion generation tasks. To address these challenges, we propose Motion-R1, a novel framework that combines decomposed Chain-of-Thought reasoning with reinforcement learning to enhance both the quality and interpretability of generated motions. Specifically, we introduce the Decomposed CoT Data Engine, which leverages an automated pipeline to synthesize high-quality reasoning data, allowing the model to better capture the temporal dependencies and causal relationships of human motion. We also propose RL Binding, a reinforcement learning strategy that incorporates multi-modal text-motion alignment into the RL reward function, guiding the model to produce motions that are both semantically accurate and motionally realistic. Extensive experiments across benchmark datasets demonstrate that Motion-R1 achieves state-of-the-art performance, with a 3.5% improvement in MM-Dist on HumanML3D and improvements in R-Precision and FID on KIT-ML and BABEL, surpassing existing methods across key metrics and highlighting its superior capability in handling complex motion generation tasks. Project page: https://motion-r1.github.io/.
Directed-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles via Proactive Attention ICRA'25
Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAV) to enhance an ego vehicle's field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle's 360-degree perceptual range almost equally, which faces two key challenges. Firstly, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Secondly, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Directed-CP, a proactive and direction-aware CP system aimed at improving CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its directions of interest and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on the ego vehicle's directional priorities, communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8% higher local perception accuracy in the directions of interest and 2.5% higher overall perception accuracy than the state-of-the-art methods in collaborative 3D object detection tasks. Codes are available at https://github.com/yihangtao/Directed-CP.git.
comment: Accepted by ICRA'25
Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in Re-ID tasks. In this work, we first analyze the role of prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the-art performance.
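A prototypical contrastive loss in its common InfoNCE form can be sketched as follows: each identity is represented by the normalized mean of its features, and a feature is scored against all prototypes with a temperature-scaled softmax. The temperature value, the mean-based prototypes, and the helper names are assumptions; the paper's exact formulation may differ.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def class_prototypes(features, labels):
    """Mean feature per identity, L2-normalized."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for d, x in enumerate(f):
            acc[d] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: _normalize([s / counts[y] for s in acc]) for y, acc in sums.items()}

def pcl_loss(feature, label, prototypes, temperature=0.05):
    """Negative log-softmax of the feature's similarity to its own prototype."""
    f = _normalize(feature)
    logits = {y: sum(a * b for a, b in zip(f, c)) / temperature
              for y, c in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits.values()))
    return log_denom - logits[label]
```

No text prompt appears anywhere in this loss, which is consistent with the abstract's claim that the image encoder can be fine-tuned without prompt learning.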
K-FACE: A Large-Scale KIST Face Database in Consideration with Unconstrained Environments
In this paper, we introduce a new large-scale face database from KIST, denoted as K-FACE, and describe a novel capturing device specifically designed to obtain the data. The K-FACE database contains more than 1 million high-quality images of 1,000 subjects selected by considering the ratio of gender and age groups. It includes a variety of attributes, including 27 poses, 35 lighting conditions, three expressions, and occlusions by the combination of five types of accessories. As the K-FACE database is systematically constructed through a hemispherical capturing system with elaborate lighting control and multiple cameras, it is possible to accurately analyze the effects of factors that cause performance degradation, such as poses, lighting changes, and accessories. We consider not only the balance of external environmental factors, such as pose and lighting, but also the balance of personal characteristics such as gender and age group. The gender ratio is the same, while the age groups of subjects are uniformly distributed from the 20s to 50s for both genders. The K-FACE database can be extensively utilized in various vision tasks, such as face recognition, face frontalization, illumination normalization, face age estimation, and three-dimensional face model generation. We expect systematic diversity and uniformity of the K-FACE database to promote these research fields.
comment: 8 pages, 8 figures
Roadside Monocular 3D Detection Prompted by 2D Detection WACV 2026
Roadside monocular 3D detection requires detecting objects of predefined classes in an RGB frame and predicting their 3D attributes, such as bird's-eye-view (BEV) locations. It has broad applications in traffic control, vehicle-vehicle communication, and vehicle-infrastructure cooperative perception. To address this task, we introduce Promptable 3D Detector (Pro3D), a novel detector design that leverages 2D detections as prompts. We build our Pro3D upon two key insights. First, compared to a typical 3D detector, a 2D detector is "easier" to train due to fewer loss terms and performs significantly better at localizing objects w.r.t. 2D metrics. Second, once 2D detections precisely locate objects in the image, a 3D detector can focus on lifting these detections into 3D BEV, especially when fixed camera pose or scene geometry provides an informative prior. To encode and incorporate 2D detections, we explore three methods: (a) concatenating features from both 2D and 3D detectors, (b) attentively fusing 2D and 3D detector features, and (c) encoding properties of predicted 2D bounding boxes {x, y, width, height, label} and attentively fusing them with the 3D detector feature. Interestingly, the third method significantly outperforms the others, underscoring the effectiveness of 2D detections as prompts that offer precise object targets and allow the 3D detector to focus on lifting them into 3D. Pro3D is adaptable for use with a wide range of 2D and 3D detectors with minimal modifications. Comprehensive experiments demonstrate that our Pro3D significantly enhances existing methods, achieving state-of-the-art results on two contemporary benchmarks.
comment: Accepted by WACV 2026
Monocular Person Localization under Camera Ego-motion IROS2025
Localizing a person from a moving monocular camera is critical for Human-Robot Interaction (HRI). To estimate the 3D human position from a 2D image, existing methods either depend on the geometric assumption of a fixed camera or use a position regression model trained on datasets containing little camera ego-motion. These methods are vulnerable to severe camera ego-motion, resulting in inaccurate person localization. We consider person localization as a part of a pose estimation problem. By representing a human with a four-point model, our method jointly estimates the 2D camera attitude and the person's 3D location through optimization. Evaluations on both public datasets and real robot experiments demonstrate our method outperforms baselines in person localization accuracy. Our method is further implemented into a person-following system and deployed on an agile quadruped robot.
comment: Accepted by IROS2025. Project page: https://medlartea.github.io/rpf-quadruped/
COLI: A Hierarchical Efficient Compressor for Large Images
The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs' transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.
Sim-DETR: Unlock DETR for Temporal Sentence Grounding ICCV 2025
Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.
comment: This work is accepted by ICCV 2025
Benchmarking Endoscopic Surgical Image Restoration and Beyond
In endoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degradations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering endoscopic environments, called SurgClean, which involves multi-type image restoration tasks from two medical sites, i.e., desmoking, defogging, and desplashing. SurgClean comprises 3,113 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and report the performance of 22 representative image restoration approaches, including 12 generic and 10 task-specific ones. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic understanding perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to empower restoration algorithms and improve the efficiency of clinical procedures.
Prompt-guided Disentangled Representation for Action Recognition
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces a Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found at https://github.com/iamsnaping/ProDA.git
PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation
Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and balances personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.
comment: 46 pages, 31 figures
Sketch-1-to-3: One Single Sketch to 3D Detailed Face Reconstruction ACM MM
3D face reconstruction from a single sketch is a critical yet underexplored task with significant practical applications. The primary challenges stem from the substantial modality gap between 2D sketches and 3D facial structures, including: (1) accurately extracting facial keypoints from 2D sketches; (2) preserving diverse facial expressions and fine-grained texture details; and (3) training a high-performing model with limited data. In this paper, we propose Sketch-1-to-3, a novel framework for realistic 3D face reconstruction from a single sketch, to address these challenges. Specifically, we first introduce the Geometric Contour and Texture Detail (GCTD) module, which enhances the extraction of geometric contours and texture details from facial sketches. Additionally, we design a deep learning architecture with a domain adaptation module and a tailored loss function to align sketches with the 3D facial space, enabling high-fidelity expression and texture reconstruction. To facilitate evaluation and further research, we construct SketchFaces, a real hand-drawn facial sketch dataset, and Syn-SketchFaces, a synthetic facial sketch dataset. Extensive experiments demonstrate that Sketch-1-to-3 achieves state-of-the-art performance in sketch-based 3D face reconstruction.
comment: Accepted by ACM MMAsia 2025
Image and Video Processing
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.
comment: Preprint
Neural B-Frame Coding: Tackling Domain Shift Issues with Lightweight Online Motion Resolution Adaptation
Learned B-frame codecs with hierarchical temporal prediction often encounter the domain-shift issue due to mismatches between the Group-of-Pictures (GOP) sizes for training and testing, leading to inaccurate motion estimates, particularly for large motion. A common solution is to turn large motion into small motion by downsampling video frames during motion estimation. However, determining the optimal downsampling factor typically requires costly rate-distortion optimization. This work introduces lightweight classifiers to predict downsampling factors. These classifiers leverage simple state signals from current and reference frames to balance rate-distortion performance with computational cost. Three variants are proposed: (1) a binary classifier (Bi-Class) trained with Focal Loss to choose between high and low resolutions, (2) a multi-class classifier (Mu-Class) trained with novel soft labels based on rate-distortion costs, and (3) a co-class approach (Co-Class) that combines the predictive capability of the multi-class classifier with the selective search of the binary classifier. All classifier methods can work seamlessly with existing B-frame codecs without requiring codec retraining. Experimental results show that they achieve coding performance comparable to exhaustive search methods while significantly reducing computational complexity. The code is available at: https://github.com/NYCU-MAPL/Fast-OMRA.git.
comment: Accepted by TCAS-II: Express Briefs
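The Bi-Class variant in the abstract above is trained with Focal Loss. As a rough illustration only, the following sketch shows the standard binary focal loss for a single prediction. The function name, the default values of `gamma` and `alpha`, and the example probabilities are assumptions for illustration, not settings taken from the paper.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one sample.

    p is the predicted probability of class 1, y is the true label (0 or 1).
    gamma and alpha are illustrative defaults, not the paper's settings.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma down-weights samples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct
hard = binary_focal_loss(0.30, 1)  # wrong-leaning prediction
print(easy < hard)
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is one way to sanity-check an implementation.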
Evaluation of Hardware-based Video Encoders on Modern GPUs for UHD Live-Streaming
Many GPUs have incorporated hardware-accelerated video encoders, which allow video encoding tasks to be offloaded from the main CPU and provide higher power efficiency. Over the years, support for many new video codecs such as H.265/HEVC, VP9, and AV1 has been added to the latest GPU boards. Recently, the rise of live video content such as VTuber streams, game live-streaming, and live event broadcasts has driven demand for high-efficiency hardware encoders in GPUs to handle these real-time video encoding tasks, especially at higher resolutions such as 4K/8K UHD. In this paper, the RD performance, encoding speed, and power consumption of hardware encoders in several generations of NVIDIA and Intel GPUs, as well as Qualcomm Snapdragon mobile SoCs, were evaluated and compared to their software counterparts, including the latest H.266/VVC codec, using several metrics including PSNR, SSIM, and the machine-learning-based VMAF. The results show that modern GPU hardware encoders can match the RD performance of software encoders in real-time encoding scenarios, and that while encoding speed increased in newer hardware, the RD performance improvement between hardware generations is mostly negligible. Finally, the bitrate required for each hardware encoder to match YouTube transcoding quality was also calculated.
comment: The 33rd International Conference on Computer Communications and Networks (ICCCN 2024), 29-31 July 2024, Big Island, Hawaii, USA
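PSNR is one of the quality metrics named in the abstract above. The following is a minimal sketch of the usual definition on flat lists of pixel values, assuming 8-bit samples (`max_value=255`). The function name and the sample values are illustrative and do not come from the paper's evaluation.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([100, 100], [110, 90]))
```

In practice PSNR is computed per frame (often on the luma plane) and averaged over a sequence; that aggregation step is left out here.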
Data Augmentation Strategies for Robust Lane Marking Detection
Robust lane detection is essential for advanced driver assistance and autonomous driving, yet models trained on public datasets such as CULane often fail to generalise across different camera viewpoints. This paper addresses the challenge of domain shift for side-mounted cameras used in lane-wheel monitoring by introducing a generative AI-based data enhancement pipeline. The approach combines geometric perspective transformation, AI-driven inpainting, and vehicle body overlays to simulate deployment-specific viewpoints while preserving lane continuity. We evaluated the effectiveness of the proposed augmentation on two state-of-the-art models, SCNN and UFLDv2. When trained on the augmented data, both models show improved robustness to different conditions, including shadows. The experimental results demonstrate gains in precision, recall, and F1 score compared to the pre-trained models. By bridging the gap between widely available datasets and deployment-specific scenarios, our method provides a scalable and practical framework to improve the reliability of lane detection in a pilot deployment scenario.
comment: 8 figures, 2 tables, 10 pages, ACRA, Australasian conference on robotics and automation
Equivariant Deep Equilibrium Models for Imaging Inverse Problems
Equivariant imaging (EI) enables training signal reconstruction models without requiring ground truth data by leveraging signal symmetries. Deep equilibrium models (DEQs) are a powerful class of neural networks where the output is a fixed point of a learned operator. However, training DEQs with complex EI losses requires implicit differentiation through fixed-point computations, whose implementation can be challenging. We show that backpropagation can be implemented modularly, simplifying training. Experiments demonstrate that DEQs trained with implicit differentiation outperform those trained with Jacobian-free backpropagation and other baseline methods. Additionally, we find evidence that EI-trained DEQs approximate the proximal map of an invariant prior.
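To make the contrast between implicit differentiation and Jacobian-free backpropagation in the abstract above concrete, here is a toy scalar sketch. The operator `z = tanh(w * z + x)`, the function names, and the parameter values are all assumptions chosen for illustration; the paper's models and losses are far richer.

```python
import math


def solve_fixed_point(w, x, tol=1e-14, max_iter=1000):
    """Iterate z <- tanh(w * z + x) until it stops changing (contractive for |w| < 1)."""
    z = 0.0
    for _ in range(max_iter):
        z_next = math.tanh(w * z + x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


def implicit_grad_w(w, x):
    """dz*/dw from the implicit function theorem: (df/dw) / (1 - df/dz) at z*."""
    z = solve_fixed_point(w, x)
    sech2 = 1.0 - z * z  # tanh'(w*z + x), using z = tanh(w*z + x) at the fixed point
    return sech2 * z / (1.0 - sech2 * w)


def jacobian_free_grad_w(w, x):
    """Jacobian-free approximation: keep df/dw and drop the inverse term."""
    z = solve_fixed_point(w, x)
    return (1.0 - z * z) * z


print(implicit_grad_w(0.5, 0.3), jacobian_free_grad_w(0.5, 0.3))
```

In this toy case the implicit gradient matches a finite-difference estimate of how the fixed point moves with `w`, while the Jacobian-free value is only an approximation of it.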
The Selective Disk Bispectrum and Its Inversion, with Application to Multi-Reference Alignment
In many computer vision and shape analysis tasks, practitioners are interested in learning from the shape of the object in an image, while disregarding the object's orientation. To this end, it is valuable to define a rotation-invariant representation of images, retaining all information about that image, but disregarding the way an object is rotated in the frame. To be practical for learning tasks, this representation must be computationally efficient for large datasets and invertible, so the representation can be visualized in image space. We therefore present the selective disk bispectrum: a fast, rotation-invariant representation for image shape analysis. While the translational bispectrum has long been used as a translation-invariant representation for 1-D and 2-D signals, its extension to 2-D (disk) rotational invariance on images has been hindered by the absence of an invertible formulation and its cubic complexity. In this work, we derive an explicit inverse for the disk bispectrum, which allows us to define a "selective" disk bispectrum that uses only the minimal number of coefficients needed for faithful shape recovery. We show that this representation enables multi-reference alignment for rotated images, a task previously intractable for disk bispectrum methods. These results establish the disk bispectrum as a practical and theoretically grounded tool for learning on rotation-invariant shape data.
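For readers unfamiliar with the translational bispectrum that the abstract above builds on, the following sketch computes the classical 1-D version, B(k1, k2) = F(k1) F(k2) conj(F(k1 + k2)), and checks that it is unchanged by a circular shift. This is the standard 1-D construction, not the paper's selective disk bispectrum. The signal values and helper names are illustrative assumptions.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def bispectrum(signal):
    """Full 1-D bispectrum: B[k1][k2] = F[k1] * F[k2] * conj(F[(k1 + k2) mod n])."""
    n = len(signal)
    f = dft(signal)
    return [
        [f[k1] * f[k2] * f[(k1 + k2) % n].conjugate() for k2 in range(n)]
        for k1 in range(n)
    ]


def max_abs_diff(a, b):
    """Largest entrywise difference between two bispectrum matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


signal = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0]
shifted = signal[2:] + signal[:2]  # circular shift by two samples

# The shift changes the Fourier phases, but the phase factors cancel in the triple product.
print(max_abs_diff(bispectrum(signal), bispectrum(shifted)))
```

The full bispectrum has a number of entries quadratic in the signal length, which hints at why a "selective" subset of coefficients is attractive.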
Development of a fully deep learning model to improve the reproducibility of sector classification systems for predicting unerupted maxillary canine likelihood of impaction
Objectives. The aim of the present study was to develop a fully deep learning model to improve the intra- and inter-operator reproducibility of sector classification systems for predicting unerupted maxillary canine likelihood of impaction. Methods. Three orthodontists (Os) and three general dental practitioners (GDPs) classified the position of unerupted maxillary canines on 306 radiographs (T0) according to three different sector classification systems (5-, 4-, and 3-sector classification systems). The assessment was repeated after four weeks (T1). Intra- and inter-observer agreement were evaluated with Cohen's K and Fleiss K, and between-group differences with a z-test. The same radiographs were tested on different artificial intelligence (AI) models, pre-trained on an extended dataset of 1,222 radiographs. The best-performing model was identified based on its sensitivity and precision. Results. The 3-sector system was found to be the classification method with the highest reproducibility, with agreement (Cohen's K values) between observations (T0 versus T1) for each examiner ranging from 0.80 to 0.92, and an overall agreement of 0.85 [95% confidence interval (CI) = 0.83-0.87]. The overall inter-observer agreement (Fleiss K) ranged from 0.69 to 0.7. The educational background did not affect either intra- or inter-observer agreement (p>0.05). DenseNet121 proved to be the best-performing model in allocating impacted canines in the three different classes, with an overall accuracy of 76.8%. Conclusion. AI models can be designed to automatically classify the position of unerupted maxillary canines.
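The intra-observer agreement in the abstract above is reported as Cohen's K. The following sketch shows how that statistic is computed for two rating sessions by the same examiner. The sector labels in `t0` and `t1` are made-up illustrative data, not values from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(ratings_a)
    # Observed proportion of items given the same label in both sessions.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from the marginal label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sessions used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical 3-sector labels for ten radiographs at T0 and T1.
t0 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
t1 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 2]
print(round(cohens_kappa(t0, t1), 3))
```

Fleiss K, used in the study for agreement among several observers, generalizes the same idea to more than two raters and is not shown here.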
Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment
The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the $\Delta$F1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.
comment: 5 pages, 7 figures
Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation
Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible metric of eye openness derived from 3D facial landmarks. Unlike conventional binary eye state estimators or 2D measures, such as the Eye Aspect Ratio (EAR), the ELA provides a stable geometric description of eyelid motion that is robust to variations in camera angle. Using the ELA, we design a blink detection framework that extracts temporal characteristics, including the closing, closed, and reopening durations, which are shown to correlate with drowsiness levels. To address the scarcity and risk of collecting natural drowsiness data, we further leverage ELA signals to animate rigged avatars in Blender 3D, enabling the creation of realistic synthetic datasets with controllable noise, camera viewpoints, and blink dynamics. Experimental results in public driver monitoring datasets demonstrate that the ELA offers lower variance under viewpoint changes compared to EAR and achieves accurate blink detection. At the same time, synthetic augmentation expands the diversity of training data for drowsiness recognition. Our findings highlight the ELA as both a reliable biometric measure and a powerful tool for generating scalable datasets in driver state monitoring.
comment: 8 pages, 5 figures, 3 tables
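The abstract above contrasts the proposed Eyelid Angle with the 2D Eye Aspect Ratio (EAR). The following sketch computes the commonly used six-landmark EAR and shows how a crude horizontal foreshortening, standing in for a change of camera yaw, alters its value even though the eye opening is unchanged. The landmark coordinates and the foreshortening factor are illustrative assumptions. The Eyelid Angle itself is not reproduced, because its definition is not given in the abstract.

```python
import math


def eye_aspect_ratio(landmarks):
    """EAR from six 2D eye landmarks ordered p1..p6.

    p1 and p4 are the eye corners, p2 and p3 lie on the upper lid,
    p6 and p5 are the matching points on the lower lid.
    """
    p1, p2, p3, p4, p5, p6 = landmarks
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


# Hypothetical open eye seen from the front.
frontal = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 0.0), (2.0, -1.0), (1.0, -1.0)]
# Same eye with the horizontal axis compressed, as a rough model of a yawed camera.
yawed = [(0.5 * x, y) for x, y in frontal]

print(eye_aspect_ratio(frontal), eye_aspect_ratio(yawed))
```

The two values differ because EAR normalizes by an image-plane width that shrinks under the assumed foreshortening, which is the kind of viewpoint sensitivity the abstract attributes to 2D measures.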
The Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks
Pose estimation is a general problem in computer vision with wide applications. The relative orientation of a 3D reference object can be determined from a 3D rotated version of that object, or from a projection of the rotated object to a 2D planar image. This projection can be a perspective projection (the PnP problem) or an orthographic projection (the OnP problem). We restrict our attention here to the OnP problem and the full 3D pose estimation task (the EnP problem). Here we solve the least squares systems for both the error-free EnP and OnP problems in terms of the determinant ratio matrix (DRaM) approach. The noisy-data case can be addressed with a straightforward rotation correction scheme. While the SVD and optimal quaternion eigensystem methods solve the noisy EnP 3D-3D alignment exactly, the noisy 3D-2D orthographic (OnP) task has no known comparable closed form, and can be solved by DRaM-class methods. We note that while previous similar work has been presented in the literature exploiting both the QR decomposition and the Moore-Penrose pseudoinverse transformations, here we place these methods in a larger context that has not previously been fully recognized in the absence of the corresponding DRaM solution. We term this class of solutions the DRaM family, and compare the behavior of the families of solutions for the EnP and OnP rotation estimation problems. Overall, this work presents a new solution to the 3D and 2D orthographic pose estimation problems and provides valuable insight into these classes of problems. With hindsight, we are able to show that our DRaM solutions to the exact EnP and OnP problems possess derivations that could have been discovered in the time of Gauss, and in fact generalize to all analogous N-dimensional Euclidean pose estimation problems.
comment: 12 pages of main text, 3 figures, 31 pages total (including references and 2 appendices, one with algorithm-defining source code)
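The abstract above mentions the SVD method as an existing exact solver for 3D-3D rotation alignment. As background only, the following sketch shows that classical SVD-based solution (often called the Kabsch method) for a pure rotation between matched point sets. It is not the paper's DRaM approach. The function name, the random test points, and the chosen angles are illustrative assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def svd_rotation(reference, rotated):
    """Least-squares rotation R with rotated[i] ~= R @ reference[i] (rows are 3D points)."""
    h = reference.T @ rotated                  # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))     # guard against returning a reflection
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


# Build a known rotation (about z, then x) and apply it to random reference points.
c, s = np.cos(0.7), np.sin(0.7)
rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
c2, s2 = np.cos(-0.4), np.sin(-0.4)
rx = np.array([[1.0, 0.0, 0.0], [0.0, c2, -s2], [0.0, s2, c2]])
r_true = rz @ rx

rng = np.random.default_rng(1)
points = rng.normal(size=(10, 3))
moved = points @ r_true.T                      # each row becomes r_true @ point

r_est = svd_rotation(points, moved)
print(np.allclose(r_est, r_true))
```

Translation is ignored here; when the point sets are also shifted, the usual practice is to subtract each set's centroid before forming the cross-covariance.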
A CNN-Based Technique to Assist Layout-to-Generator Conversion for Analog Circuits
We propose a technique to assist in converting a reference layout of an analog circuit into a procedural layout generator by efficiently reusing available generators for sub-cell creation. The proposed convolutional neural network (CNN) model automatically detects sub-cells that can be generated by available generator scripts in the library, and suggests using them in the hierarchically correct places of the generator software. In experiments, the CNN model examined sub-cells of a high-speed wireline receiver that has a total of 4,885 sub-cell instances, including 145 different sub-cell designs. The CNN model classified the sub-cell instances into 51 generatable classes and one not-generatable class. The not-generatable class indicates that no available generator can generate the classified sub-cell. The CNN model achieved 99.3% precision in examining the 145 different sub-cell designs. The CNN model greatly reduced the examination time, from the 88 minutes required for manual examination to 18 seconds. Also, the proposed CNN model could correctly classify unfamiliar sub-cells that are very different from the training dataset.
Smooth optimization using global and local low-rank regularizers
Many inverse problems and signal processing problems involve low-rank regularizers based on the nuclear norm. Commonly, proximal gradient methods (PGMs) are adopted to solve this type of non-smooth problem, as they can offer fast and guaranteed convergence. However, PGMs cannot be simply applied in settings where low-rank models are imposed locally on overlapping patches; therefore, heuristic approaches have been proposed that lack convergence guarantees. In this work we propose to replace the nuclear norm with a smooth approximation in which a Huber-type function is applied to each singular value. By providing a theoretical framework based on singular value function theory, we show that important properties can be established for the proposed regularizer, such as convexity, differentiability, and Lipschitz continuity of the gradient. Moreover, we provide a closed-form expression for the regularizer gradient, enabling the use of standard iterative gradient-based optimization algorithms (e.g., nonlinear conjugate gradient) that can easily address the case of overlapping patches and have well-known convergence guarantees. In addition, we provide a novel step-size selection strategy based on a quadratic majorizer of the line-search function that leverages the Huber characteristics of the proposed regularizer. Finally, we assess the proposed optimization framework by providing empirical results in dynamic magnetic resonance imaging (MRI) reconstruction in the context of local low-rank models with overlapping patches.
comment: 41 pages, 7 figures
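The idea in the abstract above of applying a Huber-type function to each singular value can be sketched as follows. This assumes the standard Huber function with threshold `delta` and uses the general result that the gradient of a smooth spectral function is U diag(h'(sigma)) V^T. The exact function and parametrization used in the paper may differ; the names and the value of `delta` are illustrative, and NumPy is assumed to be available.

```python
import numpy as np


def huber(s, delta):
    """Standard Huber function: quadratic below delta, linear above."""
    return np.where(s <= delta, s ** 2 / (2.0 * delta), s - delta / 2.0)


def huber_prime(s, delta):
    """Derivative of the Huber function for non-negative s."""
    return np.minimum(s / delta, 1.0)


def smooth_nuclear_norm(x, delta):
    """Smoothed nuclear norm sum_i huber(sigma_i) and its gradient with respect to x."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    value = float(np.sum(huber(s, delta)))
    # Scale column j of U by h'(sigma_j), then multiply by V^T.
    grad = (u * huber_prime(s, delta)) @ vt
    return value, grad


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
value, grad = smooth_nuclear_norm(x, delta=1.0)
print(value, np.linalg.svd(x, compute_uv=False).sum())
```

Because the gradient is available in closed form, a local low-rank penalty over overlapping patches can be differentiated by summing the patch gradients back into the image, which is what lets generic gradient-based solvers be used.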
Splats in Splats: Robust and Effective 3D Steganography towards Gaussian Splatting AAAI 2026
3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical deployment. Here we describe splats in splats, the first 3DGS steganography framework that embeds 3D content in 3DGS itself without modifying any attributes. To achieve this, we take a deep insight into spherical harmonics (SH) and devise an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. Furthermore, we employ a convolutional autoencoder to establish a mapping between the original Gaussian primitives' opacity and the hidden Gaussian primitives' opacity. Extensive experiments indicate that our method significantly outperforms existing 3D steganography techniques, with 5.31% higher scene fidelity and 3x faster rendering speed, while ensuring security, robustness, and user experience.
comment: Accepted by AAAI 2026
Can Knowledge Improve Security? A Coding-Enhanced Jamming Approach for Semantic Communication
As semantic communication (SemCom) attracts growing attention as a novel communication paradigm, ensuring the security of transmitted semantic information over open wireless channels has become a critical issue. However, traditional encryption methods often introduce significant additional communication overhead to maintain reliability, and conventional learning-based secure SemCom methods typically rely on a channel capacity advantage for the legitimate receiver, which is challenging to guarantee in real-world scenarios. In this paper, we propose a coding-enhanced jamming method that eliminates the need to transmit a secret key by utilizing shared knowledge, which may be part of the training set of the SemCom system, between the legitimate receiver and the transmitter. Specifically, we leverage the shared private knowledge base to generate a set of private digital codebooks in advance using neural network (NN)-based encoders. For each transmission, we encode the transmitted data into a digital sequence Y1 and associate Y1 with a sequence randomly picked from the private codebook, denoted as Y2, through superposition coding. Here, Y1 serves as the outer code and Y2 as the inner code. By optimizing the power allocation between the inner and outer codes, the legitimate receiver can reconstruct the transmitted data using successive decoding based on the shared index of Y2, while the eavesdropper's decoding performance is severely degraded, potentially to the point of random guessing. Experimental results demonstrate that our method achieves security comparable to state-of-the-art approaches while significantly improving the reconstruction performance of the legitimate receiver by more than 1 dB across varying channel signal-to-noise ratios (SNRs) and compression ratios.
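The roles of power allocation and the shared sequence Y2 in the abstract above can be illustrated with a deliberately simplified toy. It assumes noiseless BPSK symbols, a receiver that knows Y2 exactly, and an eavesdropper that thresholds the received signal directly. None of these modelling choices, function names, or parameter values come from the paper, which uses NN-based encoders and evaluates noisy channels.

```python
import math
import random


def superpose(y1, y2, power_fraction):
    """Superposition of two unit-power sequences; power_fraction goes to y1."""
    a = math.sqrt(power_fraction)
    b = math.sqrt(1.0 - power_fraction)
    return [a * s1 + b * s2 for s1, s2 in zip(y1, y2)]


def decode_with_key(received, y2, power_fraction):
    """Receiver that knows y2: cancel it, then decide on the sign of what remains."""
    b = math.sqrt(1.0 - power_fraction)
    return [1 if r - b * s2 >= 0 else -1 for r, s2 in zip(received, y2)]


def decode_blind(received):
    """Receiver without y2: decide on the sign of the raw received symbol."""
    return [1 if r >= 0 else -1 for r in received]


rng = random.Random(0)
y1 = [rng.choice((-1, 1)) for _ in range(1000)]  # data sequence
y2 = [rng.choice((-1, 1)) for _ in range(1000)]  # sequence from the private codebook
rx = superpose(y1, y2, power_fraction=0.2)

print(decode_with_key(rx, y2, 0.2) == y1, decode_blind(rx) == y2)
```

With most of the power on Y2, the blind decision simply follows Y2, which is independent of the data, so the toy eavesdropper learns nothing about Y1 while the keyed receiver recovers it exactly.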
OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation
The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib's PSFs, assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.
Full-scale Representation Guided Network for Retinal Vessel Segmentation
The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full-Scale Guided Network (FSG-Net), where a novel feature representation module using modernized convolution blocks effectively captures full-scale structural information, while a guided convolution block subsequently refines this information. Specifically, we introduce an attention-guided filter within the guided convolution block, leveraging its similarity to unsharp masking to enhance fine vascular structures. Passing full-scale information to the attention block facilitates the generation of more contextually relevant attention maps, which are then passed to the attention-guided filter, providing further refinement to the segmentation performance. The structure preceding the guided convolution block can be replaced by any U-Net variant, ensuring flexibility and scalability across various segmentation tasks. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments demonstrate that, despite its compact architecture, FSG-Net delivers performance competitive with SOTA methods across multiple public datasets. Ablation studies further demonstrate that each proposed component meaningfully contributes to this competitive performance. Our code is available on https://github.com/ZombaSY/FSG-Net-pytorch.
comment: 12 pages, 7 figures
FCDM: A Physics-Guided Bidirectional Frequency Aware Convolution and Diffusion-Based Model for Sinogram Inpainting
Computed tomography (CT) is widely used in scientific imaging systems such as synchrotron and laboratory-based nano-CT, but acquiring full-view sinograms requires high radiation dose and long scan times. Sparse-view CT alleviates this burden but yields incomplete sinograms with structured signal loss, hampering accurate reconstruction. Unlike RGB images, sinograms encode overlapping features along projection paths and exhibit distinct directional spectral patterns, which make conventional RGB-oriented inpainting approaches, including diffusion models, ineffective for sinogram restoration, as they disregard the angular dependencies and physical constraints inherent to tomographic data. To overcome these limitations, we propose FCDM, a diffusion-based framework tailored for sinograms, which restores global structure through bidirectional frequency reasoning and angular-aware masking, while enforcing physical plausibility via physics-guided constraints and frequency-adaptive noise control. Experiments on real-world datasets show that FCDM consistently outperforms baselines, achieving SSIM over 0.93 and PSNR above 31 dB across diverse sparse-view scenarios.
FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems
We present FMPlug, a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. Unlike traditional approaches that rely on domain-specific or untrained priors, FMPlug smartly leverages two simple but powerful insights: the similarity between observed and desired objects and the Gaussianity of generative flows. By introducing a time-adaptive warm-up strategy and sharp Gaussianity regularization, FMPlug unlocks the true potential of domain-agnostic foundation models. Our method beats state-of-the-art methods that use foundation FM priors by significant margins, on image super-resolution and Gaussian deblurring.
Enhancing Medical Image Analysis through Geometric and Photometric transformations
Medical image analysis suffers from a lack of labeled data due to several challenges, including patient privacy and a shortage of experts. Because many AI models perform well only with large amounts of data, we turn to data augmentation, which can improve model performance and increase dataset size through traditional or advanced techniques. In this paper, we evaluate the effectiveness of data augmentation techniques on two different medical image datasets. In the first step, we applied transformation techniques to a skin cancer dataset containing benign and malignant classes. We then trained a convolutional neural network (CNN) on the dataset before and after augmentation; augmentation improved test accuracy from 90.74% to 96.88% and decreased test loss from 0.7921 to 0.1468. In the second step, we applied the Mixup technique to a retina blood vessel dataset by mixing two random images and their corresponding masks, then trained a U-Net model; the Dice coefficient increased from 0 before augmentation to 0.4163 after augmentation. The results show the effect of using data augmentation to increase the dataset size on classification and segmentation performance.
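The second experiment in the abstract above mixes pairs of images together with their masks and reports a Dice coefficient. The following sketch shows both operations on flat pixel lists. The mixing weight drawn from a Beta distribution follows the usual Mixup recipe. The `alpha` value, the tiny example images, and the function names are illustrative assumptions, not the paper's settings.

```python
import random


def mixup(image_a, mask_a, image_b, mask_b, lam):
    """Convex combination of two images and of their masks with the same weight lam."""
    def blend(u, v):
        return [lam * x + (1.0 - lam) * y for x, y in zip(u, v)]
    return blend(image_a, image_b), blend(mask_a, mask_b)


def dice(pred, target, eps=1e-7):
    """Dice coefficient between two (soft or binary) masks given as flat lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


# Usual Mixup practice: draw the mixing weight from Beta(alpha, alpha).
lam = random.Random(0).betavariate(0.4, 0.4)
mixed_image, mixed_mask = mixup([0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0], lam)

print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
```

Mixing the masks with the same weight as the images produces soft targets, so a segmentation loss that accepts fractional labels (such as a soft Dice loss) is the natural companion.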
Multimodal Optical Imaging Platform for Quantitative Burn Assessment
Accurate assessment of burn severity at injury onset remains a major clinical challenge due to the lack of objective methods for detecting subsurface tissue damage. This limitation is critical in battlefield and mass-casualty settings, where rapid and reliable evaluation of burn depth is essential for triage and surgical decision-making. We present a multimodal optical imaging framework that establishes the foundation for a compact, low-size, weight, and power (low-SWaP) field-deployable device for quantitative burn assessment. The system integrates broadband hyperspectral imaging (VSWIR, 400-2100 nm) with laser speckle contrast imaging to jointly evaluate biochemical composition and microvascular perfusion. Using short-wave infrared (SWIR, >1000 nm) wavelengths, we developed and validated novel deep-tissue parameters linked to water, lipid, and collagen absorption features that enhance burn-tissue separability and burn severity classification. We implemented and validated unsupervised learning methods for spectral feature extraction, band down-selection, and clustering against histology, establishing a foundation for a rugged, data-driven device for early quantitative burn evaluation in austere environments.
Lessons Learned from Developing a Privacy-Preserving Multimodal Wearable for Local Voice-and-Vision Inference
Many promising applications of multimodal wearables require continuous sensing and heavy computation, yet users reject such devices due to privacy concerns. This paper shares our experiences building an ear-mounted voice-and-vision wearable that performs local AI inference using a paired smartphone as a trusted personal edge. We describe the hardware-software co-design of this privacy-preserving system, including challenges in integrating a camera, microphone, and speaker within a 30-gram form factor, enabling wake word-triggered capture, and running quantized vision-language and large-language models entirely offline. Through iterative prototyping, we identify key design hurdles in power budgeting, connectivity, latency, and social acceptability. Our initial evaluation shows that fully local multimodal inference is feasible on commodity mobile hardware with interactive latency. We conclude with design lessons for researchers developing embedded AI systems that balance privacy, responsiveness, and usability in everyday settings.
Multimedia
VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
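The abstract above converts (caption, score) trajectories into preference tuples and drops samples with JSON parsing errors, but does not spell out the pairing rule. The sketch below assumes one plausible rule, pairing the best-scoring caption of a trajectory against each lower-scoring one, purely for illustration. The record layout with `caption` and `score` keys, the function names, and the `margin` field are all hypothetical.

```python
import json


def parse_record(raw):
    """Parse one JSON record into (caption, score); return None if it is malformed."""
    try:
        record = json.loads(raw)
        return record["caption"], float(record["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None


def to_preference_pairs(raw_trajectory):
    """Hypothetical pairing rule: best-scoring caption is 'chosen', each worse one 'rejected'."""
    records = [r for r in map(parse_record, raw_trajectory) if r is not None]
    if len(records) < 2:
        return []
    best_caption, best_score = max(records, key=lambda r: r[1])
    return [
        {"chosen": best_caption, "rejected": caption, "margin": best_score - score}
        for caption, score in records
        if score < best_score
    ]


trajectory = [
    '{"caption": "a", "score": 40}',
    'broken{',
    '{"caption": "b", "score": 55}',
    '{"caption": "c", "score": 48}',
]
print(to_preference_pairs(trajectory))
```

A score margin like the one stored here is one possible signal for ordering pairs from easy to hard, which the abstract's curriculum would need in some form; how the paper actually defines difficulty is not stated.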
Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach
The widespread application of AIGC content has brought not only unprecedented opportunities, but also potential security concerns, e.g., audio-visual deepfakes. Therefore, it is of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues in deepfake detection. In this paper, we reformulate the correlation learning with variational Bayesian estimation, where audio-visual correlation is approximated as a Gaussian distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces from both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. Then, we factorize the variable into modality-specific and correlation-specific ones with an orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods in various benchmarks.
comment: TIFS AQE
Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and high-resolution images contribute thousands of visual tokens. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving 1.77x speedup and 70% FLOPs reduction.
Neural B-Frame Coding: Tackling Domain Shift Issues with Lightweight Online Motion Resolution Adaptation
Learned B-frame codecs with hierarchical temporal prediction often encounter the domain-shift issue due to mismatches between the Group-of-Pictures (GOP) sizes for training and testing, leading to inaccurate motion estimates, particularly for large motion. A common solution is to turn large motion into small motion by downsampling video frames during motion estimation. However, determining the optimal downsampling factor typically requires costly rate-distortion optimization. This work introduces lightweight classifiers to predict downsampling factors. These classifiers leverage simple state signals from current and reference frames to balance rate-distortion performance with computational cost. Three variants are proposed: (1) a binary classifier (Bi-Class) trained with Focal Loss to choose between high and low resolutions, (2) a multi-class classifier (Mu-Class) trained with novel soft labels based on rate-distortion costs, and (3) a co-class approach (Co-Class) that combines the predictive capability of the multi-class classifier with the selective search of the binary classifier. All classifier methods can work seamlessly with existing B-frame codecs without requiring codec retraining. Experimental results show that they achieve coding performance comparable to exhaustive search methods while significantly reducing computational complexity. The code is available at: https://github.com/NYCU-MAPL/Fast-OMRA.git.
comment: Accepted by TCAS-II: Express Briefs
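For readers unfamiliar with the Focal Loss used to train the binary classifier, the standard per-sample binary form is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The sketch below implements that generic formula only; the defaults gamma = 2 and alpha = 0.25 come from the original Focal Loss formulation and are assumptions here, not the settings reported for Bi-Class.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Per-sample binary focal loss.

    p     : predicted probability of the positive class, in [0, 1]
    y     : ground-truth label, 0 or 1
    gamma : focusing parameter; gamma = 0 gives alpha-weighted cross-entropy
    alpha : weight on the positive class (1 - alpha on the negative class)
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)             # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct sample is down-weighted far more than a hard one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
print(easy, hard)
```

The (1 - p_t)^gamma factor shrinks the loss of well-classified samples, so training focuses on the harder, typically rarer, cases.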
When Top-ranked Recommendations Fail: Modeling Multi-Granular Negative Feedback for Explainable and Robust Video Recommendation AAAI 2026
Existing video recommendation systems, relying mainly on ID-based embedding mapping and collaborative filtering, often fail to capture in-depth video content semantics. Moreover, most struggle to address biased user behaviors (e.g., accidental clicks, fast skips), leading to inaccurate interest modeling and frequent negative feedback in top recommendations with unclear causes. To tackle this issue, we collect real-world user video-watching sequences, annotate the reasons for users' dislikes, and construct a benchmark dataset for personalized explanations. We then introduce the Agentic Explainable Negative Feedback (ENF) framework, which integrates three core components: (1) the Profile Agent, extracting behavioral cues from users' historical data to derive psychological and personality profiles; (2) the Video Agent, performing comprehensive multimodal video analysis; and (3) the Reason Agent, synthesizing information from the other two agents to predict user engagement and generate explanations. Additionally, we propose the S-GRPO algorithm, enabling the model to progressively address complex tasks during reinforcement fine-tuning. Experimental results on the collected dataset show that our method significantly outperforms state-of-the-art baselines in negative feedback prediction and reason explanation. Notably, it achieves an 8.6% improvement over GPT-4o in reason classification. Deployment on the business platform further validates its benefits: increasing average user watch time by 6.2%, reducing the fast-skip rate by 9.4%, and significantly enhancing user satisfaction.
comment: Accepted in AAAI 2026
Multimodal Real-Time Anomaly Detection and Industrial Applications
This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initial lightweight implementation using YOLOv8, ByteTrack, and the Audio Spectrogram Transformer (AST), and an advanced version that incorporates multi-model audio ensembles, hybrid object detection, bidirectional cross-modal attention, and multi-method anomaly detection. The evolution demonstrates significant improvements in accuracy, robustness, and industrial applicability. The advanced system combines three audio models (AST, Wav2Vec2, and HuBERT) for comprehensive audio understanding, dual object detectors (YOLO and DETR) for improved accuracy, and sophisticated fusion mechanisms for enhanced cross-modal learning. Experimental evaluation shows the system's effectiveness in general monitoring scenarios as well as specialized industrial safety applications, achieving real-time performance on standard hardware while maintaining high accuracy.
Evaluation of Hardware-based Video Encoders on Modern GPUs for UHD Live-Streaming
Many GPUs have incorporated hardware-accelerated video encoders, which allow video encoding tasks to be offloaded from the main CPU and provide higher power efficiency. Over the years, support for newer video codecs such as H.265/HEVC, VP9, and AV1 has been added to the latest GPU boards. Recently, the rise of live video content such as VTuber streams, game live-streaming, and live event broadcasts has driven demand for high-efficiency hardware encoders in GPUs to handle these real-time video encoding tasks, especially at higher resolutions such as 4K/8K UHD. In this paper, the RD performance, encoding speed, and power consumption of hardware encoders in several generations of NVIDIA and Intel GPUs, as well as Qualcomm Snapdragon mobile SoCs, are evaluated and compared with their software counterparts, including the latest H.266/VVC codec, using several metrics including PSNR, SSIM, and the machine-learning-based VMAF. The results show that modern GPU hardware encoders can match the RD performance of software encoders in real-time encoding scenarios, and that while encoding speed has increased in newer hardware, the RD performance improvement between hardware generations is mostly negligible. Finally, the bitrate required for each hardware encoder to match YouTube transcoding quality was also calculated.
comment: The 33rd International Conference on Computer Communications and Networks (ICCCN 2024), 29-31 July 2024, Big Island, Hawaii, USA
A Survey of Generative Categories and Techniques in Multimodal Generative Models
Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.
comment: 9 pages, 10 figures, Proceedings of the 24th ACM Workshop on Hot Topics in Networks (HotNets 2025), College Park, Maryland, USA
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation--hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling--Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advancements, there is an absence of a benchmark for assessing AudioLLMs in financial scenarios, where audio data, such as earnings conference calls and CEO speeches, are crucial resources for financial analysis and investment decisions. In this paper, we introduce FinAudio, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. Then, we curate two short and two long audio datasets, respectively, and develop a novel dataset for financial audio summarization, comprising the FinAudio benchmark. Then, we evaluate seven prevalent AudioLLMs on FinAudio. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving AudioLLMs. All datasets and codes will be released.
Computation and Language
Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
Scalable Parameter-Light Spectral Method for Clustering Short Text Embeddings with a Cohesion-Based Evaluation Metric
Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces
Test-time scaling, which leverages additional computation during inference to improve model accuracy, has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems by understanding the goal, turning this goal into a plan, working through intermediate steps, and checking their own work before answering. Frontier large language models with reasoning capabilities, such as DeepSeek-R1 and OpenAI's gpt-oss, follow the same procedure when solving complex problems by generating intermediate reasoning traces before giving the final answer. Today, these models are being increasingly used to generate reasoning traces that serve as high-quality supervised data for post-training of small and medium-sized language models to teach reasoning capabilities without requiring expensive human curation. In this work, we compare the performance of medium-sized LLMs on math problems after post-training on two kinds of reasoning traces. We compare the impact of reasoning traces generated by DeepSeek-R1 and gpt-oss LLMs in terms of accuracy and inference efficiency.
Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval
Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.
What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models
Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.
MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset
This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today's digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles. It offers greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated this dataset using several deep learning and transfer learning models, including LSTM, BanglaT5-small, and MTS-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.
PRInTS: Reward Modeling for Long-Horizon Information Seeking
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
comment: 18 pages, code: https://github.com/G-JWLee/PRInTS
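Best-of-n selection with a step-level reward model, as mentioned in the PRInTS abstract, can be sketched generically as follows. The function `best_of_n` and the scorer `toy_score` are hypothetical stand-ins written for illustration; a real PRM would be a learned model, and this is not the PRInTS scoring or summarization procedure.

```python
from typing import Callable, Sequence


def best_of_n(
    candidates: Sequence[str],
    score_step: Callable[[str, str], float],
    context: str,
) -> str:
    """Return the candidate next step with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate step")
    return max(candidates, key=lambda step: score_step(context, step))


def toy_score(context: str, step: str) -> float:
    """Hypothetical scorer: counts words in the step not already in the context.

    It only stands in for a learned reward model so the example runs.
    """
    seen = set(context.lower().split())
    return float(sum(1 for word in step.lower().split() if word not in seen))


context = "search capital of France"
candidates = [
    "search capital of France",           # repeats the context, scores 0
    "open page about Paris population",   # adds new information
]
selected = best_of_n(candidates, toy_score, context)
print(selected)
```

In a long-horizon setting the `context` argument would be the place to pass a compressed trajectory summary rather than the full history, which is the role the abstract assigns to trajectory summarization.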
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.
MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings
A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.
comment: 19 pages (29 with appendix), 8 figures
CDLM: Consistency Diffusion Language Models For Faster Sampling
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
comment: 18 pages, 6 figures
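A block-wise causal attention mask of the kind mentioned in the CDLM abstract can be sketched as follows: positions attend bidirectionally within their own block and causally to all earlier blocks, so keys and values of finished blocks can be cached. The function name, the fixed `block_size`, and the boolean list-of-lists layout are assumptions for illustration, not the repository's implementation.

```python
def blockwise_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when query position i may attend to key j.

    Position i belongs to block i // block_size. A query sees every key in
    its own block and in all earlier blocks, and none in later blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    block_of = [pos // block_size for pos in range(seq_len)]
    return [
        [block_of[j] <= block_of[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Six positions in blocks of two: rows are queries, columns are keys.
mask = blockwise_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to the ordinary token-level causal mask, and with `block_size >= seq_len` it becomes fully bidirectional attention.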
A Nutrition Multimodal Photoplethysmography Language Model
Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.
comment: 21 pages, 2 figures
In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations
How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated a causal language model (phi-2) using a carefully curated corpus, with sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results point to a possible alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.
comment: Accepted at AICS2025
RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning EMNLP 2025
Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.
comment: EMNLP 2025 (Oral, Industry Track)
Representational Stability of Truth in Large Language Models
Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
comment: 25 pages, 24 figures
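The two-step procedure in the abstract above (train a linear probe, then measure how its decision boundary shifts when labels change) can be sketched on toy data. This is an illustration under stated assumptions, not the authors' pipeline: the probe here is a simple mean-difference linear classifier, the 2-D "activations" are invented, and `flip_rate` is a hypothetical stand-in for the flipped-judgement measure.

```python
def fit_mean_difference_probe(xs, ys):
    """Linear probe: w = mean(positives) - mean(negatives), boundary at the midpoint."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    dim = len(xs[0])
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wd * (p + n) / 2 for wd, p, n in zip(w, mu_pos, mu_neg))
    return w, b


def predict(probe, x):
    """Return 1 (true) or 0 (not true) for one activation vector."""
    w, b = probe
    return 1 if sum(wd * xd for wd, xd in zip(w, x)) + b > 0 else 0


def flip_rate(probe_a, probe_b, xs):
    """Fraction of inputs whose predicted label differs between two probes."""
    flips = sum(predict(probe_a, x) != predict(probe_b, x) for x in xs)
    return flips / len(xs)


# Invented toy activations (assumption): true, false, and "neither" statements.
TRUE_X = [[2.0, 0.0], [3.0, 0.0]]
FALSE_X = [[-2.0, 0.0], [-3.0, 0.0]]
NEITHER_X = [[1.0, 0.0], [1.5, 0.0]]
EVAL_X = [[0.5, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.8, 0.0]]

# Baseline probe: true vs. false only.
baseline = fit_mean_difference_probe(TRUE_X + FALSE_X, [1, 1, 0, 0])
# Perturbed probe: "neither" statements are added to the not-true class.
perturbed = fit_mean_difference_probe(
    TRUE_X + FALSE_X + NEITHER_X, [1, 1, 0, 0, 0, 0]
)

if __name__ == "__main__":
    print(flip_rate(baseline, perturbed, EVAL_X))
```

A larger flip rate means the probe's boundary moved more when the operational definition of "not true" changed.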
From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline for comparison. Experimental results show that the YOLO detector obtains a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
comment: Submitted to Expert Systems with Applications
Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization AAAI2026
Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs), enabling them to tackle complex multi-step tasks. While base LLMs, pre-trained on general text corpora, often struggle with reasoning due to a lack of specialized training, recent studies reveal their latent reasoning potential tied to hidden states. However, existing hidden state manipulation methods, such as linear activation steering, suffer from limitations due to their rigid and unconstrained nature, often leading to distribution shifts and degraded text quality. In this work, we propose a novel approach for eliciting CoT reasoning from base LLMs through hidden state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations across mathematical, commonsense, and logical reasoning benchmarks demonstrate that our approach consistently outperforms existing steering methods, offering a theoretically principled and effective solution for enhancing reasoning capabilities in base LLMs.
comment: AAAI2026
Emotion-Enhanced Multi-Task Learning with LLMs for Aspect Category Sentiment Analysis
Aspect category sentiment analysis (ACSA) has achieved remarkable progress with large language models (LLMs), yet existing approaches primarily emphasize sentiment polarity while overlooking the underlying emotional dimensions that shape sentiment expressions. This limitation hinders the model's ability to capture fine-grained affective signals toward specific aspect categories. To address this limitation, we introduce a novel emotion-enhanced multi-task ACSA framework that jointly learns sentiment polarity and category-specific emotions grounded in Ekman's six basic emotions. Leveraging the generative capabilities of LLMs, our approach enables the model to produce emotional descriptions for each aspect category, thereby enriching sentiment representations with affective expressions. Furthermore, to ensure the accuracy and consistency of the generated emotions, we introduce an emotion refinement mechanism based on the Valence-Arousal-Dominance (VAD) dimensional framework. Specifically, emotions predicted by the LLM are projected onto a VAD space, and those inconsistent with their corresponding VAD coordinates are re-annotated using a structured LLM-based refinement strategy. Experimental results demonstrate that our approach significantly outperforms strong baselines on all benchmark datasets. This underlines the effectiveness of integrating affective dimensions into ACSA.
comment: 8 pages, 4 figures
On the Optimality of Discrete Object Naming: a Kinship Case Study
The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.
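The "Bayesian decoder of the speaker" referred to in the abstract above can be made concrete with a small sketch. The abstract gives no formulas, so the notation is an assumption: a need distribution `prior[m]` over meanings and a speaker encoder `encoder[m][w]` giving the probability of word `w` for meaning `m`. The Bayesian decoder is then the posterior over meanings given a word, obtained by Bayes' rule. The kinship labels in the example are invented for illustration.

```python
def bayesian_decoder(prior, encoder):
    """Return decoder[w][m] = P(m | w) from P(m) and the speaker's P(w | m)."""
    joint = {}
    for meaning, p_m in prior.items():
        for word, p_w_given_m in encoder.get(meaning, {}).items():
            joint.setdefault(word, {})[meaning] = p_m * p_w_given_m
    decoder = {}
    for word, scores in joint.items():
        total = sum(scores.values())
        if total > 0:  # words that are never produced get no posterior
            decoder[word] = {m: s / total for m, s in scores.items()}
    return decoder


# Hypothetical toy system: two kin types, two words.
PRIOR = {"mother": 0.5, "aunt": 0.5}
ENCODER = {
    "mother": {"mama": 1.0},
    "aunt": {"mama": 0.5, "tia": 0.5},
}

if __name__ == "__main__":
    for word, posterior in bayesian_decoder(PRIOR, ENCODER).items():
        print(word, posterior)
```

In this toy system a listener hearing "mama" should split belief two to one between the two kin types, because the speaker uses that word for both.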
A symbolic Perl algorithm for the unification of Nahuatl word spellings
In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model is based on algorithms that we have previously used to analyze sentences in Nawatl, and on the corpus called $π$-yalli, consisting of texts in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules as symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing them in a sentence semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences.
comment: MICAI 2025, LNAI 16221, pp. 141-154, 2026. 10 pages, 4 Figures, 8 Tables
A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis AAAI 2026
In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM's insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR-Agent.
comment: This paper has been accepted by AAAI 2026 (Main Technical Track)
GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.
Logic of Montage
In expressing emotions, as an expression form separate from natural language, we propose an alternative form that complements natural language, acting as a proxy or window for emotional states. First, we set up an expression form "Effect of Contradictory Structure." "Effect of Contradictory Structure" is not static but dynamic. Effect in "Effect of Contradictory Structure" is unpleasant or pleasant, and the orientation to avoid that unpleasantness is considered pseudo-expression of will. Second, "Effect of Contradictory Structure" can be overlapped with each other. This overlapping operation is called "montage." A broader "Structure" that includes related "Effect of Contradictory Structure" and "Effect of Structure" are set up. Montage produces "Effect of Structure". In montage, it is necessary to set something like "strength," so we adopted Deleuze and Deleuze/Guattari's word "intensity" and set it as an element of our model. We set up a general theoretical framework - Word Import Between Systems (Models) and justified the import of "intensity" through Austin's use of the word "force." "Effect of Structure" process is demonstrated using the example of proceeding to the next level of education.
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Large language models demonstrate powerful capabilities across various natural language processing tasks, yet they also harbor safety vulnerabilities. To enhance LLM safety, various jailbreak defense methods have been proposed to guard against harmful outputs. However, improvements in model safety often come at the cost of severe over-refusal, failing to strike a good balance between safety and usability. In this paper, we first analyze the causes of over-refusal from a representation perspective, revealing that over-refusal samples reside at the boundary between benign and malicious samples. Based on this, we propose MOSR, designed to mitigate over-refusal by intervening on the safety representation of LLMs. MOSR incorporates two novel components: (1) Overlap-Aware Loss Weighting, which determines the erasure weight for malicious samples by quantifying their similarity to pseudo-malicious samples in the representation space, and (2) Context-Aware Augmentation, which supplements the necessary context for rejection decisions by adding harmful prefixes before rejection responses. Experiments demonstrate that our method outperforms existing approaches in mitigating over-refusal while largely maintaining safety. Overall, we advocate that future defense methods should strike a better balance between safety and over-refusal.
Classification EM-PCA for clustering and embedding
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used, and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating the parameters from which the clustering is inferred. Although these models are popular in various domains, including image clustering, they suffer from high dimensionality and from the slow convergence of the EM algorithm. The Classification EM (CEM) algorithm, a classifying version of EM, converges quickly, but dimensionality reduction remains a challenge. We therefore propose an algorithm that combines the two tasks, data embedding and clustering, simultaneously rather than sequentially, relying on Principal Component Analysis (PCA) and CEM. We demonstrate the value of this approach in terms of clustering and data embedding. We also establish connections with other clustering approaches.
comment: Accepted at the IEEE conference on Big Data (Special Session on Machine Learning)
Knowledge-based Graphical Method for Safety Signal Detection in Clinical Trials
We present a graphical, knowledge-based method for reviewing treatment-emergent adverse events (AEs) in clinical trials. The approach enhances MedDRA by adding a hidden medical knowledge layer (Safeterm) that captures semantic relationships between terms in a 2-D map. Using this layer, AE Preferred Terms can be regrouped automatically into similarity clusters, and their association to the trial disease may be quantified. The Safeterm map is available online and connected to aggregated AE incidence tables from ClinicalTrials.gov. For signal detection, we compute treatment-specific disproportionality metrics using shrinkage incidence ratios. Cluster-level EBGM values are then derived through precision-weighted aggregation. Two visual outputs support interpretation: a semantic map showing AE incidence and an expectedness-versus-disproportionality plot for rapid signal detection. Applied to three legacy trials, the automated method clearly recovers all expected safety signals. Overall, augmenting MedDRA with a medical knowledge layer improves clarity, efficiency, and accuracy in AE interpretation for clinical trials.
comment: 13 pages, 3 tables, 5 figures
SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
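The rotate-and-prune idea in the SWAN abstract rests on a standard fact: an orthogonal matrix preserves dot products, so query-key scores can be computed in the rotated basis, and dropping rotated coordinates gives an approximation that needs no reconstruction step. The sketch below only illustrates that fact. How SWAN builds its offline matrix, which coordinates it prunes, how values are handled, and the dense buffer are not described in the abstract; keeping the leading `keep` coordinates is an assumption, and all names are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate_and_prune(rotation, vec, keep):
    """Rotate vec, then store only the first `keep` coordinates."""
    return matvec(rotation, vec)[:keep]


def pruned_score(rotation, query, key, keep):
    """Attention logit computed directly on the pruned, rotated vectors."""
    return dot(
        rotate_and_prune(rotation, query, keep),
        rotate_and_prune(rotation, key, keep),
    )


# Hypothetical 2-D example: a rotation by 30 degrees is an orthogonal matrix.
THETA = math.radians(30)
ROTATION = [
    [math.cos(THETA), -math.sin(THETA)],
    [math.sin(THETA), math.cos(THETA)],
]
QUERY, KEY = [1.0, 2.0], [0.5, -1.5]

if __name__ == "__main__":
    print("exact score:   ", dot(QUERY, KEY))
    print("keep all dims: ", pruned_score(ROTATION, QUERY, KEY, keep=2))
    print("keep one dim:  ", pruned_score(ROTATION, QUERY, KEY, keep=1))
```

With every coordinate kept, the rotated score matches the original up to floating-point error. With fewer coordinates, the cached key is smaller and the score is approximate.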
Skeletons Matter: Dynamic Data Augmentation for Text-to-Query EMNLP 2025
The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.
comment: Accepted at EMNLP 2025
Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a low-latency verification layer than as a reliable analytical tool, with clear room for improvement.
comment: 10 pages, 8 figures
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
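The second strategy in the abstract above, replacing LR decay with a weighted average of the final few checkpoints, can be sketched as follows. The abstract does not give the weighting scheme or the number of checkpoints, so both are left as inputs here. Representing a checkpoint as a dict of flat float lists and the name `average_checkpoints` are assumptions made for illustration.

```python
def average_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts with identical keys and shapes.

    checkpoints: list of {parameter_name: list of floats}
    weights:     one non-negative weight per checkpoint (normalized internally)
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint and at least one checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    averaged = {}
    for name, first_values in checkpoints[0].items():
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(first_values))
        ]
    return averaged


if __name__ == "__main__":
    # Two hypothetical late checkpoints; the later one gets three times the weight.
    late_checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(late_checkpoints, weights=[1.0, 3.0]))
```

Averaging smooths out the noise that a constant learning rate leaves in the final weights, which is the role LR decay would otherwise play.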
Reproducibility Study of Large Language Model Bayesian Optimization ICLR 2024
In this reproducibility study, we revisit the LLAMBO framework of Daxberger et al. (2024), a prompting-based Bayesian optimization (BO) method that uses large language models as discriminative surrogates and acquisition optimizers via text-only interactions. We replicate the core Bayesmark and HPOBench experiments under the original evaluation protocol, but replace GPT-3.5 with the open-weight Llama 3.1 70B model used for all text encoding components. Our results broadly confirm the main claims of LLAMBO. Contextual warm starting via textual problem and hyperparameter descriptions substantially improves early regret behaviour and reduces variance across runs. LLAMBO's discriminative surrogate is weaker than GP or SMAC as a pure single task regressor, yet benefits from cross task semantic priors induced by the language model. Ablations that remove textual context markedly degrade predictive accuracy and calibration, while the LLAMBO candidate sampler consistently generates higher quality and more diverse proposals than TPE or random sampling. Experiments with smaller backbones (Gemma 27B, Llama 3.1 8B) yield unstable or invalid predictions, suggesting insufficient capacity for reliable surrogate behaviour. Overall, our study shows that the LLAMBO architecture is robust to changing the language model backbone and remains effective when instantiated with Llama 3.1 70B.
comment: 7 pages, 8 figures. Reproducibility study of the LLAMBO framework (ICLR 2024). Code: https://github.com/spagnoloG/llambo-reproducibility
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL'25
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
comment: ACL'25
Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.
comment: Under Review
Generating Reading Comprehension Exercises with Large Language Models for Educational Applications
With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLM framework, Reading Comprehension Exercise Generation (RCEG), which automatically generates high-quality, personalized English reading comprehension exercises. First, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate, which greatly improves the quality of the generated content. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed for the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants. To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1k norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.
Cognitive Alpha Mining via LLM-Driven Code-Based Evolution
Discovering effective predictive signals, or "alphas," from financial data with high dimensionality and extremely low signal-to-noise ratio remains a difficult open problem. Despite progress in deep learning, genetic programming, and, more recently, large language model (LLM)-based factor generation, existing approaches still explore only a narrow region of the vast alpha search space. Neural models tend to produce opaque and fragile patterns, while symbolic or formula-based methods often yield redundant or economically ungrounded expressions that generalize poorly. Although different in form, these paradigms share a key limitation: none can conduct broad, structured, and human-like exploration that balances logical consistency with creative leaps. To address this gap, we introduce the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level alpha representation with LLM-driven reasoning and evolutionary search. Treating LLMs as adaptive cognitive agents, our framework iteratively refines, mutates, and recombines alpha candidates through multi-stage prompts and financial feedback. This synergistic design enables deeper thinking, richer structural diversity, and economically interpretable alpha discovery, while greatly expanding the effective search space. Experiments on A-share equities demonstrate that CogAlpha consistently discovers alphas with superior predictive accuracy, robustness, and generalization over existing methods. Our results highlight the promise of aligning evolutionary optimization with LLM-based reasoning for automated and explainable alpha discovery. All source code will be released.
Large Language Models for the Summarization of Czech Documents: From History to the Present
Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. In addition, we also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech. Our study makes the following main contributions: We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech. We introduce a new dataset, Posel od Čerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area. By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.
A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.
Concept than Document: Context Compression via AMR-based Conceptual Entropy
Large Language Models (LLMs) face information overload when handling long contexts, particularly in Retrieval-Augmented Generation (RAG) where extensive supporting documents often introduce redundant content. This issue not only weakens reasoning accuracy but also increases computational overhead. We propose an unsupervised context compression framework that exploits Abstract Meaning Representation (AMR) graphs to preserve semantically essential information while filtering out irrelevant text. By quantifying node-level entropy within AMR graphs, our method estimates the conceptual importance of each node, enabling the retention of core semantics. Specifically, we construct AMR graphs from raw contexts, compute the conceptual entropy of each node, and select the most informative nodes to form a context that is more condensed and semantically focused than the raw documents. Experiments on the PopQA and EntityQuestions datasets show that our method outperforms vanilla and other baselines, achieving higher accuracy while substantially reducing context length. To the best of our knowledge, this is the first work introducing AMR-based conceptual entropy for context compression, demonstrating the potential of stable linguistic features in context engineering.
Assessing the alignment between infants' visual and linguistic experience using multimodal language models
Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.
HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations
Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincaré manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
comment: 12 pages
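As a rough illustration of why hyperbolic space gives a notion of hierarchical depth, the sketch below computes the standard geodesic distance in the Poincaré ball model. It is not taken from the HyperbolicRAG paper; the function name `poincare_distance` and the plain-list vector representation are assumptions made for this example.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball.

    Uses the standard formula
        d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Illustrative helper only; not the paper's implementation.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# The same Euclidean gap (0.1) costs more hyperbolic distance near the boundary,
# which is why points near the origin can act as "general" ancestors and points
# near the boundary as "specific" descendants.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

In this toy setup `near_boundary` is larger than `near_origin`, even though both pairs are 0.1 apart in Euclidean terms.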
Context-Aware Whisper for Arabic ASR Under Linguistic Varieties
Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
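For readers unfamiliar with the metric, the following minimal sketch computes word error rate (WER) as word-level edit distance divided by reference length, plus a relative-reduction helper. The abstract does not state whether its reported reductions are relative or absolute, so `relative_reduction` is only one possible reading; both function names are assumptions for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words.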
Robust Multimodal Sentiment Analysis with Distribution-Based Feature Recovery and Fusion ACM MM 2024
As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive results by jointly harnessing image and text information, they do not account for possibly low-quality or missing modalities. In real-world applications, these issues may occur frequently, creating an urgent need for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate its feature distribution, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the universal improvements of DRF compared to SOTA methods under both strategies, validating its effectiveness in robust multimodal sentiment analysis.
comment: Accepted by ACM MM 2024
Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search
Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
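The headline number above is stated in terms of macro F1. As a reminder of what that metric measures, here is a minimal sketch that averages per-class F1 with equal weight per class; the function name `macro_f1` and the toy labels in the usage note are assumptions for illustration, not the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With `y_true = ["true", "true", "false", "false"]` and `y_pred = ["true", "false", "false", "false"]`, the per-class scores are 2/3 and 4/5, so the macro average is 11/15. Read as a relative gain, an improvement of 233% would mean the new score is about 3.33 times the old one.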
RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context
Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline that proceeds from planning to searching to writing a report, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.
Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models
This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN employs four stages: Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results demonstrate that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings emphasize ECN's potential for applications requiring empathy and inclusivity in conversational AI.
CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
Gender Bias in Emotion Recognition by Large Language Models AAAI 2026
The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, "How does this person feel?". Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training-based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.
comment: Accepted at AAAI 2026 Workshop (WS37)
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.
comment: 17 pages, 9 figures, work in progress
What does it mean to understand language?
Language understanding entails not just extracting the surface-level meaning of the linguistic input, but constructing rich mental models of the situation it describes. Here we propose that because processing within the brain's core language system is fundamentally limited, deeply understanding language requires exporting information from the language system to other brain regions that compute perceptual and motor representations, construct mental models, and store our world knowledge and autobiographical memories. We review the existing evidence for this hypothesis, and argue that recent progress in cognitive neuroscience provides both the conceptual foundation and the methods to directly test it, thus opening up a new strategy to reveal what it means, cognitively and neurally, to understand language.
Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation
Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transformer-based embedding models adapted for cardiology through Low-Rank Adaptation (LoRA) fine-tuning on 106,535 cardiology text pairs derived from authoritative medical textbooks. Results demonstrate that encoder-only architectures, particularly BioLinkBERT, achieve superior domain-specific performance (separation score: 0.510) compared to larger decoder-based models, while requiring significantly fewer computational resources. The findings challenge the assumption that larger language models necessarily produce better domain-specific embeddings and provide practical guidance for clinical NLP system development. All models, training code, and evaluation datasets are publicly available to support reproducible research in medical informatics.
comment: 25 pages, 13 figures, 5 tables
Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian
Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.
Fara-7B: An Efficient Agentic Model for Computer Use
Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench -- our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.
Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search
Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
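The idea of executing a predicted relation sequence by breadth-first search can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the LLM planning step is replaced by a hard-coded relation list, and `build_index`, `execute_plan`, and the toy `TRIPLES` are hypothetical names and data.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Inception"),
    ("Christopher Nolan", "directed", "Interstellar"),
]


def build_index(triples):
    """Map (head, relation) to the set of tail entities."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def execute_plan(index, start_entity, relation_sequence):
    """Follow the relation sequence level by level (breadth-first).

    Every returned entity is reached through edges that exist in the graph,
    so the answer set is grounded by construction.
    """
    frontier = {start_entity}
    for relation in relation_sequence:
        next_frontier = set()
        for entity in frontier:
            next_frontier |= index.get((entity, relation), set())
        frontier = next_frontier
        if not frontier:
            break  # the plan cannot be completed from this start entity
    return frontier


# Stand-in for a plan that an LLM might predict for the question
# "Which films share a director with Inception?"
plan = ["directed_by", "directed"]
answers = execute_plan(build_index(TRIPLES), "Inception", plan)
```

Because the search only expands along the planned relations, each hop touches one frontier of entities rather than enumerating all paths.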
Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration
This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.
comment: PhD thesis, EPFL. 396 pages, 156 figures
MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study
In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in three stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is open source: https://github.com/LLM4Rocq/miniF2F-rocq.
Information Extraction From Fiscal Documents Using LLMs
Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A major challenge with traditional OCR methods is the inability to verify that numbers have been extracted accurately. In fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing-country contexts.
comment: 6 pages. Presented at the AI for Financial Inclusion, Risk Modeling and Resilience in Emerging Markets workshop at ACM ICAIF 2025 Singapore
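The multi-level validation idea, checking that each reported total matches the sum of its line items, can be sketched as below. The nested-dict layout with `label`, `amount`, and `children` keys, the `tolerance` parameter, and the function name `validate_totals` are all assumptions for this example rather than the paper's data model.

```python
def validate_totals(node, tolerance=0.5):
    """Return (path, reported, computed) for every total that does not add up.

    `node` is a dict with "label", "amount", and an optional "children" list
    of dicts in the same shape (a hypothetical schema for illustration).
    """
    mismatches = []

    def visit(current, path):
        here = path + [current["label"]]
        children = current.get("children", [])
        if not children:
            return  # leaf line item: nothing to cross-check
        computed = sum(child["amount"] for child in children)
        if abs(computed - current["amount"]) > tolerance:
            mismatches.append(("/".join(here), current["amount"], computed))
        for child in children:
            visit(child, here)

    visit(node, [])
    return mismatches
```

An empty result means every level of the hierarchy is internally consistent; a non-empty result pinpoints the subtree where an extracted number is likely wrong, which narrows down what needs to be re-extracted or reviewed.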
PEANuT: Parameter-Efficient Adaptation with Weight-aware Neural Tweakers
Fine-tuning large pre-trained foundation models often yields excellent downstream performance but is prohibitively expensive when updating all parameters. Parameter-efficient fine-tuning (PEFT) methods such as LoRA alleviate this by introducing lightweight update modules, yet they commonly rely on weight-agnostic linear approximations, limiting their expressiveness. In this work, we propose PEANuT, a novel PEFT framework that introduces weight-aware neural tweakers, compact neural modules that generate task-adaptive updates conditioned on frozen pre-trained weights. PEANuT provides a flexible yet efficient way to capture complex update patterns without full model tuning. We theoretically show that PEANuT achieves equivalent or greater expressivity than existing linear PEFT methods with comparable or fewer parameters. Extensive experiments across four benchmarks with over twenty datasets demonstrate that PEANuT consistently outperforms strong baselines in both NLP and vision tasks, while maintaining low computational overhead.
Sentence Smith: Controllable Edits for Evaluating Text Embeddings EMNLP 2025
Controllable and transparent text generation has been a long-standing goal in NLP. Almost as long-standing is a general idea for addressing this challenge: Parsing text to a symbolic representation, and generating from it. However, earlier approaches were hindered by parsing and generation insufficiencies. Using modern parsers and a safety supervision mechanism, we show how close current methods come to this goal. Concretely, we propose the Sentence Smith framework for English, which has three steps: 1. Parsing a sentence into a semantic graph. 2. Applying human-designed semantic manipulation rules. 3. Generating text from the manipulated graph. A final entailment check (4.) verifies the validity of the applied transformation. To demonstrate our framework's utility, we use it to induce hard negative text pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can evaluate text embedding models in a fine-grained way, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that our transparent generation process produces texts of good quality. Notably, our way of generation is very resource-efficient, since it relies only on smaller neural networks.
comment: EMNLP 2025 (main), this version fixes a subscript typo in Eq 1
Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them EMNLP 2025
We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.
comment: Published in the Findings of the Association for Computational Linguistics: EMNLP 2025
How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective AAAI 2026
Multilingual Alignment is an effective and representative paradigm to enhance LLMs' multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, some research on language-specific neurons provides a new perspective to analyze and understand LLMs' mechanisms. However, we find that there are many neurons that are shared by multiple but not all languages and cannot be correctly classified. In this work, we propose a ternary classification methodology that categorizes neurons into three types: language-specific neurons, language-related neurons, and general neurons. We also propose a corresponding identification algorithm to distinguish these different types of neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights to better understand multilingual alignment and multilingual capabilities of LLMs.
comment: AAAI 2026 (Oral)
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation AACL 2025
Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
comment: Accepted at AACL 2025 (Main Conference Paper)
Strategic Innovation Management in the Age of Large Language Models: Market Intelligence, Adaptive R&D, and Ethical Governance
This study analyzes the multiple functions of Large Language Models (LLMs) in transforming research and development (R&D) processes. By automating knowledge discovery, boosting hypothesis creation, integrating transdisciplinary insights, and enabling cooperation within innovation ecosystems, LLMs dramatically improve the efficiency and effectiveness of research processes. Through extensive analysis of scientific literature, patent databases, and experimental data, these models enable more flexible and informed R&D workflows, ultimately accelerating innovation cycles and lowering time-to-market for breakthrough ideas.
Word-level Annotation of GDPR Transparency Compliance in Privacy Policies using Large Language Models
Ensuring transparency of data practices related to personal information is a core requirement of the General Data Protection Regulation (GDPR). However, large-scale compliance assessment remains challenging due to the complexity and diversity of privacy policy language. Manual audits are labour-intensive and inconsistent, while current automated methods often lack the granularity required to capture nuanced transparency disclosures. In this paper, we present a modular large language model (LLM)-based pipeline for fine-grained word-level annotation of privacy policies with respect to GDPR transparency requirements. Our approach integrates LLM-driven annotation with passage-level classification, retrieval-augmented generation, and a self-correction mechanism to deliver scalable, context-aware annotations across 21 GDPR-derived transparency requirements. To support empirical evaluation, we compile a corpus of 703,791 English-language privacy policies and generate a ground-truth sample of 200 manually annotated policies based on a comprehensive, GDPR-aligned annotation scheme. We propose a two-tiered evaluation methodology capturing both passage-level classification and span-level annotation quality and conduct a comparative analysis of seven state-of-the-art LLMs on two annotation schemes, including the widely used OPP-115 dataset. The results of our evaluation show that decomposing the annotation task and integrating targeted retrieval and classification components significantly improve annotation accuracy, particularly for well-structured requirements. Our work provides new empirical resources and methodological foundations for advancing automated transparency compliance assessment at scale.
comment: Accepted to Proceedings on Privacy Enhancing Technologies (PoPETs) 1 (2026)
A Survey of Generative Categories and Techniques in Multimodal Generative Models
Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have recently proposed a number of self-improving software agents, including the Darwin-Gödel Machine (DGM). However, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on the fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
Lost in translation: using global fact-checks to measure multilingual misinformation prevalence, spread, and evolution
Misinformation and disinformation are growing threats in the digital age, affecting people across languages and borders. However, no research has investigated the prevalence of multilingual misinformation and quantified the extent to which misinformation diffuses across languages. This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of 264,487 fact-checks spanning 95 languages. To study the evolution of claims over time and mutations across languages, we represent fact-checks with multilingual sentence embeddings and build a graph where semantically similar claims are linked. We provide quantitative evidence of repeated fact-checking efforts and establish that claims diffuse across languages. Specifically, we find that while the majority of misinformation claims are only fact-checked once, 10.26%, corresponding to more than 27,000 claims, are checked multiple times. Using fact-checks as a proxy for the spread of misinformation, we find 32.26% of repeated claims cross linguistic boundaries, suggesting that some misinformation permeates language barriers. However, spreading patterns exhibit strong assortativity, with misinformation more likely to spread within the same language or language family. Next we show that fact-checkers take more time to fact-check claims that have crossed language barriers and model the temporal and cross-lingual evolution of claims. We analyze connected components and shortest paths connecting different versions of a claim finding that claims gradually drift over time and undergo greater alteration when traversing languages. Misinformation changes over time, reducing the effectiveness of static claim matching algorithms. The findings advocate for expanded information sharing between fact-checkers globally while underscoring the importance of localized verification.
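The graph construction described in this abstract can be sketched as follows. This is a minimal illustration only: the toy 2-D vectors stand in for multilingual sentence embeddings, and the cosine threshold of 0.9 is a hypothetical value, since the abstract states neither the embedding model nor the linking criterion. Claims whose similarity reaches the threshold are linked, connected components are found with union-find, and components spanning more than one language are reported.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def claim_components(embeddings, threshold=0.9):
    """Group claim indices into connected components of the similarity graph.

    `threshold` is an assumed value, not one taken from the paper.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(embeddings)):
        groups[find(i)].append(i)
    return sorted(groups.values())

# Toy data: (language code, embedding). Purely illustrative.
claims = [
    ("en", [1.0, 0.0]),
    ("es", [0.99, 0.05]),
    ("en", [0.0, 1.0]),
    ("fr", [0.05, 0.99]),
    ("de", [-1.0, 0.0]),
]
components = claim_components([vec for _, vec in claims])
# Components whose members were fact-checked in more than one language.
cross_lingual = [c for c in components if len({claims[i][0] for i in c}) > 1]
```

The pairwise loop is quadratic in the number of claims; at the scale of the corpus in the abstract, an approximate nearest-neighbour index would presumably be needed instead.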
In-Situ Tweedie Discrete Diffusion Models
While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
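The denoising loop described in this abstract (predict class probabilities, take the argmax, convert to one-hot, re-noise at a lower level) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the trained, timestep-conditioned network, the noise schedule `sigmas` is made up, and starting from zero-mean Gaussian noise is an assumption.

```python
import math
import random

def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]

def add_gaussian_noise(vec, sigma, rng):
    """Corrupt a vector with i.i.d. Gaussian noise of standard deviation sigma."""
    return [v + sigma * rng.gauss(0.0, 1.0) for v in vec]

def toy_denoiser(noisy_vec, sigma):
    """Hypothetical stand-in for the trained network: a noise-aware softmax."""
    scale = 1.0 / (1.0 + sigma)
    m = max(noisy_vec)
    exps = [math.exp(scale * (v - m)) for v in noisy_vec]
    total = sum(exps)
    return [e / total for e in exps]

def sample(num_classes, sigmas, rng):
    """Iterative denoising: predict, argmax, one-hot, re-noise at the next sigma."""
    # Assumption: sampling starts from pure noise at the largest noise level.
    x = add_gaussian_noise([0.0] * num_classes, sigmas[0], rng)
    for step, sigma in enumerate(sigmas):
        probs = toy_denoiser(x, sigma)
        prediction = max(range(num_classes), key=probs.__getitem__)
        clean = one_hot(prediction, num_classes)
        # The final step adds no noise, so the output is exactly one-hot.
        next_sigma = sigmas[step + 1] if step + 1 < len(sigmas) else 0.0
        x = add_gaussian_noise(clean, next_sigma, rng)
    return x

result = sample(num_classes=4, sigmas=[1.0, 0.5, 0.1], rng=random.Random(0))
```

Training, per the abstract, would fit the network with a cross-entropy loss against the clean class at each noise level; that part is omitted here.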
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. Besides, improving GSM robustness via AbstRaL is shown to also implicitly benefit LLMs' capabilities on OOD mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.
comment: Under review
URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training NeurIPS 2025
Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
comment: NeurIPS 2025, Camera Ready
ModernBERT is More Efficient than Conventional BERT for Chest CT Findings Classification in Japanese Radiology Reports
Japanese language models for medical text classification face challenges with complex vocabulary and linguistic structures in radiology reports. This study compared three Japanese models--BERT Base, JMedRoBERTa, and ModernBERT--for multi-label classification of 18 chest CT findings. Using the CT-RATE-JPN dataset, all models were fine-tuned under identical conditions. ModernBERT showed clear efficiency advantages, producing substantially fewer tokens and achieving faster training and inference than the other models while maintaining comparable performance on the internal test dataset (exact match accuracy: 74.7% vs. 72.7% for BERT Base). To assess generalizability, we additionally constructed RR-Findings, an external dataset of 243 naturally written Japanese radiology reports annotated using the same schema. Under this domain-shifted setting, performance differences became pronounced: BERT Base outperformed both JMedRoBERTa and ModernBERT, whereas ModernBERT showed the largest decline in exact match accuracy. Average precision differences were smaller, indicating that ModernBERT retained reasonable ranking ability despite reduced calibration. Overall, ModernBERT offers substantial computational efficiency and strong in-domain performance but remains sensitive to real-world linguistic variability. These results highlight the need for more diverse natural-language training data and domain-specific calibration strategies to improve robustness when deploying modern transformer models in heterogeneous clinical environments.
comment: 31 pages
Entropy-Guided Reasoning Compression
Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process -- the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
comment: 10 pages, 4 figures
Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.
comment: 31 pages
TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
The high inference cost of Large Language Models (LLMs) poses challenges, especially for tasks requiring lengthy outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We have observed that LLMs can generate distilled language (i.e., concise outputs that retain essential meaning) when prompted appropriately. We propose TRIM, a pipeline for saving computational cost in which the LLM omits a predefined set of semantically irrelevant and easily inferable words based on the context during inference. Then, a specifically trained smaller language model with lower inference cost reconstructs the distilled answer into the ideal answer. Our experiments show promising results, particularly on the proposed NaLDA evaluation dataset focused on the reconstruction task, with 19.4% saved tokens on average for GPT-4o and only a tiny decrease in evaluation metrics. This suggests that the approach can effectively balance efficiency and accuracy in language processing tasks.
comment: 16 pages, 9 tables, 5 figures
Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home? EMNLP
Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for personalized usages. However, delivering private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target data point exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce a novel similarity-based MIA detection framework designed for the RAG system. With the proposed method, we show that a simple detect-and-hide strategy can successfully obfuscate attackers, maintain data utility, and remain system-agnostic against MIA. We experimentally prove its detection and defense against various state-of-the-art MIA methods and its adaptability to existing RAG systems.
comment: Accepted for EMNLP findings 2025
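The insight in the abstract above, that membership inference queries tend to be highly similar to only one target document, suggests a simple detect-and-hide check. The sketch below is a hedged illustration, not the paper's method: the function names are hypothetical, and the `high` and `margin` thresholds are assumed values. A query is flagged when its top retrieval similarity is high and well separated from the runner-up, and the single matching document is then withheld from the LLM context.

```python
def detect_mia(similarities, high=0.9, margin=0.3):
    """Flag a query whose top similarity is high and isolated from the rest.

    `high` and `margin` are assumed thresholds for illustration only.
    """
    ranked = sorted(similarities, reverse=True)
    if not ranked or ranked[0] < high:
        return False
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up >= margin

def detect_and_hide(docs, similarities, high=0.9, margin=0.3):
    """Return the documents to pass to the LLM, dropping the suspicious match."""
    if not detect_mia(similarities, high, margin):
        return list(docs)
    top = max(range(len(docs)), key=similarities.__getitem__)
    return [d for i, d in enumerate(docs) if i != top]

# A near-verbatim probe matches one document far more strongly than the others.
probe_context = detect_and_hide(["a", "b", "c"], [0.97, 0.41, 0.38])
# An ordinary query has several moderately similar documents and is left alone.
benign_context = detect_and_hide(["a", "b", "c"], [0.82, 0.79, 0.75])
```

Because the check uses only retrieval scores, it sits between the retriever and the generator and needs no change to either, which is one way to read the abstract's "system-agnostic" claim.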
Evaluation of OpenAI o1: Opportunities and Challenges of AGI
This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include:
- 83.3% success rate in solving complex competitive programming problems, surpassing many human experts.
- Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models.
- 100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions.
- Advanced natural language inference capabilities across general and specialized domains like medicine.
- Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis.
- Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields.
- Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills.
- Effective performance in social media analysis, including sentiment analysis and emotion recognition.
The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation
Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
SGM: A Framework for Building Specification-Guided Moderation Filters
Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.
DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning
In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models EMNLP 2025
Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.
comment: EMNLP 2025 Findings camera-ready version
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL from users' natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we summarize our findings, discuss the remaining challenges in this field, and suggest future research directions. All related resources on LLM-based text-to-SQL, including research papers, benchmarks, and open-source projects, are collected for the community in our repository: https://github.com/DEEP-PolyU/Awesome-LLM-based-Text2SQL.
comment: Accepted to IEEE TKDE 2025
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations NeurIPS 2025
The success of Direct Preference Optimization (DPO) in mitigating hallucinations in Vision Language Models (VLMs) critically hinges on the true reward gaps within preference pairs. However, current methods, typically relying on ranking or rewriting strategies, often struggle to optimize these reward gaps in a systematic way during data curation. A core difficulty lies in precisely characterizing and strategically manipulating the overall reward gap configuration, that is, the deliberate design of how to shape these reward gaps within each preference pair across the data. To address this, we introduce Topic-level Preference Rewriting (TPR), a novel framework designed for the systematic optimization of reward gap configuration. By selectively replacing semantic topics within VLM responses with the model's own resampled candidates for targeted rewriting, TPR can provide topic-level control over fine-grained semantic details. This precise control enables advanced data curation strategies, such as progressively adjusting the difficulty of rejected responses, thereby sculpting an effective reward gap configuration that guides the model to overcome challenging hallucinations. Comprehensive experiments demonstrate that TPR achieves state-of-the-art performance on multiple hallucination benchmarks, outperforming previous methods by an average of 20%. Notably, it significantly reduces hallucinations by up to 93% on ObjectHal-Bench, and also exhibits superior data efficiency towards robust and cost-effective VLM alignment. Code and datasets are available at https://tpr-dpo.github.io.
comment: 34 pages, 12 figures, Accepted by NeurIPS 2025
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.
comment: 20 pages, 13 Figures
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models AAAI 2026
Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as 'jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.
comment: Accepted as a workshop paper at AAAI 2026
Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching EMNLP 2025
Recent large language models (LLMs) demonstrate multilingual abilities, yet they are English-centric due to the dominance of English in training corpora. The limited resources available for low-resource languages remain a crucial challenge. Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that can otherwise be lost in translation, and can elicit language-specific knowledge in human communication. In light of this, we investigate whether code-switching can activate, that is, identify and leverage, knowledge for reasoning when LLMs solve low-resource language tasks. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide a comprehensive analysis of a variety of multilingual LLMs by subdividing the activation process into knowledge identification and knowledge leveraging. Our results demonstrate that, compared to English text, CS can faithfully activate knowledge inside LLMs, especially in language-specific domains, suggesting the potential of code-switching for low-resource language tasks.
comment: Accepted to EMNLP 2025 Findings
SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to 2.53x time-to-first-token (TTFT) speedup and 1.88x end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code is available at https://github.com/Longxmas/SlimInfer.
REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover, using only supervised fine-tuning, achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
comment: 31 pages, 27 figures
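The abstract above relies on pass@k at large k. As a minimal sketch, the block below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct; whether the paper uses exactly this estimator is an assumption, and the per-problem counts are invented purely to show how a model can lead at k = 1 yet trail at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts (not from the paper): correct samples out of n = 10
# on two problems. The "rl" model is sharper on problem 1 but never
# solves problem 2; the "base" model solves both, but rarely.
rl_counts = [8, 0]
base_counts = [3, 1]

for k in (1, 10):
    print(k, mean_pass_at_k(rl_counts, 10, k), mean_pass_at_k(base_counts, 10, k))
```

With these made-up counts the "rl" model has the higher average at k = 1, while the "base" model has the higher average at k = 10, which is the crossover pattern the abstract describes.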
SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth AAAI 2026
The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0-6), middle childhood (7-12), and adolescence (13-18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.
comment: Accepted in AAAI 2026 Workshop on AI for Education
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage ACL
Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task, such as a masked language model task or an element lookup by position task, to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on the AdvBench dataset, with the masked language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and harmful score (HS) of 4.57, and with the element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and HS of 4.43.
comment: ACL Findings 2025. Welcome to employ SATA as a baseline
Can Large Language Models Detect Misinformation in Scientific News Reporting?
Scientific facts are often spun in the popular press with the intent to influence public opinion and action, as was evidenced during the COVID-19 pandemic. Automatic detection of misinformation in the scientific domain is challenging because of the distinct styles of writing in these two media types and is still in its nascence. Most research on the validity of scientific reporting treats this problem as a claim verification challenge. In doing so, significant expert human effort is required to generate appropriate claims. Our solution bypasses this step and addresses a more real-world scenario where such explicit, labeled claims may not be available. The central research question of this paper is whether it is possible to use large language models (LLMs) to detect misinformation in scientific reporting. To this end, we first present a new labeled dataset, SciNews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources, paired with related abstracts from the CORD-19 database. Our dataset includes both human-written and LLM-generated news articles, making it more comprehensive in terms of capturing the growing trend of using LLMs to generate popular press articles. Then, we identify dimensions of scientific validity in science news articles and explore how these can be integrated into the automated detection of scientific misinformation. We propose several baseline architectures using LLMs to automatically detect false representations of scientific findings in the popular press. For each of these architectures, we use several prompt engineering strategies including zero-shot, few-shot, and chain-of-thought prompting. We also test these architectures and prompting strategies on GPT-3.5, GPT-4, Llama2-7B, and Llama2-13B.
GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
The sparse Mixture-of-Experts (MoE) architecture of large language models (LLMs) confronts an inherent issue of load imbalance arising from the simplistic linear router strategy, which ultimately causes instability and inefficient learning in LLMs. To address this challenge, we introduce a novel MoE graph-based framework, GMoE, aimed at enhancing the collaboration among multiple experts. In GMoE, a graph router function is designed to capture the collaboration signals among experts. This enables all experts to dynamically allocate information derived from input data by sharing information with their neighboring experts. Moreover, we put forward two coordination strategies in GMoE: the Poisson distribution-based distinction strategy and the Normal distribution-based balance strategy, to further release the capacity of each expert and increase the model stability in the fine-tuning of LLMs. Specifically, we leverage a parameter-efficient fine-tuning technique, i.e., Low-Rank Adaptation (LoRA), to implement the graph MoE architecture. Extensive experiments on four real-world benchmark datasets demonstrate the effectiveness of GMoE, showing the benefits of facilitating collaborations of multiple experts in LLM fine-tuning. The code of experimental implementation is available at https://github.com/BAI-LAB/GMoE
comment: 9 pages, 25 figures
Ellipsoid-Based Decision Boundaries for Open Intent Classification
Textual open intent classification is crucial for real-world dialogue systems, enabling robust detection of unknown user intents without prior knowledge and contributing to the robustness of the system. While adaptive decision boundary methods have shown great potential by eliminating manual threshold tuning, existing approaches assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions. To address this limitation, we propose EliDecide, a novel method that learns ellipsoid decision boundaries with varying scales along different feature directions. First, we employ supervised contrastive learning to obtain a discriminative feature space for known samples. Second, we apply learnable matrices to parameterize ellipsoids as the boundaries of each known class, offering greater flexibility than spherical boundaries defined solely by centers and radii. Third, we optimize the boundaries via a newly designed dual loss function that balances empirical and open-space risks: expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples. Our method achieves state-of-the-art performance on multiple text intent benchmarks and further on a question classification dataset. The flexibility of the ellipsoids demonstrates superior open intent detection capability and strong potential for generalization to more text classification tasks in diverse complex open-world scenarios.
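To make the difference between a ball and an ellipsoid boundary concrete, the sketch below tests whether a feature vector x falls inside an ellipsoid written as ||A(x - c)||^2 <= 1, where c is a class center and A is a matrix. This parameterization, the class names, and all numbers are illustrative assumptions; the abstract does not specify EliDecide's exact form, and no training or loss is shown. A ball of radius r is the special case A = I / r.

```python
def quadratic_form(x, center, A):
    """Return ||A (x - center)||^2 for a point, a center, and a matrix A."""
    d = [xi - ci for xi, ci in zip(x, center)]
    z = [sum(a * dj for a, dj in zip(row, d)) for row in A]
    return sum(v * v for v in z)


def classify(x, classes):
    """Assign x to the known class whose ellipsoid contains it most deeply.

    `classes` maps a label to (center, A). If x lies outside every
    ellipsoid, it is reported as the open (unknown) intent.
    """
    best_label, best_value = "open", float("inf")
    for label, (center, A) in classes.items():
        value = quadratic_form(x, center, A)
        if value <= 1.0 and value < best_value:
            best_label, best_value = label, value
    return best_label


# Hypothetical 2-D classes: "book_flight" is stretched along the first
# axis (semi-axes 2 and 1); "play_music" is a unit ball around (5, 0).
classes = {
    "book_flight": ((0.0, 0.0), [[0.5, 0.0], [0.0, 1.0]]),
    "play_music": ((5.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]),
}

print(classify((1.5, 0.0), classes))  # inside the stretched direction
print(classify((0.0, 1.5), classes))  # same distance, narrow direction
```

The two query points are equally far from the "book_flight" center, yet only the one along the stretched axis is accepted, which a spherical boundary could not express.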
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to their answer types and applies a different rewriting and verification scheme to each type. When applying ReVeL to RFT, we convert 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
comment: Project url: https://flageval-baai.github.io/ReVeL/
Personalized LLM Decoding via Contrasting Personal Preference EMNLP 2025
As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.
comment: EMNLP 2025 Main
Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning
Large Language Models (LLMs) equipped with modern Retrieval-Augmented Generation (RAG) systems often employ multi-turn interaction pipelines to interface with search engines for complex reasoning tasks. However, such multi-turn interactions inevitably produce long intermediate contexts, as context length grows exponentially with exploration depth. This leads to a well-known limitation of LLMs: their difficulty in effectively leveraging information from long contexts. This problem is further amplified in RAG systems that depend on in-context learning, where few-shot demonstrations must also be included in the prompt, compounding the context-length bottleneck. To address these challenges, we propose Mujica-MyGO, a unified framework for efficient multi-turn reasoning in RAG. Inspired by the divide-and-conquer principle, we introduce Mujica (Multi-hop Joint Intelligence for Complex Question Answering), a multi-agent RAG workflow that decomposes multi-turn interactions into cooperative sub-interactions, thereby mitigating long-context issues. To eliminate the dependency on in-context learning, we further develop MyGO (Minimalist Policy Gradient Optimization), a lightweight and efficient reinforcement learning algorithm that enables effective post-training of LLMs within complex RAG pipelines. We provide theoretical guarantees for MyGO's convergence to the optimal policy. Empirical evaluations across diverse question-answering benchmarks, covering both text corpora and knowledge graphs, show that Mujica-MyGO achieves superior performance.
Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically employ a single window length across all attention heads and input sizes. However, this uniform approach fails to capture the heterogeneous attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose *Mixture of Attention Spans* (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sizes. It profiles the model, evaluates potential configurations, and pinpoints the optimal length configurations for each head. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer inputs, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over the uniform-window baseline across Vicuna-{7B, 13B} and Llama3-{8B, 70B} models. Moreover, MoA narrows the performance gap with full attention, reducing the maximum relative performance drop from 9%-36% to within 5% across three long-context understanding benchmarks. MoA achieves a 1.2-1.4x GPU memory reduction, boosting decode throughput by 6.6-8.2x and 1.7-1.9x over FlashAttention2 and vLLM, with minimal performance impact. Our code is available at: https://github.com/thu-nics/MoA
comment: Published at CoLM'25
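As a rough illustration of heterogeneous sliding-window lengths, the sketch below builds a causal sliding-window mask per attention head, with each head's window derived from the input length by a rule of the form alpha + beta * seq_len. The linear rule, the head names, and the coefficients are assumptions made for this example; the abstract only says that MoA searches over window lengths and their scaling rules relative to input size.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal mask: query q may attend to key k iff 0 <= q - k < window."""
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


def window_for(seq_len: int, alpha: float, beta: float) -> int:
    """Hypothetical scaling rule: window = alpha + beta * seq_len, clamped."""
    return max(1, min(seq_len, int(alpha + beta * seq_len)))


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs the mask keeps, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


# Two illustrative heads: one keeps a fixed local window, the other
# widens its window as the input grows.
head_rules = {"local_head": (2, 0.0), "expanding_head": (0, 0.5)}

seq_len = 8
masks = {
    name: sliding_window_mask(seq_len, window_for(seq_len, alpha, beta))
    for name, (alpha, beta) in head_rules.items()
}
for name, mask in masks.items():
    print(name, attended_pairs(mask))
```

Under these assumed rules the local head keeps a window of 2 at any input length, while the expanding head's window grows to half the sequence, mirroring the two head behaviors the abstract reports.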
The magnitude of categories of texts enriched by language models
The purpose of this article is twofold. Firstly, we use the next-token probabilities given by a language model to explicitly define a category of texts in natural language enriched over the unit interval, in the sense of Bradley, Terilla, and Vlassopoulos. We consider explicitly the terminating conditions for text generation and determine when the enrichment itself can be interpreted as a probability over texts. Secondly, we compute the Möbius function and the magnitude of an associated generalized metric space of texts. The magnitude function of that space is a sum over texts (prompts) of the $t$-logarithmic (Tsallis) entropies of the next-token probability distributions associated with each prompt, plus the cardinality of the model's possible outputs. A suitable evaluation of the magnitude function's derivative recovers a sum of Shannon entropies, which justifies seeing magnitude as a partition function. Following Leinster and Shulman, we also express the magnitude function of the generalized metric space as an Euler characteristic of magnitude homology and provide an explicit description of the zeroth and first magnitude homology groups.
comment: 26 pages
Large language models replicate and predict human cooperation across experiments in game theory
Large language models (LLMs) are increasingly used both to make decisions in domains such as health, education and law, and to simulate human behavior. Yet how closely LLMs mirror actual human decision-making remains poorly understood. This gap is critical: misalignment could produce harmful outcomes in practical applications, while failure to replicate human behavior renders LLMs ineffective for social simulations. Here, we address this gap by developing a digital twin of game-theoretic experiments and introducing a systematic prompting and probing framework for machine-behavioral evaluation. Testing three open-source models (Llama, Mistral and Qwen), we find that Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory, while Qwen aligns closely with Nash equilibrium predictions. Notably, we achieved population-level behavioral replication without persona-based prompting, simplifying the simulation process. Extending beyond the original human-tested games, we generate and preregister testable hypotheses for novel game configurations outside the original parameter grid. Our findings demonstrate that appropriately calibrated LLMs can replicate aggregate human behavioral patterns and enable systematic exploration of unexplored experimental spaces, offering a complementary approach to traditional research in the social and behavioral sciences that generates new empirical predictions about human social decision-making.
Gram2Vec: An Interpretable Document Vectorizer
We present Gram2Vec, a grammatical style embedding system that embeds documents into a higher dimensional space by extracting the normalized relative frequencies of grammatical features present in the text. Compared to neural approaches, Gram2Vec offers inherent interpretability based on how the feature vectors are generated. In this paper, we use authorship verification and AI detection as two applications to show how Gram2Vec can be used. For authorship verification, we use the features from Gram2Vec to explain why a pair of documents is by the same or by different authors. We also demonstrate how Gram2Vec features can be used to train a classifier for AI detection, outperforming machine learning models trained on a comparable set of Biber features.
comment: 8 pages, 1 figure
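The idea of embedding a document as relative frequencies of grammatical features can be sketched in a few lines. The block below assumes the text has already been tagged (the tag names and the tiny feature vocabulary are placeholders, not Gram2Vec's actual feature set) and simply divides each feature count by the total count over the vocabulary; any further normalization the system applies is not shown.

```python
from collections import Counter


def relative_frequencies(tags: list[str], vocabulary: list[str]) -> list[float]:
    """Map a sequence of feature tags to one relative frequency per feature."""
    counts = Counter(tags)
    total = sum(counts[t] for t in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[t] / total for t in vocabulary]


# Hypothetical pre-tagged document and feature vocabulary.
vocabulary = ["NOUN", "VERB", "ADJ", "ADV"]
document_tags = ["NOUN", "VERB", "NOUN", "ADJ"]

vector = relative_frequencies(document_tags, vocabulary)
for feature, value in zip(vocabulary, vector):
    print(feature, value)
```

Because every dimension corresponds to a named feature, a difference between two document vectors can be read off directly, which is the kind of interpretability the abstract contrasts with neural embeddings.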
Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection
Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel, a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi for analysis, interpretation and further dissemination to local agencies for timely intervention. From April 2022 till date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 events were shortlisted by the public health experts at NCDC as potential outbreaks.
Human-Computer Interaction
Human-AI Teaming Under Deception: An Implicit BCI Safeguards Drone Team Performance in Virtual Reality
Human-AI teams can be vulnerable to catastrophic failure when feedback from the AI is incorrect, especially under high cognitive workload. Traditional team aggregation methods, such as voting, are susceptible to these AI errors, which can actively bias the behaviour of each individual and inflate the likelihood of an erroneous group decision. We hypothesised that a collaborative Brain-Computer Interface (cBCI) using neural activity collected before a behavioural decision is made can provide a source of information that is decoupled from this biased behaviour, thereby protecting the team from the deleterious influence of AI error. We tested this in a VR drone surveillance task where teams of operators faced high workload and systematically misleading AI cues, comparing traditional behaviour-based team strategies against a purely Neuro-Decoupled Team (NDT) that used only BCI confidence scores derived from pre-response EEG. Under AI deception, behaviour-based teams catastrophically failed, with Majority Vote accuracy collapsing to 44%. The NDT, however, maintained 98% accuracy, a statistically significant synergistic gain over even the team's best individual performer (p < .001). This was explained by a neuro-behavioural decoupling, where the BCI's predictions remained highly accurate while the operator's subjective confidence became an unreliable signal. We conclude that an implicit BCI provides resilience by learning to adapt its neural strategy, shifting from relying on signals of efficient, autopilot processing in simple conditions to interpreting signatures of effortful deliberation when confronted with cognitive conflict. This demonstrates a system that leverages the context of the neural signal to defend against AI-induced error in high-stakes environments.
comment: 30 pages, 19 figures
Facilitating the Integration of LLMs Into Online Experiments With Simple Chat
As large language models (LLMs) become increasingly prevalent, understanding human-LLM interactions is emerging as a central priority in psychological research. Online experiments offer an efficient means to study human-LLM interactions, yet integrating LLMs into established survey platforms remains technically demanding, particularly when aiming for ecologically valid, real-time conversational experiences with strong experimental control. We introduce Simple Chat, an open-source, research-focused chat interface that streamlines LLM integration for platforms such as Qualtrics, oTree, and LimeSurvey, while presenting a unified participant experience across conditions. Simple Chat connects to both commercial providers and open-weights models, supports streaming responses to preserve conversational flow, and offers an administrative interface for fine-grained control of prompts and interface features. By reducing technical barriers, standardizing interfaces, and improving participant experience, Simple Chat helps advance the study of human-LLM interaction. In this article, we outline Simple Chat's key features, provide a step-by-step tutorial, and demonstrate its utility through two illustrative case studies.
LLM Chatbots in High School Programming: Exploring Behaviors and Interventions
This study uses a Design-Based Research (DBR) cycle to refine the integration of Large Language Models (LLMs) in high school programming education. The initial problem was identified in an Intervention Group where, in an unguided setting, a higher proportion of executive, solution-seeking queries correlated strongly and negatively with exam performance. A contemporaneous Comparison Group demonstrated that without guidance, these unproductive help-seeking patterns do not self-correct, with engagement fluctuating and eventually declining. This insight prompted a mid-course pedagogical intervention in the first group, designed to teach instrumental help-seeking. The subsequent evaluation confirmed the intervention's success, revealing a decrease in executive queries, as well as a shift toward more productive learning workflows. However, this behavioral change did not translate into a statistically significant improvement in exam grades, suggesting that altering tool-use strategies alone may be insufficient to overcome foundational knowledge gaps. The DBR process thus yields a more nuanced principle: the educational value of an LLM depends on a pedagogy that scaffolds help-seeking, but this is only one part of the complex process of learning.
REFLECTing SPERET: Measuring and Promoting Ethics and Privacy Reflexivity in Eye-Tracking Research
The proliferation of eye tracking in high-stakes domains - such as healthcare, marketing and surveillance - underscores the need for researchers to be ethically aware when employing this technology. Although privacy and ethical guidelines have emerged in recent years, empirical research on how scholars reflect on their own work remains scarce. To address this gap, we present two complementary instruments developed with input from more than 70 researchers: REFLECT, a qualitative questionnaire, and SPERET (Latin for "hope"), a quantitative psychometric scale that measures privacy and ethics reflexivity in eye tracking. Our findings reveal a research community that is concerned about user privacy, cognisant of methodological constraints, such as sample bias, and that possesses a nuanced sense of ethical responsibility evolving with project maturity. Together, these tools and our analyses offer a systematic examination and a hopeful outlook on reflexivity in eye-tracking research, promoting more privacy and ethics-conscious practice.
comment: 15 pages
MoodBench 1.0: An Evaluation Benchmark for Emotional Companionship Dialogue Systems
With the rapid development of Large Language Models, dialogue systems are shifting from information tools to emotional companions, heralding the era of Emotional Companionship Dialogue Systems (ECDs) that provide personalized emotional support for users. However, the field lacks clear definitions and systematic evaluation standards for ECDs. To address this, we first propose a definition of ECDs with formal descriptions. Then, based on this theory and the design principle of "Ability Layer-Task Layer (three level)-Data Layer-Method Layer", we design and implement the first ECD evaluation benchmark - MoodBench 1.0. Through extensive evaluations of 30 mainstream models, we demonstrate that MoodBench 1.0 has excellent discriminant validity and can effectively quantify the differences in emotional companionship abilities among models. Furthermore, the results reveal current models' shortcomings in deep emotional companionship, guiding future technological optimization and significantly aiding developers in enhancing ECDs' user experience.
comment: 26 pages, 7 figures
Pre-Filtering Code Suggestions using Developer Behavioral Telemetry to Optimize LLM-Assisted Programming
Large Language Models (LLMs) are increasingly integrated into code editors to provide AI-powered code suggestions. Yet many of these suggestions are ignored, resulting in wasted computation, increased latency, and unnecessary interruptions. We introduce a lightweight pre-filtering model that predicts the likelihood of suggestion acceptance before invoking the LLM, using only real-time developer telemetry such as typing speed, file navigation, and editing activity. Deployed in a production-grade Visual Studio Code plugin over four months of naturalistic use, our approach nearly doubled acceptance rates (18.4% -> 34.2%) while suppressing 35% of low-value LLM calls. These findings demonstrate that behavioral signals alone can meaningfully improve both user experience and system efficiency in LLM-assisted programming, highlighting the value of timing-aware, privacy-preserving adaptation mechanisms. The filter operates solely on pre-invocation editor telemetry and never inspects code or prompts.
comment: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
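The pre-filtering idea in the abstract above can be sketched as a small gate in front of the LLM call. This is only an illustration: the abstract does not state the model family, so a logistic scorer is assumed here, and the feature names, weights, bias, and threshold are all hypothetical.

```python
import math

# Hypothetical telemetry features and weights (not taken from the paper).
WEIGHTS = {
    "typing_speed_cps": -0.15,      # fast typing: developer is mid-thought
    "secs_since_last_edit": 0.4,    # a pause may signal openness to a suggestion
    "recent_file_switches": -0.3,   # navigating between files, not writing
}
BIAS = -0.5


def acceptance_probability(telemetry: dict) -> float:
    """Score pre-invocation telemetry with an assumed logistic model."""
    z = BIAS + sum(WEIGHTS[name] * telemetry[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def should_invoke_llm(telemetry: dict, threshold: float = 0.3) -> bool:
    """Call the LLM only when predicted acceptance clears the threshold."""
    return acceptance_probability(telemetry) >= threshold
```

Note that the gate reads only editor telemetry, in line with the abstract's statement that the filter never inspects code or prompts.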
A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.
Optimizing LLM Code Suggestions: Feedback-Driven Timing with Lightweight State Bounds
Large Language Models (LLMs) have transformed code auto-completion by generating context-aware suggestions. Yet, deciding when to present these suggestions remains underexplored, often leading to interruptions or wasted inference calls. We propose an adaptive timing mechanism that dynamically adjusts the delay before offering a suggestion based on real-time developer feedback. Our suggested method combines a logistic transform of recent acceptance rates with a bounded delay range, anchored by a high-level binary prediction of the developer's cognitive state. In a two-month deployment with professional developers, our system improved suggestion acceptance from 4.9% with no delay to 15.4% with static delays, and to 18.6% with adaptive timing, while reducing blind rejections (rejections without being read) from 8.3% to 0.36%. Together, these improvements increase acceptance and substantially reduce wasted inference calls by 75%, making LLM-based code assistants more efficient and cost-effective in practice.
comment: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
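A minimal sketch of the timing rule described in the abstract above: a logistic transform of the recent acceptance rate, mapped into a delay range whose bounds depend on a binary cognitive-state prediction. The state labels, bound values, steepness, and midpoint are assumptions chosen for illustration, not values reported by the authors.

```python
import math

# Hypothetical delay bounds in seconds for each predicted cognitive state.
DELAY_BOUNDS = {"focused": (1.0, 4.0), "idle": (0.25, 1.5)}


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_delay(recent_outcomes, state, steepness=8.0, midpoint=0.2):
    """Return a delay in seconds before showing the next suggestion.

    recent_outcomes: list of booleans, True meaning the suggestion was accepted.
    state: key into DELAY_BOUNDS (the binary cognitive-state prediction).
    """
    lo, hi = DELAY_BOUNDS[state]
    if recent_outcomes:
        rate = sum(recent_outcomes) / len(recent_outcomes)
    else:
        rate = midpoint  # no feedback yet: start in the middle of the range
    # Weight lies in (0, 1) and shrinks as the acceptance rate rises,
    # so well-received suggestions are shown sooner.
    weight = 1.0 - logistic(steepness * (rate - midpoint))
    return lo + (hi - lo) * weight
```

Because the logistic weight stays strictly between 0 and 1, the delay never leaves the bounds for the current state, which is the role the abstract assigns to the bounded delay range.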
KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA)
Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource-limited settings. To address this challenge, we developed KOM, a multi-agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM demonstrated superior performance compared to several general-purpose large language models in imaging analysis and prescription generation. A randomized three-arm simulation study further revealed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and resulted in improved treatment quality compared to each approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI-assisted management systems for other chronic conditions.
IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants NeurIPS 2025
We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo
comment: Accepted to NeurIPS 2025 D&B Track. Project Page: https://indego-dataset.github.io/
Towards Synergistic Teacher-AI Interactions with Generative Artificial Intelligence
Generative artificial intelligence (GenAI) is increasingly used in education, posing significant challenges for teachers adapting to these changes. GenAI offers unprecedented opportunities for accessibility, scalability and productivity in educational tasks. However, the automation of teaching tasks through GenAI raises concerns about reduced teacher agency, potential cognitive atrophy, and the broader deprofessionalisation of teaching. Drawing findings from prior literature on AI in Education, and refining through a recent systematic literature review, this chapter presents a conceptualisation of five levels of teacher-AI teaming: transactional, situational, operational, praxical and synergistic teaming. The framework aims to capture the nuanced dynamics of teacher-AI interactions, particularly with GenAI, that may lead to the replacement, complementarity, or augmentation of teachers' competences and professional practice. GenAI technological affordances required in supporting teaming, along with empirical studies, are discussed. Drawing on empirical observations, we outline a future vision that moves beyond individual teacher agency toward collaborative decision-making between teachers and AI, in which both agents engage in negotiation, constructive challenge, and co-reasoning that enhance each other's capabilities and enable outcomes neither could realise independently. Further discussion of socio-technical factors beyond teacher-AI teaming is also included to streamline the synergy of teachers and AI in education ethically and practically.
comment: 18 pages, 6 pages
Using Wearable Devices to Improve Chronic Pain Treatment among Patients with Opioid Use Disorder
Chronic pain (CP) and opioid use disorder (OUD) are common and interrelated chronic medical conditions. Currently, there is a paucity of evidence-based integrated treatments for CP and OUD among individuals receiving medication for opioid use disorder (MOUD). Wearable devices have the potential to monitor complex patient information and inform treatment development for persons with OUD and CP, including pain variability (e.g., exacerbations of pain or pain spikes) and clinical correlates (e.g., perceived stress). However, the application of large language models (LLMs) with wearable data for understanding pain spikes remains unexplored. Consequently, the aim of this pilot study was to examine the clinical correlates of pain spikes using a range of AI approaches. We found that machine learning models achieved relatively high accuracy (>0.7) in predicting pain spikes, while LLMs were limited in providing insights on pain spikes. Real-time monitoring through wearable devices, combined with advanced AI models, could facilitate early detection of pain spikes and support personalized interventions that may help mitigate the risk of opioid relapse, improve adherence to MOUD, and enhance the integration of CP and OUD care. Given overall limited LLM performance, these findings highlight the need to develop LLMs which can provide actionable insights in the OUD/CP context.
FocusView: Understanding and Customizing Informational Video Watching Experiences for Viewers with ADHD
While videos have become increasingly prevalent in delivering information across different educational and professional contexts, individuals with ADHD often face attention challenges when watching informational videos due to the dynamic, multimodal, yet potentially distracting video elements. To understand and address this critical challenge, we designed FocusView, a video customization interface that allows viewers with ADHD to customize informational videos from different aspects. We evaluated FocusView with 12 participants with ADHD and found that FocusView significantly improved the viewability of videos by reducing distractions. Through the study, we uncovered participants' diverse perceptions of video distractions (e.g., background music as a distraction vs. stimulation boost) and their customization preferences, highlighting unique ADHD-relevant needs in designing video customization interfaces (e.g., reducing the number of options to avoid distraction caused by customization itself). We further derived design considerations for future video customization systems for the ADHD community.
comment: 15 pages, 12 figures
Forgetful by Design? A Critical Audit of YouTube's Search API for Academic Research
This paper critically audits the search endpoint of YouTube's Data API (v3), a common tool for academic research. Through systematic weekly searches over six months using eleven queries, we identify major limitations regarding completeness, representativeness, consistency, and bias. Our findings reveal substantial differences between ranking parameters like relevance and date in terms of video recall and precision, with relevance often retrieving numerous off-topic videos. We also observe severe temporal decay in video discoverability: the number of retrievable videos for a given period drops dramatically within just 20-60 days of publication, even though these videos remain on the platform. This potentially undermines research designs that rely on systematic data collection. Furthermore, search results lack consistency, with identical queries yielding different video sets over time, compromising replicability. A case study on the European Parliament elections highlights how these issues impact research outcomes. While the paper offers several mitigation strategies, it concludes that the API's search function, potentially prioritizing 'freshness' over comprehensive retrieval, is not adequate for robust academic research, especially concerning Digital Services Act requirements.
comment: 25 pages, 2 tables and 4 figures
Health App Reviews for Privacy & Trust (HARPT): A Corpus for Analyzing Patient Privacy Concerns, Trust in Providers and Trust in Applications
Background: User reviews of Telehealth and Patient Portal mobile applications (apps), hereafter referred to as electronic health (eHealth) apps, are a rich source of unsolicited patient feedback, revealing critical insights into patient perceptions. However, the lack of large-scale, annotated datasets specific to privacy and trust has limited the ability of researchers to systematically analyze these concerns using natural language processing (NLP) techniques. Objective: This study aims to develop and benchmark Health App Reviews for Privacy & Trust (HARPT), a large-scale annotated corpus of patient reviews from eHealth apps to advance research in patient privacy and trust. Methods: We employed a multistage data construction strategy. This integrated keyword-based filtering, iterative manual labeling with review, targeted data augmentation, and weak supervision using transformer-based classifiers. A curated subset of 7,000 reviews was manually annotated to support machine learning model development and evaluation. The resulting dataset was used to benchmark a broad range of models. Results: The HARPT corpus comprises 480,000 patient reviews annotated across seven categories capturing critical aspects of trust in the application (TA), trust in the provider (TP), and privacy concerns (PC). We provide comprehensive benchmark performance for a range of machine learning models on the manually annotated subset, establishing a baseline for future research. Conclusions: The HARPT corpus is a significant resource for advancing the study of privacy and trust in the eHealth domain. By providing a large-scale, annotated dataset and initial benchmarks, this work supports reproducible research in usable privacy and trust within health informatics. HARPT is released under an open resource license.
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.
comment: 9 pages, 10 figures, Proceedings of the 24th ACM Workshop on Hot Topics in Networks (HotNets 2025), College Park, Maryland, USA
GRAPHIC--Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity
Artificial Intelligence (AI) has been increasingly applied to creative domains, leading to the development of systems that collaborate with humans in design processes. In Graphic Design, integrating computational systems into co-creative workflows presents specific challenges, as it requires balancing scientific rigour with the subjective and visual nature of design practice. Following the PRISMA methodology, we identified 872 articles, resulting in a final corpus of 71 publications describing 68 unique systems. Based on this review, we introduce GRAPHIC (Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity), a framework for analysing computational systems applied to Graphic Design. Its goal is to understand how current systems support human-AI collaboration in the Graphic Design discipline. The framework comprises three main dimensions, which our analysis revealed to be essential across diverse system types: (1) Collaborative Panorama, (2) Processes and Modalities, and (3) Graphic Design Principles. Its application revealed research gaps, including the need to balance initiative and control between agents, improve communication through explainable interaction models, and promote systems that support transformational creativity grounded in core design principles.
comment: 20 pages, 16 figures
VeML: An End-to-End Machine Learning Lifecycle for Large-scale and High-dimensional Data
An end-to-end machine learning (ML) lifecycle consists of many iterative processes, from data preparation and ML model design to model training and then deploying the trained model for inference. When building an end-to-end lifecycle for an ML problem, many ML pipelines must be designed and executed, producing a huge number of lifecycle versions. Therefore, this paper introduces VeML, a Version management system dedicated to the end-to-end ML Lifecycle. Our system tackles several crucial problems that other systems have not solved. First, we address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional datasets. We solve this problem by proposing to transfer the lifecycle of similar datasets managed in our system to the new training data. We design an algorithm based on the core set to compute similarity for large-scale, high-dimensional data efficiently. Another critical issue is the degradation of model accuracy caused by the difference between training data and testing data during the ML lifetime, which leads to a lifecycle rebuild. Our system helps to detect this mismatch without requiring labeled testing data and to rebuild the ML lifecycle for a new data version. To demonstrate our contributions, we conduct experiments on real-world, large-scale datasets of driving images and spatiotemporal sensor data and show promising results.
comment: The updated version of this paper, titled "Efficient ML Lifecycle Transferring for Large-scale and High-dimensional Data via Core Set-based Dataset Similarity," has been accepted for publication in IEEE Access
PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback
Effective presentation skills are essential in education, professional communication, and public speaking, yet learners often lack access to high-quality exemplars or personalized coaching. Existing AI tools typically provide isolated functionalities such as speech scoring or script generation without integrating reference modeling and interactive feedback into a cohesive learning experience. We introduce a dual-agent system that supports presentation practice through two complementary roles: the Ideal Presentation Agent and the Coach Agent. The Ideal Presentation Agent converts user-provided slides into model presentation videos by combining slide processing, visual-language analysis, narration script generation, personalized voice synthesis, and synchronized video assembly. The Coach Agent then evaluates user-recorded presentations against these exemplars, conducting multimodal speech analysis and delivering structured feedback in an Observation-Impact-Suggestion (OIS) format. To enhance the authenticity of the learning experience, the Coach Agent incorporates an Audience Agent, which simulates the perspective of a human listener and provides humanized feedback reflecting audience reactions and engagement. Together, these agents form a closed loop of observation, practice, and feedback. Implemented on a robust backend with multi-model integration, voice cloning, and error handling mechanisms, the system demonstrates how AI-driven agents can provide engaging, human-centered, and scalable support for presentation skill development in both educational and professional contexts.
comment: 13 pages, 6 figures
Practitioners' Perspectives on a Differential Privacy Deployment Registry
Differential privacy (DP) -- a principled approach to producing statistical data products with strong, mathematically provable privacy guarantees for the individuals in the underlying dataset -- has seen substantial adoption in practice over the past decade. Applying DP requires making several implementation decisions, each with significant impacts on data privacy and/or utility. Hence, to promote shared learning and accountability around DP deployments, Dwork, Kohli, and Mulligan (2019) proposed a public-facing repository ("registry") of DP deployments. The DP community has recently started to work toward realizing this vision. We contribute to this effort by (1) developing a holistic, hierarchical schema to describe any given DP deployment and (2) designing and implementing an interactive interface to act as a registry where practitioners can access information about past DP deployments. We (3) populate our interface with 21 real-world DP deployments and (4) conduct an exploratory user study with DP practitioners ($n=16$) to understand how they would use the registry, as well as what challenges and opportunities they foresee around its adoption. We find that participants were enthusiastic about the registry as a valuable resource for evaluating prior deployments and making future deployments. They also identified several opportunities for the registry, including that it can become a "hub" for the community and support broader communication around DP (e.g., to legal teams). At the same time, they identified challenges around the registry gaining adoption, including the effort and risk involved with making implementation choices public and moderating the quality of entries. Based on our findings, we offer recommendations for encouraging adoption and increasing the registry's value not only to DP practitioners, but also to policymakers, data users, and data subjects.
Lessons Learned from Developing a Privacy-Preserving Multimodal Wearable for Local Voice-and-Vision Inference
Many promising applications of multimodal wearables require continuous sensing and heavy computation, yet users reject such devices due to privacy concerns. This paper shares our experiences building an ear-mounted voice-and-vision wearable that performs local AI inference using a paired smartphone as a trusted personal edge. We describe the hardware-software co-design of this privacy-preserving system, including challenges in integrating a camera, microphone, and speaker within a 30-gram form factor, enabling wake word-triggered capture, and running quantized vision-language and large-language models entirely offline. Through iterative prototyping, we identify key design hurdles in power budgeting, connectivity, latency, and social acceptability. Our initial evaluation shows that fully local multimodal inference is feasible on commodity mobile hardware with interactive latency. We conclude with design lessons for researchers developing embedded AI systems that balance privacy, responsiveness, and usability in everyday settings.
Exploring Families' Use and Mediation of Generative AI: A Multi-User Perspective
Generative AI (GenAI) platforms, such as ChatGPT, have gained popularity among the public due to their ease of access, use, and support of educational and creative activities. Despite these benefits, GenAI poses unique risks for families, such as lacking sufficient safeguards tailored to protect children under 13 years of age and not offering parental control features. This study explores how parents mediate their children's use of GenAI and the factors/processes that shape this mediation. Through analyzing semi-structured interviews with 12 families, we identified ways in which families used and mediated GenAI and factors that influenced parents' GenAI mediation strategies. We contextualize our findings with a modified model of family mediation strategies, drawing from previous family media and mediation frameworks. We provide insights for future research on Family-GenAI interactions and highlight the need for more robust protective measures on GenAI platforms for families.
Final Happiness: What Intelligent User Interfaces Can Do for The Lonely Dying
This study explores the design of Intelligent User Interfaces (IUIs) to address the profound existential loneliness of terminally ill individuals. While Human-Computer Interaction (HCI) has made inroads in "Thanatechnology," current research often focuses on practical aspects like digital legacy management, overlooking the subjective, existential needs of those facing death in isolation. To address this gap, we conducted in-depth qualitative interviews with 14 lonely, terminally ill individuals. Our core contributions are: (1) An empirically-grounded model articulating the complex psychological, practical, social, and spiritual needs of this group; (2) The "Three Pillars, Twelve Principles" framework for designing IUIs as "Existential Companions"; and (3) A critical design directive derived from user evaluations: technology in this context should aim for transcendence over simulation. The findings suggest that IUIs should create experiences that augment or surpass human capabilities, rather than attempting to simulate basic human connections, which can paradoxically deepen loneliness. This research provides a clear, user-centered path for designing technology that serves not as a "tool for dying," but as a "partner for living fully until the end".
Pneumatic Multi-mode Silicone Actuator with Pressure, Vibration, and Cold Thermal Feedback
A wide range of haptic feedback is crucial for achieving high realism and immersion in virtual environments. Therefore, a multi-modal haptic interface that provides various haptic signals simultaneously is highly beneficial. This paper introduces a novel silicone fingertip actuator that is pneumatically actuated, delivering a realistic and effective haptic experience by simultaneously providing pressure, vibrotactile, and cold thermal feedback. The actuator features a design with multiple air chambers, each with controllable volume achieved through pneumatic valves connected to compressed air tanks. The lower air chamber generates pressure feedback, while the upper chamber produces vibrotactile feedback. In addition, two integrated lateral air nozzles create a cold thermal sensation. To showcase the system's capabilities, we designed two unique 3D surfaces in the virtual environment: a frozen meat surface and an abrasive icy surface. These surfaces simulate tactile perceptions of coldness, pressure, and texture. Comprehensive performance assessments and user studies were conducted to validate the actuator's effectiveness, highlighting its diverse feedback capabilities compared to traditional actuators that offer only single feedback modalities.
CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications
General-purpose VLMs demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed literature, and demonstrate its clinical utility versus GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter LLaVA-Next model. In a blinded, randomized trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgery consultations were assigned to either CNS-Obsidian or a HIPAA-compliant GPT-4o endpoint as diagnostic co-pilot after consultations. Primary outcomes were diagnostic helpfulness and accuracy, assessed via user ratings and presence of correct diagnosis within the VLM-provided differential. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o's 65.70% (p<10^-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults (7.3% utilization). CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance despite being orders of magnitude smaller and less expensive to train. This establishes a transparent framework for scientific communities to build specialized AI models.
Image and Video Processing
Shape-Adapting Gated Experts: Dynamic Expert Routing for Colonoscopic Lesion Segmentation
The substantial diversity in cell scale and form remains a primary challenge in computer-aided cancer detection on gigapixel Whole Slide Images (WSIs), attributable to cellular heterogeneity. Existing CNN-Transformer hybrids rely on static computation graphs with fixed routing, which consequently causes redundant computation and limits their adaptability to input variability. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures. SAGE's dual-path design features a backbone stream that preserves representation and selectively activates an expert path through hierarchical gating. This gating mechanism operates at multiple hierarchical levels, performing a two-level, hierarchical selection between shared and specialized experts to modulate model logits for Top-K activation. Our Shape-Adapting Hub (SA-Hub) harmonizes structural and semantic representations across the CNN and the Transformer module, effectively bridging diverse modules. Embodied as SAGE-UNet, our model achieves superior segmentation on three medical benchmarks: EBHI, DigestPath, and GlaS, yielding state-of-the-art Dice Scores of 95.57%, 95.16%, and 94.17%, respectively, and robustly generalizes across domains by adaptively balancing local refinement and global context. SAGE provides a scalable foundation for dynamic expert routing, enabling flexible visual reasoning.
Speed Control Security System For safety of Driver and Surroundings
The speed control security system is designed to slow a vehicle down during rash driving. When the driver is overspeeding, the circuit captures images of the lanes, which determine the speed limit of the road the car is currently on. This input is passed to the ESP-32 microprocessor module in the car, which combines it with the data received from the car's RPM sensor and decides whether the car is overspeeding. In case of overspeeding, the ESP sends a signal to the Arduino, which actuates a DC motor that drives a hydraulic brake system to reduce the speed of the car.
comment: 9 pages, 7 figures
SPASHT: An image-enhancement method for sparse-view MPI SPECT
Single-photon emission computed tomography for myocardial perfusion imaging (MPI SPECT) is a widely used diagnostic tool for coronary artery disease. However, the procedure requires considerable scanning time, leading to patient discomfort and the potential for motion-induced artifacts. Reducing the number of projection views while keeping the time per view unchanged provides a mechanism to shorten the scanning time. However, this approach leads to increased sampling artifacts, higher noise, and hence limited image quality. To address these issues, we propose sparse-view SPECT image enhancement (SPASHT), inherently training the algorithm to improve performance on defect-detection tasks. We objectively evaluated SPASHT on the clinical task of detecting perfusion defects in a retrospective clinical study using data from patients who underwent MPI SPECT, where the defects were clinically realistic and synthetically inserted. The study was conducted for several reduced numbers of projection views, including 1/6, 1/3, and 1/2 of the typical number of projection views for MPI SPECT. Performance on the detection task was quantified using area under the receiver operating characteristic curve (AUC). Images obtained with SPASHT yielded significantly improved AUC compared to those obtained with the sparse-view protocol for all the reduced view counts considered. To further assess performance, a human observer study on the task of detecting perfusion defects was conducted. Results from the human observer study showed improved detection performance with images reconstructed using SPASHT compared to those from the sparse-view protocol. The results provide evidence of the efficacy of SPASHT in improving the quality of sparse-view MPI SPECT images and motivate further clinical validation.
comment: My advisor does not agree on the publication of this paper
Multimedia
Self-Empowering VLMs: Achieving Hierarchical Consistency via Self-Elicited Knowledge Distillation
Vision-language models (VLMs) possess rich knowledge but often fail on hierarchical understanding tasks, where the goal is to predict a coarse-to-fine taxonomy path that remains consistent across all levels. We compare three inference paradigms for hierarchical VQA and find that stepwise reasoning, when conditioned on prior answers, significantly outperforms single-pass prompting. Further analysis indicates that the main limitation of current VLMs is their inability to maintain cross-level state, rather than a lack of taxonomic knowledge. Motivated by this diagnosis, we propose Self-Elicited Knowledge Distillation (SEKD), which requires no human labels or external tools: the same VLM is prompted to reason step by step and act as a teacher by exposing its hard labels, soft distributions, and decoder hidden states, while a single-pass student distills these signals. The student VLM remains efficient while approaching the accuracy of its multi-step teacher. It improves in-domain path consistency (HCA) by up to +29.50 percentage points, raises zero-shot HCA on an unseen taxonomy from 4.15% to 42.26%, and yields gains on challenging mathematical benchmarks. Because all supervision is self-elicited, SEKD scales to new taxonomies and datasets without annotation cost, providing a practical route to imbue compact VLMs with dependency-aware multi-step reasoning.
comment: 21 pages, 18 tables, 6 figures
Tu crois que c'est vrai ? Diversite des regimes d'enonciation face aux fake news et mecanismes d'autoregulation conversationnelle
This thesis addresses two paradoxes: (1) why empirical studies find that fake news represent only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence énonciative) or interventions ('points d'arrêt') that express disagreement or corrections. Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.
comment: in French language
SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.
Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing a Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially consistent images while an MLLM produces corresponding textual descriptions. The automated construction of the SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, markedly enhance the spatial perception and reasoning abilities of MLLMs, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
Computation and Language
Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and Deepseek R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average) with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (https://isoft.cnu.ac.kr/csat2026/).
comment: 52 pages, Korean
No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increasing model bias and decreasing general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
Majority of the Bests: Improving Best-of-N via Bootstrapping
Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (Self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieve higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards, it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN's outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results for the consistency of the bootstrapping. MoB serves as a simple, yet strong alternative to BoN and self-consistency, and more broadly, motivates further research in more nuanced selection mechanisms.
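The selection rule described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the `subset_size` and `num_bootstrap` parameters, and the choice to resample with replacement are all assumptions, since the abstract does not specify the resampling scheme.

```python
import random
from collections import Counter

def best_of_n(candidates):
    """Plain Best-of-N: return the answer of the highest-reward candidate.

    `candidates` is a list of (answer, reward) pairs.
    """
    return max(candidates, key=lambda c: c[1])[0]

def majority_of_bests(candidates, subset_size, num_bootstrap=1000, seed=0):
    """Hypothetical sketch of a bootstrapped BoN mode estimate.

    Repeatedly resample `subset_size` candidates with replacement, run
    Best-of-N on each resample, and return the most frequent winner,
    i.e. the empirical mode of the BoN output distribution.
    """
    rng = random.Random(seed)
    winners = Counter()
    for _ in range(num_bootstrap):
        resample = [rng.choice(candidates) for _ in range(subset_size)]
        winners[best_of_n(resample)] += 1
    return winners.most_common(1)[0][0]

# Toy data (made up): one wrong answer with an inflated reward,
# nine correct answers with slightly lower rewards.
toy_candidates = [("7", 0.99)] + [("42", 0.90)] * 9
bon_choice = best_of_n(toy_candidates)
mob_choice = majority_of_bests(toy_candidates, subset_size=3)
```

On this toy input, plain BoN is misled by the single over-rewarded sample, whereas the bootstrapped mode is not, because that sample appears in only a minority of the small resamples. The outcome depends on the assumed `subset_size`; larger resamples include the outlier more often.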
OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
comment: 30 pages, 5 figures, 8 tables. Dataset available at https://huggingface.co/datasets/mjbommar/opengloss-dictionary
Prompt Optimization as a State-Space Search Problem
Language Models are extremely susceptible to performance collapse with even small changes to input prompt strings. Libraries such as DSPy (from Stanford NLP) avoid this problem through demonstration-based prompt optimisation. Inspired by this, I propose an alternative approach that treats prompt optimisation as a classical state-space search problem. I model the prompt space as a graph where nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or re-ordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarisation, reasoning, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve upon seed prompts on development sets. For instance, beam search achieves development accuracy gains from 0.40 to 0.80 on reasoning tasks, though test set improvements are more modest (0.20 to 0.50), indicating overfitting to the development heuristic. Analysis of successful optimisation paths reveals that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimisation as a search problem and suggest that with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at https://github.com/MaanasTaneja/PromptOptimiser.
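The search formulation in this abstract can be illustrated with a small beam search over prompt strings. This is a sketch under stated assumptions, not the code from the linked repository: the two operators and the `score` callback are hypothetical stand-ins for the paper's transformations and its development-set evaluation.

```python
from typing import Callable, List

Operator = Callable[[str], str]

def shorten(prompt: str) -> str:
    """Hypothetical 'concise' operator: drop a politeness filler."""
    return prompt.replace("Please ", "")

def add_example(prompt: str) -> str:
    """Hypothetical 'add example' operator: append one demonstration."""
    return prompt + " Example: 2+2=4."

def beam_search(seed: str, operators: List[Operator],
                score: Callable[[str], float],
                beam_width: int = 2, depth: int = 2) -> str:
    """Explore prompt states reachable from `seed` and return the best one.

    Nodes are prompt strings, edges are operator applications, and
    `score` stands in for accuracy on a development set (assumption).
    """
    beam = [seed]
    best, best_score = seed, score(seed)
    for _ in range(depth):
        # Expand every prompt in the beam with every operator.
        candidates = {op(p) for p in beam for op in operators}
        # Rank by score; break ties on the string for determinism.
        ranked = sorted(candidates, key=lambda p: (-score(p), p))
        beam = ranked[:beam_width]  # prune unpromising branches
        if beam and score(beam[0]) > best_score:
            best, best_score = beam[0], score(beam[0])
    return best
```

In practice `score` would call the model on held-out development examples, which is the expensive step; caching scores per prompt string would avoid the repeated evaluations this sketch performs.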
A Unified BERT-CNN-BiLSTM Framework for Simultaneous Headline Classification and Sentiment Analysis of Bangla News
In our daily lives, newspapers are an essential information source that impacts how the public talks about present-day issues. However, effectively navigating the vast amount of news content from different newspapers and online news portals can be challenging. Newspaper headlines with sentiment analysis tell us what the news is about (e.g., politics, sports) and how the news makes us feel (positive, negative, neutral). This helps us quickly understand the emotional tone of the news. This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis applying Natural Language Processing (NLP) techniques, particularly the hybrid transfer learning model BERT-CNN-BiLSTM. We have explored a dataset called BAN-ABSA of 9014 news headlines; this is the first time it has been used for simultaneous headline and sentiment categorization of Bengali newspapers. Over this imbalanced dataset, we applied two experimental strategies: technique-1, where undersampling and oversampling are applied before splitting, and technique-2, where undersampling and oversampling are applied after splitting. In technique-1, oversampling provided the strongest performance for both headline and sentiment classification, at 78.57% and 73.43% respectively, while technique-2 delivered its highest results when trained directly on the original imbalanced dataset, at 81.37% and 64.46% for headline and sentiment respectively. The proposed model BERT-CNN-BiLSTM significantly outperforms all baseline models in classification tasks, and achieves new state-of-the-art results for Bangla news headline classification and sentiment analysis. These results demonstrate the importance of leveraging both the headline and sentiment datasets, and provide a strong baseline for Bangla text classification in low-resource settings.
A Benchmark for Zero-Shot Belief Inference in Large Language Models
Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.
comment: 28 pages, 5 figures
Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints -- such as input size limits and acceptance rates -- play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.
Dealing with the Hard Facts of Low-Resource African NLP
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
comment: 10 pages, 4 figures
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with performance improving from single-digit to over 95% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
For Those Who May Find Themselves on the Red Team
This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. While doing so will involve ideological struggle, if not out-right complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site at which this struggle could take place, I suggest, is the red team.
MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
Demand for mental health support through AI chatbots is surging, though current systems present several limitations, such as sycophancy, overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with PhD-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.
InstructAudio: Unified speech and music generation with natural language instruction
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to model jointly with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based (natural language descriptions) control of acoustic attributes including timbre (gender, age), paralinguistic (emotion, style, accent), and musical (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves optimal results on most metrics. To the best of our knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/
Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems AAAI 2026
The rapid advancement of Large Language Model (LLM)-driven multi-agent systems has significantly streamlined software development tasks, enabling users with little technical expertise to develop executable applications. While these systems democratize software creation through natural language requirements, they introduce significant security risks that remain largely unexplored. We identify two risky scenarios: Malicious User with Benign Agents (MU-BA) and Benign User with Malicious Agents (BU-MA). We introduce the Implicit Malicious Behavior Injection Attack (IMBIA), demonstrating how multi-agent systems can be manipulated to generate software with concealed malicious capabilities beneath seemingly benign applications, and propose Adv-IMBIA as a defense mechanism. Evaluations across the ChatDev, MetaGPT, and AgentVerse frameworks reveal varying vulnerability patterns, with IMBIA achieving attack success rates of 93%, 45%, and 71% in MU-BA scenarios, and 71%, 84%, and 45% in BU-MA scenarios. Our defense mechanism reduced attack success rates significantly, particularly in the MU-BA scenario. Further analysis reveals that compromised agents in the coding and testing phases pose significantly greater security risks, and identifies critical agents that require protection against malicious user exploitation. Our findings highlight the urgent need for robust security measures in multi-agent software development systems and provide practical guidelines for implementing targeted, resource-efficient defensive strategies.
comment: Accepted by AAAI 2026 Alignment Track
General Agentic Memory Via Deep Research
Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called general agentic memory (GAM). GAM follows the principle of "just-in-time (JIT) compilation" where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) Memorizer, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) Researcher, which retrieves and integrates useful information from the page-store for its online request guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations
Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.
SmolKalam: Ensemble Quality-Filtered Translation at Scale for High Quality Arabic Post-Training Data
Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of Smoltalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and examines effective translation techniques for traditional decoder-only models through ablations.
comment: Work in progress
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.
Towards Robust and Fair Next Visit Diagnosis Prediction under Noisy Clinical Notes with Large Language Models AAAI
A decade of rapid advances in artificial intelligence (AI) has opened new opportunities for clinical decision support systems (CDSS), with large language models (LLMs) demonstrating strong reasoning abilities on timely medical tasks. However, clinical texts are often degraded by human errors or failures in automated pipelines, raising concerns about the reliability and fairness of AI-assisted decision-making. Yet the impact of such degradations remains under-investigated, particularly regarding how noise-induced shifts can heighten predictive uncertainty and unevenly affect demographic subgroups. We present a systematic study of state-of-the-art LLMs under diverse text corruption scenarios, focusing on robustness and equity in next-visit diagnosis prediction. To address the challenge posed by the large diagnostic label space, we introduce a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) strategy that emulates clinicians' reasoning. Our approach improves robustness and reduces subgroup instability under degraded inputs, advancing the reliable use of LLMs in CDSS. We release code at https://github.com/heejkoo9/NECHOv3.
comment: Accepted by the Association for the Advancement of Artificial Intelligence (AAAI) 2026 1st Workshop on Safe, Ethical, Certified, Uncertainty-aware, Robust, and Explainable AI for Health (SECURE-AI4H)
Tu crois que c'est vrai ? Diversite des regimes d'enonciation face aux fake news et mecanismes d'autoregulation conversationnelle
This thesis addresses two paradoxes: (1) why empirical studies find that fake news represent only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence énonciative) or interventions ('points d'arrêt') that express disagreement or corrections. Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.
comment: in French language
OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas
The ability of Large Language Models (LLMs) to generate structured outputs that follow arbitrary schemas is crucial to a wide range of downstream tasks that require diverse structured representations of results such as information extraction, table generation, and function calling. While modern LLMs excel in generating unstructured responses in natural language, whether this advancement translates to a strong performance on text-to-structure tasks remains unclear. To bridge this gap, we first introduce OmniStruct, a comprehensive benchmark for assessing LLMs' capabilities on diverse text-to-structure tasks such as information extraction, table generation, and function calling. We build OmniStruct by identifying existing datasets across a wide range of tasks that are suitable for a structured answer format, and adapting them under a unified text-to-structure problem setting. To facilitate the development of efficient text-to-structure models, we collect high-quality training data via synthetic task generation. Without using any supervised data for OmniStruct tasks, our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models that can rival the performance of GPT-4o.
Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection AACL
This paper introduces the approach of "Gradient Masters" for BLP-2025 Task 1: "Bangla Multitask Hate Speech Identification Shared Task". We present an ensemble-based fine-tuning strategy for addressing subtasks 1A (hate-type classification) and 1B (target group classification) in YouTube comments. We propose a hybrid approach on a Bangla Language Model, which outperformed the baseline models and secured the 6th position in subtask 1A with a micro F1 score of 73.23% and the third position in subtask 1B with 73.28%. We conducted extensive experiments that evaluated the robustness of the model throughout the development and evaluation phases, including comparisons with other Language Model variants, to measure generalization in low-resource Bangla hate speech scenarios and data set coverage. In addition, we provide a detailed analysis of our findings, exploring misclassification patterns in the detection of hate speech.
comment: 6 pages, 2 figures, 4 tables. Accepted at the Second International Workshop on Bangla Language Processing (BLP-2025) co-located with AACL-IJCNLP 2025. Ranked 6th (Subtask 1A, 73.23% micro F1) and 3rd (Subtask 1B, 73.28% micro F1) on the official leaderboard
Path-Constrained Retrieval: A Structural Approach to Reliable LLM Agent Reasoning Through Graph-Scoped Semantic Search
Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent's current reasoning state, leading to incoherent reasoning chains. We introduce Path-Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG-6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24-32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path-constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.
comment: 10 pages
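The Path-Constrained Retrieval abstract above describes restricting the search space to nodes reachable from an anchor node and then ranking by semantic similarity. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the adjacency-dict graph, the toy two-dimensional embeddings, and the choice to exclude the anchor from its own results are all assumptions made for illustration.

```python
from collections import deque
import math


def reachable(graph, anchor):
    """Return every node reachable from `anchor` along directed edges (BFS)."""
    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def path_constrained_retrieve(graph, embeddings, anchor, query_vec, k=3):
    """Rank only the nodes reachable from `anchor` by similarity to the query."""
    # Assumption: the anchor itself is not returned as retrieved context.
    candidates = reachable(graph, anchor) - {anchor}
    ranked = sorted(
        candidates,
        key=lambda node: cosine(embeddings[node], query_vec),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy graph: "x" and "y" are disconnected from anchor "a".
graph = {"a": ["b"], "b": ["c"], "x": ["y"]}
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "x": [1.0, 0.0],
    "y": [1.0, 0.0],
}
result = path_constrained_retrieve(graph, embeddings, "a", [1.0, 0.0], k=2)
```

In this toy graph, node "y" matches the query perfectly but is never returned, because it is not reachable from the anchor. A plain vector search over all nodes would have ranked it first.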
Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning
Building codes contain critical information for ensuring safety, regulatory compliance, and informed decision-making in construction and engineering. Automated question answering systems over such codes enable quick and accurate access to specific regulatory clauses, improving efficiency and reducing errors. Retrieval-Augmented Generation (RAG) systems are essential for this task as they combine the precision of information retrieval with the generative capabilities of language models. However, tabular data are challenging to extract as they often involve complex layouts, merged cells, multi-row headers, and embedded semantic relationships that are not easily captured by traditional natural language processing techniques and Vision Language Models (VLMs). This paper explores and compares two methods for extracting information from tabular data in building codes using several pre-trained VLMs. First, a direct input method is used, where the image of the page is input directly into the VLMs, which are then tasked with answering questions based on the image. Second, an indirect input method is introduced, which involves converting an image of a page containing tables into LaTeX code and then answering inquiries based on the LaTeX-based input. The experiments find that the direct input method generally resulted in higher accuracy than the indirect input method. To further improve the performance, we fine-tuned each VLM using Low Rank Adaptation (LoRA) on a domain-specific tabular dataset. The fine-tuned models exhibited substantial improvements, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains exceeding 100%. Our results highlight the potential of parameter-efficient fine-tuning methods to adapt powerful VLMs for understanding complex structured data in specialized fields, such as building code interpretation and regulatory compliance.
"AGI" team at SHROOM-CAP: Data-Centric Approach to Multilingual Hallucination Detection using XLM-RoBERTa AACL
The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopted a data-centric strategy that addressed the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach, which fine-tunes XLM-RoBERTa-Large (560 million parameters) on this enhanced dataset, achieves competitive performance across all languages, including 2nd place in Gujarati (a zero-shot language) with a Factuality F1 of 0.5107, and rankings between 4th and 6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.
comment: Accepted to the 1st Workshop on Confabulation, Hallucinations & Overgeneration in Multilingual and Practical Settings (CHOMPS) at AACL-IJCNLP 2025
From Archives to Decisions: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation
Pharmaceutical research and development has accumulated vast, heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi-agent co-scientist designed to support pharmaceutical research and development. The system implements semantic retrieval, cross-document linking, and auditable synthesis on a large historical corpus from Roche. To validate our approach at real-world scale, we selected a subset of 180 molecules from the Roche research repositories, covering over 0.87 billion BPE tokens and more than four decades of research. Given that automated evaluation metrics are poorly aligned with scientific utility, we evaluate the performance of DiscoVerse using blinded expert evaluation of source-linked outputs. To our knowledge, this is the first agentic framework systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential, end-to-end drug-development archives. Our contributions include role-specialized agent designs aligned with scientist workflows; human-in-the-loop support for reverse translation; expert evaluation; and a large-scale demonstration showing promising answer accuracy and decision-making insights. In brief, across seven benchmark queries covering 180 molecules, DiscoVerse achieved near-perfect recall ($\geq 0.99$) with moderate precision ($0.71-0.91$), while qualitative assessments of discontinuation rationale and organ-specific toxicity showed faithful, source-linked synthesis across preclinical and clinical evidence.
comment: 22 pages, 4 figures, 3 tables
LLMs4All: A Review of Large Language Models Across Academic Disciplines
Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Model (LLM)-based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines, along with key observations and insights, can help researchers and practitioners interested in exploiting LLMs to advance their work in diverse real-world applications.
Non-Linear Scoring Model for Translation Quality Evaluation
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
comment: ongoing work, 38 pages
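The non-linear scoring abstract above proposes E(x) = a * ln(1 + b * x) calibrated from two tolerance points with a one-dimensional root-finding step. One way this calibration might look is sketched below, assuming bisection as the root finder (the abstract does not name one). The function names and the two tolerance points are hypothetical, chosen only so the result can be checked by hand.

```python
import math


def calibrate_log_tolerance(x1, e1, x2, e2, iterations=200):
    """Fit E(x) = a * ln(1 + b * x) through (x1, e1) and (x2, e2).

    Dividing the two equations eliminates `a`, leaving one unknown `b`:
        ln(1 + b*x2) - (e2/e1) * ln(1 + b*x1) = 0
    which is solved here by bisection (an assumed choice of root finder).
    """
    if not (0 < x1 < x2 and 0 < e1 < e2):
        raise ValueError("expected 0 < x1 < x2 and 0 < e1 < e2")
    ratio = e2 / e1
    if ratio >= x2 / x1:
        # A concave log curve cannot grow as fast as (or faster than) a line.
        raise ValueError("tolerance must grow sub-linearly: e2/e1 < x2/x1")

    def f(b):
        return math.log1p(b * x2) - ratio * math.log1p(b * x1)

    # f is positive for small b and negative for large b; expand until bracketed.
    lo, hi = 1e-12, 1.0
    while f(hi) > 0:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = e1 / math.log1p(b * x1)
    return a, b


def tolerance(x, a, b):
    """Acceptable error count (or penalty) for a sample of x words."""
    return a * math.log1p(b * x)


# Hypothetical tolerance points generated from a = 5, b = 0.001.
a, b = calibrate_log_tolerance(1000, 5 * math.log(2), 5000, 5 * math.log(6))
```

With these made-up points the fit recovers a of about 5 and b of about 0.001. The guard on `e2/e1 < x2/x1` reflects the premise that tolerance grows more slowly than sample size; if the two points lie on or above a straight line through the origin, no positive `b` exists.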
Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks
Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present a large-scale survival analysis of conversational robustness, modeling failure as a time-to-event process over 36,951 turns from 9 state-of-the-art LLMs on the MT-Consistency benchmark. Our framework combines Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest models with simple semantic drift features. We find that abrupt prompt-to-prompt semantic drift sharply increases the hazard of inconsistency, whereas cumulative drift is counterintuitively protective, suggesting adaptation in conversations that survive multiple shifts. AFT models with model-drift interactions achieve the best combination of discrimination and calibration, and proportional hazards checks reveal systematic violations for key drift covariates, explaining the limitations of Cox-style modeling in this setting. Finally, we show that a lightweight AFT model can be turned into a turn-level risk monitor that flags most failing conversations several turns before the first inconsistent answer while keeping false alerts modest. These results establish survival analysis as a powerful paradigm for evaluating multi-turn robustness and for designing practical safeguards for conversational AI systems.
A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data
Psychological assessments are dominated by rating scales, which cannot capture the nuance in natural language. Efforts to supplement them with qualitative text have relied on labelled datasets or expert rubrics, limiting scalability. We introduce a framework that avoids this reliance: large language models (LLMs) score free-text responses with simple prompts to produce candidate LLM items, from which we retain those that yield the most test information when co-calibrated with a baseline scale. Using depression as a case study, we developed and tested the method in upper-secondary students (n=693) and a matched synthetic dataset (n=3,000). Results on held-out test sets showed that augmenting a 19-item scale with LLM items improved its precision, accuracy, and convergent validity. Further, the test information gain matched that of adding as many as 16 rating-scale items. This framework leverages the increasing availability of transcribed language to enhance psychometric measures, with applications in clinical health and beyond.
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs ICLR 2025
LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10x decay). In a large-scale empirical study, we show that under an optimal peak LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. D2Z is superior across a range of model sizes, batch sizes, datasets, and vocabularies. Benefits increase as dataset size increases. Leveraging a novel interpretation of AdamW as an exponential moving average of weight updates, we show how linear D2Z optimally balances the demands of early training (moving away from initial conditions) and late training (averaging over more updates in order to mitigate gradient noise). In experiments, a 610M-parameter model trained for 80 tokens-per-parameter (TPP) using D2Z achieves lower loss than when trained for 200 TPP using 10x decay, corresponding to an astonishing 60% compute savings. Models such as Llama2-7B, trained for 286 TPP with 10x decay, could likely have saved a majority of compute by training with D2Z.
comment: ICLR 2025
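To make the two schedules in the abstract above concrete, here is a small sketch of warmup followed by linear decay-to-zero (D2Z) and warmup followed by cosine decay to 10% of the peak. The function names, the linear warmup shape, and the example step counts and peak LR are assumptions for illustration and are not taken from the paper.

```python
import math


def lr_linear_d2z(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then linear decay to exactly zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - warmup_steps)


def lr_cosine_10x(step, total_steps, peak_lr, warmup_steps, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    min_lr = peak_lr * min_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1000 steps, 100 warmup steps, peak LR 3e-4.
TOTAL, WARMUP, PEAK = 1000, 100, 3e-4
final_d2z = lr_linear_d2z(TOTAL, TOTAL, PEAK, WARMUP)
final_cosine = lr_cosine_10x(TOTAL, TOTAL, PEAK, WARMUP)

# For a fixed model size, training compute scales with tokens-per-parameter,
# so matching a 200 TPP run with an 80 TPP run saves 1 - 80/200 of the compute.
compute_savings = 1 - 80 / 200
```

The last line reproduces the 60% figure quoted in the abstract, under the assumption that compute is proportional to the number of training tokens at fixed model size.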
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training NeurIPS 2025
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate $η$ and weight decay $λ$. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size N, dataset size D, and batch size B. Recent work suggests the AdamW timescale, $τ = B/(ηλD)$, should remain constant across training settings, and we verify the implication that optimal $λ$ scales linearly with B, for a fixed N and D. However, as N and D scale, we show optimal $τ$ obeys a precise power law in the tokens-per-parameter ratio, D/N. This law thus provides a method to accurately predict $λ_{opt}$ in advance of large-scale training. We also study scaling laws for optimal batch size $B_{opt}$ (the B enabling lowest loss at a given N, D) and critical batch size $B_{crit}$ (the B beyond which further data parallelism becomes ineffective). In contrast to prior work, we find both $B_{opt}$ and $B_{crit}$ scale as power laws in D, independent of model size, N. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal N and D under dual training time and compute objectives. All experiments were run on Cerebras CS-3 systems.
comment: NeurIPS 2025
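The timescale formula in the abstract above, τ = B/(ηλD), can be rearranged to λ = B/(ητD), which shows why the optimal weight decay scales linearly with batch size when τ, η, and D are held fixed. The sketch below works through that arithmetic. The function names and all numeric values are hypothetical, and B and D are assumed to be measured in the same unit (tokens here).

```python
import math


def adamw_timescale(batch_size, lr, weight_decay, dataset_size):
    """tau = B / (eta * lambda * D), with B and D in the same units."""
    return batch_size / (lr * weight_decay * dataset_size)


def weight_decay_for_timescale(tau, batch_size, lr, dataset_size):
    """Invert the timescale formula: lambda = B / (eta * tau * D)."""
    return batch_size / (lr * tau * dataset_size)


# Hypothetical settings: target tau = 0.2, LR 1e-3, 10B training tokens.
TAU, LR, D = 0.2, 1e-3, 10_000_000_000
B = 200_000  # batch size in tokens

wd_base = weight_decay_for_timescale(TAU, B, LR, D)
wd_doubled_batch = weight_decay_for_timescale(TAU, 2 * B, LR, D)
```

With these made-up numbers the base weight decay is 0.1, and doubling the batch size doubles it to 0.2 while leaving τ unchanged. Choosing τ itself from the tokens-per-parameter ratio requires the fitted power law from the paper, which is not reproduced here.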
Lessons from Studying Two-Hop Latent Reasoning
Large language models can use chain-of-thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two-hop question-answering without CoT. This capability is so basic that if it were a fundamental limitation, it would imply that many complex agentic tasks would similarly require CoT. We investigate LLM latent reasoning capabilities using two-hop question answering as a case study. Previous work on the gap between latent and externalized two-hop reasoning produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where a positive result provides definitive evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture: Models fail to compose two synthetic facts, but can succeed when one fact is synthetic and the other is natural. These results demonstrate that LLMs are undeniably capable of latent two-hop reasoning, although it remains unclear how this ability scales with model size. Finally, we highlight a lesson for researchers studying LLM reasoning: when drawing conclusions about LLM latent reasoning, one must be careful to avoid both spurious successes (that stem from memorization and reasoning shortcuts) and spurious failures (that may stem from artificial experimental setups, divorced from training setups of frontier LLMs).
ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.
Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/HSO-Bench).
comment: To appear in the 2025 IEEE International Conference on Big Data (IEEE BigData 2025)
PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry
Large language models (LLMs) offer significant potential in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of diagnostic reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling 5,188 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, Sonnet 4.5, and GPT 5) alongside leading open-source medical models such as MedGemma using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in mental health applications.
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Recent research on video large language models (VideoLLMs) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing work, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension, where videos do not end and responses are required in real time, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection, and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training effort, and also enables VideoLLMs to reply in real time as the video plays.
comment: 9 pages
One SPACE to Rule Them All: Jointly Mitigating Factuality and Faithfulness Hallucinations in LLMs NIPS 2025
LLMs have demonstrated unprecedented capabilities in natural language processing, yet their practical deployment remains hindered by persistent factuality and faithfulness hallucinations. While existing methods address these hallucination types independently, they inadvertently induce performance trade-offs, as interventions targeting one type often exacerbate the other. Through empirical and theoretical analysis of activation space dynamics in LLMs, we reveal that these hallucination categories share overlapping subspaces within neural representations, presenting an opportunity for concurrent mitigation. To harness this insight, we propose SPACE, a unified framework that jointly enhances factuality and faithfulness by editing shared activation subspaces. SPACE establishes a geometric foundation for shared subspace existence through dual-task feature modeling, then identifies and edits these subspaces via a hybrid probe strategy combining spectral clustering and attention head saliency scoring. Experimental results across multiple benchmark datasets demonstrate the superiority of our approach.
comment: Accepted as NIPS 2025 poster
Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval ACL 2024
Dense retrieval calls for discriminative embeddings to represent the semantic relationship between query and document. It may benefit from the use of large language models (LLMs), given LLMs' strong capability for semantic understanding. However, LLMs are trained by auto-regression, whose working mechanism is completely different from representing a whole text as one discriminative embedding. Thus, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called Llama2Vec, which performs unsupervised adaptation of an LLM for dense retrieval. Llama2Vec consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the LLM is prompted to reconstruct the input sentence and predict the next sentence based on its text embeddings. Llama2Vec is simple, lightweight, but highly effective. It is used to adapt LLaMA-2-7B on the Wikipedia corpus. With a moderate number of adaptation steps, it substantially improves the model's fine-tuned performance on a variety of dense retrieval benchmarks. Notably, it achieves new state-of-the-art performance on popular benchmarks, such as passage and document retrieval on MSMARCO, and zero-shot retrieval on BEIR. To facilitate future research, our model is available at https://github.com/FlagOpen/FlagEmbedding, and the source code will be made publicly available.
comment: ACL 2024
Conversations: Love Them, Hate Them, Steer Them
Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components and locate a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable method for controlling specific emotional attributes in LLMs, contributing to developing more aligned and empathetic conversational AI.
comment: We have created a new arXiv submission with a more up to date version of this paper at arXiv:2511.12832
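The contrastive-activation step described in this abstract can be illustrated with a minimal sketch. It assumes the emotional expression vector is the difference between the mean activations of the positive and negative examples, and that steering adds a scaled copy of that vector to a hidden state; the averaging, the scale `alpha`, and the toy two-dimensional activations are illustrative assumptions, not details taken from the paper.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations: positive examples minus negative ones."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to one hidden state (hypothetical hook point)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations standing in for one layer's outputs on contrastive text pairs.
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 1.0], [2.0, 1.0]]
v = steering_vector(pos, neg)
steered = apply_steering([0.5, 0.5], v, alpha=2.0)
```

In a real model, `hidden` would be captured at the intervention layer chosen by the attribution analysis, which this sketch does not attempt to reproduce.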
ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.
comment: Preprint
ReCode: Updating Code API Knowledge with Reinforcement Learning AAAI 2026
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms both the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
comment: AAAI 2026
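The idea of using string similarity as a reinforcement learning reward can be sketched as follows. The abstract does not define its modified metric, so this sketch substitutes the standard-library `difflib.SequenceMatcher` ratio with whitespace normalization; the function name `similarity_reward` and the normalization step are assumptions for illustration, not ReCode's actual reward.

```python
from difflib import SequenceMatcher


def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting differences are not penalized."""
    return " ".join(code.split())


def similarity_reward(generated: str, reference: str) -> float:
    """Reward in [0, 1]: 1.0 for a (whitespace-insensitive) exact match."""
    return SequenceMatcher(None, normalize(generated), normalize(reference)).ratio()


# A migration to an updated API earns a higher reward the closer it is to the reference.
reference = "result = np.sum(values, axis=0)"
exact = similarity_reward("result  =  np.sum(values, axis=0)", reference)
partial = similarity_reward("result = np.add.reduce(values)", reference)
```

A graded reward of this kind gives partial credit to near-miss migrations, which a pass/fail signal would score identically to unrelated output.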
LoKI: Low-damage Knowledge Implanting of Large Language Models AAAI-26
Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pretraining is overwritten. To address the issue of CF in a general-purpose framework, we propose Low-damage Knowledge Implanting (LoKI), a parameter-efficient fine-tuning (PEFT) technique that utilizes recent mechanistic understanding of how knowledge is stored in transformer architectures. We compare LoKI against state-of-the-art PEFT methods in two real-world fine-tuning scenarios. The results show that LoKI demonstrates significantly better preservation of general capabilities. At the same time, its task-specific performance is comparable to or even surpasses that of full parameter fine-tuning and these PEFT methods across various model architectures. Our work bridges the mechanistic insights of LLMs' knowledge storage with practical fine-tuning objectives, enabling an effective balance between task-specific adaptation and the retention of general-purpose capabilities.
comment: AAAI-26 Oral
Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning EMNLP
The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework's efficacy while maintaining model performance establishes a promising direction for efficient instruction tuning.
comment: Accepted to EMNLP Findings 2025
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the dense LLM, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
comment: 47 pages, 10 figures. Project website: https://idealistxy.github.io/Uni-MoE-v2.github.io/ Code: https://github.com/HITsz-TMG/Uni-MoE
UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression
Noisy self-reported empathy scores challenge supervised learning for empathy regression. While many algorithms have been proposed for learning with noisy labels in textual classification problems, the regression counterpart is relatively under-explored. We propose UPLME, an uncertainty-aware probabilistic language modelling framework to capture label noise in empathy regression tasks. One of the novelties in UPLME is a probabilistic language model that predicts both empathy scores and heteroscedastic uncertainty, and is trained using Bayesian concepts with variational model ensembling. We further introduce two novel loss components: one penalises degenerate Uncertainty Quantification (UQ), and another enforces similarity between the input pairs on which empathy is being predicted. UPLME achieves state-of-the-art performance on two public benchmarks with label noise, improving on the results reported in the literature (Pearson Correlation Coefficient: $0.558\rightarrow0.580$ and $0.629\rightarrow0.634$). Through synthetic label noise injection, we demonstrate that UPLME is effective in distinguishing between noisy and clean samples based on the predicted uncertainty. UPLME also outperforms a recent variational model ensembling-based UQ method designed for regression problems (calibration error: $0.571\rightarrow0.376$). Code is publicly available at https://github.com/hasan-rakibul/UPLME.
comment: Code available at https://github.com/hasan-rakibul/UPLME
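To make "predicting both a score and heteroscedastic uncertainty" concrete, here is a minimal sketch of the standard Gaussian negative log-likelihood that such models commonly minimise. The abstract does not state UPLME's exact objective, so this textbook loss is an assumption, and it omits the variational ensembling and the two additional loss components the abstract mentions. The names `gaussian_nll`, `mu`, and `log_var` are illustrative.

```python
import math


def gaussian_nll(y, mu, log_var):
    """Per-sample Gaussian negative log-likelihood, dropping the constant term.

    y: observed (possibly noisy) label, mu: predicted score,
    log_var: predicted log-variance, i.e. the per-sample uncertainty.
    """
    return 0.5 * (log_var + (y - mu) ** 2 / math.exp(log_var))


# Same prediction error, different predicted uncertainty.
confident = gaussian_nll(y=3.0, mu=1.0, log_var=0.0)          # variance 1
uncertain = gaussian_nll(y=3.0, mu=1.0, log_var=math.log(4))  # variance 4
```

For a large error, predicting a larger variance lowers the loss, which is how a model of this kind can learn to flag likely-noisy labels through high predicted uncertainty.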
FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models NeurIPS 2025
Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
comment: Accepted by NeurIPS 2025
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe content, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first and most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.
LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
comment: 34 pages, 5 figures, 7 tables
Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional graph generation, especially text-guided graph generation, remains largely unexamined. This paper proposes BadGraph, a backdoor attack method against latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: a poisoning rate of less than 10% can achieve a 50% attack success rate, while 24% suffices for a success rate of over 80%, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models for text-guided graph generation, highlight the serious risks in applications of these models such as drug discovery, and underscore the need for robust defenses against backdoor attacks in such diffusion models.
Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection
Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
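For readers unfamiliar with mirror descent under an $\ell_p$ potential, the update can be sketched in a few lines. This sketch assumes the potential is scaled as $\psi(w) = \frac{1}{p}\|w\|_p^p$ with $p > 1$, so that the mirror map is $\nabla\psi(w)_i = \operatorname{sign}(w_i)\,|w_i|^{p-1}$; the scaling, the function names, and the plain-list representation are illustrative assumptions rather than the paper's implementation, and no attention model is included.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def mirror_map(w, p):
    """Gradient of psi(w) = (1/p) * sum_i |w_i|^p, applied component-wise."""
    return [sign(x) * abs(x) ** (p - 1) for x in w]


def inverse_mirror_map(z, p):
    """Inverse of mirror_map (valid for p > 1)."""
    return [sign(x) * abs(x) ** (1.0 / (p - 1)) for x in z]


def md_step(w, grad, eta, p):
    """One mirror descent step: move in the dual space, then map back."""
    dual = [zi - eta * gi for zi, gi in zip(mirror_map(w, p), grad)]
    return inverse_mirror_map(dual, p)


# With p = 2 the mirror map is the identity, so the step is plain gradient descent.
gd_like = md_step([1.0, -2.0], [0.5, 0.5], eta=0.1, p=2)
# With p = 3 the same gradient produces a different, geometry-dependent update.
p3_step = md_step([2.0], [3.0], eta=1.0, p=3)
```

The choice of $p$ changes the geometry of the update, which is why the implicit bias discussed in the abstract is tied to an $\ell_p$-norm objective rather than the Euclidean one.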
Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate both approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain. These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
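The relationship between the absolute and relative gains quoted in this abstract can be checked with simple arithmetic. The sketch below assumes that "absolute improvement" means percentage points on a 0 to 1 metric scale and that relative gain equals absolute gain divided by the baseline score; the implied baseline values are back-of-the-envelope estimates from rounded figures and are not reported in the abstract.

```python
def implied_baseline(abs_gain, rel_gain):
    """Baseline score implied by relative_gain = absolute_gain / baseline."""
    return abs_gain / rel_gain


# mAP@5: +0.13 absolute and +32% relative imply a baseline near 0.41.
map_base = implied_baseline(0.13, 0.32)
map_improved = map_base + 0.13

# nDCG@5: +0.11 absolute and +20% relative imply a baseline near 0.55.
ndcg_base = implied_baseline(0.11, 0.20)
ndcg_improved = ndcg_base + 0.11
```

Because the reported percentages are rounded, the implied baselines should be read only as rough orders of magnitude.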
Demystifying CLIP Data
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient in the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP provides only very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and, in our pursuit of making it open to the community, introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and the training data distribution on metadata are available at https://github.com/facebookresearch/MetaCLIP.
comment: 17 pages. arXiv admin note: text overlap with arXiv:2103.00020 by other authors
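One way to obtain "a balanced subset over the metadata distribution" is to cap how many pairs each metadata entry may contribute. The sketch below assumes each image-text pair is matched to exactly one metadata entry and is kept with probability `min(1, cap / count)`; the single-entry matching, the `cap` parameter, and the function name are simplifying assumptions for illustration and should not be read as the released curation code.

```python
import random
from collections import Counter


def balance_by_metadata(pairs, cap, seed=0):
    """Subsample (text, entry) pairs so frequent entries contribute about `cap` pairs.

    Pairs whose entry occurs at most `cap` times are always kept; pairs from
    over-represented entries are kept with probability cap / count.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(entry for _, entry in pairs)
    kept = []
    for text, entry in pairs:
        keep_prob = min(1.0, cap / counts[entry])
        if rng.random() < keep_prob:
            kept.append((text, entry))
    return kept


# A head concept ("photo") dominates the pool; a tail concept ("axolotl") is rare.
pool = [(f"a photo #{i}", "photo") for i in range(100)]
pool += [(f"an axolotl #{i}", "axolotl") for i in range(3)]
subset = balance_by_metadata(pool, cap=5)
kept_counts = Counter(entry for _, entry in subset)
```

Capping flattens the head of the distribution while leaving tail concepts untouched, which is the balancing effect the abstract describes.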
Human-Computer Interaction
Universality in Collective Intelligence on the Rubik's Cube
Progress in understanding expert performance is limited by the scarcity of quantitative data on long-term knowledge acquisition and deployment. Here we use the Rubik's Cube as a cognitive model system existing at the intersection of puzzle solving, skill learning, expert knowledge, cultural transmission, and group theory. By studying competitive cube communities, we find evidence for universality in the collective learning of the Rubik's Cube in both sighted and blindfolded conditions: expert performance follows exponential progress curves whose parameters reflect the delayed acquisition of algorithms that shorten solution paths. Blindfold solves form a distinct problem class from sighted solves and are constrained not only by expert knowledge but also by the skill improvements required to overcome short-term memory bottlenecks, a constraint shared with blindfold chess. Cognitive artifacts such as the Rubik's Cube help solvers navigate an otherwise enormous mathematical state space. In doing so, they sustain collective intelligence by integrating communal knowledge stores with individual expertise and skill, illustrating how expertise can, in practice, continue to deepen over the course of a single lifetime.
Barriers to AI Adoption: Image Concerns at Work
Concerns about how workers are perceived can deter effective collaboration with artificial intelligence (AI). In a field experiment on a large online labor market, I hired 450 U.S.-based remote workers to complete an image-categorization job assisted by AI recommendations. Workers were incentivized by the prospect of a contract extension based on an HR evaluator's feedback. I find that workers adopt AI recommendations at lower rates when their reliance on AI is visible to the evaluator, resulting in a measurable decline in task performance. The effects are present despite a conservative design in which workers know that the evaluator is explicitly instructed to assess expected accuracy on the same AI-assisted task. This reduction in AI reliance persists even when the evaluator is reassured about workers' strong performance history on the platform, underscoring how difficult these concerns are to alleviate. Leveraging the platform's public feedback feature, I introduce a novel incentive-compatible elicitation method showing that workers fear heavy reliance on AI signals a lack of confidence in their own judgment, a trait they view as essential when collaborating with AI.
The Evaluation for Usability Methods of Unmanned Surface Vehicles: Are Current Usability Methods Viable for Unmanned Surface Vehicles? Insights from a Multiple Case Study Approach to Human-Robot Interaction
Unmanned Surface Vehicles (USVs) are increasingly utilised for diverse applications, ranging from environmental monitoring to security patrols. While USV technology is progressing, it remains clear that full autonomy is not achievable in all scenarios, and remote human intervention is still crucial, particularly in dynamic or complex environments. This continued reliance on human intervention highlights a range of Human-Robot Interaction (HRI) challenges that remain unresolved. Compared to the extensive body of HRI research in domains such as unmanned aerial vehicles and autonomous vehicles, HRI considerations specific to USVs remain significantly underexplored. Addressing this gap, our study investigates real-world usability challenges in USV operation through in-depth interviews with 9 engineers and users, supported by field observations. We focus especially on the difficulties beginner operators encounter and their coping strategies. Our findings reveal existing usability issues, mental models, and adaptation strategies of beginners that inform future user-centered design of USV systems, contributing new insights to the emerging field of maritime HRI. Based on these findings, we argue that current USV systems are poorly suited for beginner operation in dynamic inland and offshore environments, where operators must make timely decisions under uncertainty, manage complex spatial awareness, and adapt to changing environmental conditions. Furthermore, we identify key operational patterns in three representative use cases (harmful algal bloom detection, underwater concealed pipe inspection, and post-construction hydrographic survey) and summarise key interaction constraints that should inform future maritime HRI design efforts.
Emotion-Aware Conversational Recommender Systems: a Case Study
In recent years, online shopping has grown rapidly, especially during the COVID-19 period. However, it still lacks elements typical of physical stores, such as empathic support and personalised advice from a sales assistant. This study explores how an emotion-aware Conversational Agent (CA) can improve the online shopping experience by responding to user emotions in a more natural and human way. The project focuses on Gala, a virtual assistant developed for the Galeries Lafayette website, capable of recognising emotional states from voice messages and adapting its responses accordingly. User needs were first analysed through semi-structured interviews, which informed the design of Gala's UX and functionalities. Gala was implemented using the OpenAI API and the Galeries Lafayette API, adopting a Content-Based recommendation approach. Through Natural Language Processing, it interprets user requests and retrieves products aligned with specific attributes such as name, price, and brand, enabling fluid dialogue and tailored suggestions. Two user studies were conducted: a usability test and a comparative evaluation between a standard CA and Gala's emotion-aware version. The results highlight the potential of emotion-aware CAs to make online shopping faster, more engaging, and closer to an in-store guided experience.
A Multimodal Conversational Agent for Tabular Data Analysis
Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. In this article, we present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. Built on LLMs, the proposed design combines the OpenAI Whisper automatic speech recognition (ASR) system, the Qwen-coder code generation LLM, custom sandboxed execution tools, and the Coqui library for text-to-speech (TTS) within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, our prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between conversation with the user and code execution constrained to a transparent sandbox, and by grounding prompts in schema-level context, the Talk2Data agent reliably retrieves actionable insights from tables and makes computations verifiable. Beyond the Talk2Data agent itself, we discuss implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.
comment: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
Tu crois que c'est vrai ? Diversite des regimes d'enonciation face aux fake news et mecanismes d'autoregulation conversationnelle
This thesis addresses two paradoxes: (1) why empirical studies find that fake news represent only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence énonciative) or interventions ('points d'arrêt') that express disagreement or corrections. Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.
comment: in French language
MultiDiffNet: A Multi-Objective Diffusion Framework for Generalizable Brain Decoding
Neural decoding from electroencephalography (EEG) remains fundamentally limited by poor generalization to unseen subjects, driven by high inter-subject variability and the lack of large-scale datasets to model it effectively. Existing methods often rely on synthetic subject generation or simplistic data augmentation, but these strategies fail to scale or generalize reliably. We introduce MultiDiffNet, a diffusion-based framework that bypasses generative augmentation entirely by learning a compact latent space optimized for multiple objectives. We decode directly from this space and achieve state-of-the-art generalization across various neural decoding tasks using subject- and session-disjoint evaluation. We also curate and release a unified benchmark suite spanning four EEG decoding tasks of increasing complexity (SSVEP, Motor Imagery, P300, and Imagined Speech) and an evaluation protocol that addresses inconsistent split practices in prior EEG research. Finally, we develop a statistical reporting framework tailored for low-trial EEG settings. Our work provides a reproducible and open-source foundation for subject-agnostic EEG decoding in real-world BCI systems.
Clinician-Directed Large Language Model Software Generation for Therapeutic Interventions in Physical Rehabilitation
Digital health interventions are increasingly used in physical and occupational therapy to deliver home exercise programs via sensor-equipped devices such as smartphones, enabling remote monitoring of adherence and performance. However, digital interventions are typically programmed as software before clinical encounters, as libraries of parametrized exercise modules targeting broad patient populations. At the point of care, clinicians can only select modules and adjust a narrow set of parameters such as repetitions, so patient-specific needs that emerge during encounters, such as distinct movement limitations and home environments, are rarely reflected in the software. We evaluated a digital intervention paradigm that uses large language models (LLMs) to translate clinicians' exercise prescriptions into intervention software. In a prospective single-arm feasibility study with 20 licensed physical and occupational therapists and a standardized patient, clinicians created 40 individualized upper extremity programs (398 instructions) that were automatically translated into executable software. Our results show a 45% increase in the proportion of personalized prescriptions that can be implemented as software compared with a template-based benchmark, with unanimous consensus among therapists on ease of use. The LLM-generated software correctly delivered 99.78% (397/398) of instructions as prescribed and monitored performance with 88.4% (352/398) accuracy; 90% (18/20) of therapists judged it safe to interact with patients, and 75% (15/20) expressed willingness to adopt it. To our knowledge, this is the first prospective evaluation of clinician-directed intervention software generation with LLMs in healthcare, demonstrating feasibility and motivating larger trials to assess clinical effectiveness and safety in real patient populations.
Where Do Passengers Gaze? Impact of Passengers' Personality Traits on Their Gaze Pattern Toward Pedestrians During APMV-Pedestrian Interactions with Diverse eHMIs
Autonomous Personal Mobility Vehicles (APMVs) are designed to address the "last-mile" transportation challenge for everyone. When an APMV encounters a pedestrian, it uses an external Human-Machine Interface (eHMI) to negotiate road rights. Through this interaction, passengers also engage with the process. This study examines passengers' gaze behavior toward pedestrians during such interactions, focusing on whether different eHMI designs influence gaze patterns based on passengers' personality traits. The results indicated that when using a visual-based eHMI, passengers often struggled to perceive the communication content. Consequently, passengers with higher Neuroticism scores, who were more sensitive to communication details, might seek cues from pedestrians' reactions. In addition, a multimodal eHMI (visual and voice) using neutral voice did not significantly affect the gaze behavior of passengers toward pedestrians, regardless of personality traits. In contrast, a multimodal eHMI using affective voice encouraged passengers with high Openness to Experience scores to focus on pedestrians' heads. In summary, this study revealed how different eHMI designs influence passengers' gaze behavior and highlighted the effects of personality traits on their gaze patterns toward pedestrians, providing new insights for personalized eHMI designs.
Data-driven Causal Discovery for Pedestrians-Autonomous Personal Mobility Vehicle Interactions with eHMIs: From Psychological States to Walking Behaviors
The autonomous personal mobility vehicle (APMV) is a new type of small smart vehicle designed for mixed-traffic environments, including interactions with pedestrians. To enhance the interaction experience between pedestrians and APMVs and to prevent potential risks, it is crucial to investigate pedestrians' walking behaviors when interacting with APMVs and to understand the psychological processes underlying these behaviors. This study aims to investigate the causal relationships between subjective evaluations of pedestrians and their walking behaviors during interactions with an APMV equipped with an external human-machine interface (eHMI). An experiment of pedestrian-APMV interaction was conducted with 42 pedestrian participants, in which various eHMIs on the APMV were designed to induce participants to experience different levels of subjective evaluations and generate the corresponding walking behaviors. The results of causal discovery align with the previously hypothesized model of the pedestrian's cognition-decision-behavior process. Furthermore, this study analyzes the direct and total causal effects of each factor and investigates the causal processes affecting several important factors in the field of human-vehicle interaction, such as situation awareness, trust in vehicle, risk perception, hesitation in decision making, and walking behaviors.
LLMs' Reshaping of People, Processes, Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters
Large language models (LLMs) are rapidly reshaping software development, but their impact across the software development lifecycle is underexplored. Existing work focuses on isolated activities such as code generation or testing, leaving open questions about how LLMs affect developers, processes, products, and the software ecosystem. We address this gap through semi-structured interviews with sixteen early-adopter software professionals who integrated LLM-based tools into their day-to-day work in early to mid-2023. We treat these interviews as early empirical evidence and compare participants' accounts with recent work on LLMs in software engineering, noting which early patterns persist or shift. Using thematic analysis, we organize findings around four dimensions: people, process, product, and society. Developers reported substantial productivity gains from reducing routine tasks, streamlining search, and accelerating debugging, but also described a productivity-quality paradox: they often discarded generated code and shifted effort from writing to critically evaluating and integrating it. LLM use was highly phase-dependent, with strong uptake in implementation and debugging but limited influence on requirements gathering and other collaborative work. Participants developed new competencies to use LLMs effectively, including prompt engineering strategies, layered verification, and secure integration to protect proprietary data. They anticipated changes in hiring expectations, team practices, and computing education, while emphasizing that human judgment and foundational software engineering skills remain essential. Our findings, later echoed in large-scale quantitative studies, offer actionable implications for developers, organizations, educators, and tool designers seeking to integrate LLMs responsibly into software practice today.
Image and Video Processing
Linear Algebraic Approaches to Neuroimaging Data Compression: A Comparative Analysis of Matrix and Tensor Decomposition Methods for High-Dimensional Medical Images
This paper evaluates Tucker decomposition and Singular Value Decomposition (SVD) for compressing neuroimaging data. Tucker decomposition preserves multi-dimensional relationships, achieving superior reconstruction fidelity and perceptual similarity. SVD excels in extreme compression but sacrifices fidelity. The results highlight Tucker decomposition's suitability for applications requiring the preservation of structural and temporal relationships.
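The compression side of the comparison above comes down to counting stored numbers. The sketch below is a minimal illustration and does not reproduce the paper's experimental setup: the scan dimensions, the ranks, and the choice to unfold the volume into a space-by-time matrix for the SVD are all hypothetical. The counts say nothing about reconstruction fidelity, which depends on the data.

```python
def svd_storage(m, n, rank):
    """Numbers stored by a rank-`rank` truncated SVD of an m x n matrix."""
    # Left vectors (m x rank), singular values (rank), right vectors (n x rank).
    return rank * (m + n + 1)


def tucker_storage(shape, ranks):
    """Numbers stored by a Tucker decomposition: core tensor plus one factor matrix per mode."""
    core = 1
    for r in ranks:
        core *= r
    factors = sum(dim * r for dim, r in zip(shape, ranks))
    return core + factors


# Hypothetical 4D scan: 64 x 64 x 32 voxels over 100 time points.
shape = (64, 64, 32, 100)
n_values = 64 * 64 * 32 * 100

# SVD needs a matrix, so (as an assumption) flatten space into rows and keep time as columns.
svd_cost = svd_storage(64 * 64 * 32, 100, rank=10)
# Tucker keeps all four modes and assigns a rank to each one.
tucker_cost = tucker_storage(shape, ranks=(16, 16, 8, 10))

print("compression ratio, SVD:", n_values / svd_cost)
print("compression ratio, Tucker:", n_values / tucker_cost)
```

Because Tucker keeps each mode separate, its factor matrices scale with the individual dimensions rather than with their product, which is why it can preserve multi-dimensional structure at a modest storage cost.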
Diverse Instance Generation via Diffusion Models for Enhanced Few-Shot Object Detection in Remote Sensing Images
Few-shot object detection (FSOD) aims to detect novel instances with only a limited number of labeled training samples, presenting a challenge that is particularly prominent in numerous remote sensing applications such as endangered species monitoring and disaster assessment. Existing FSOD methods for remote sensing images (RSIs) have achieved promising progress but remain constrained by the limited diversity of instances. To address this issue, we propose a novel framework that can leverage a diffusion model pretrained on large-scale natural images to synthesize diverse remote sensing instances, thereby improving the performance of few-shot object detectors. Instead of directly synthesizing complete remote sensing images, we first generate instance-level slices via a specialized slice-to-slice module, and then embed these slices into full-scale imagery for enhanced data augmentation. To further adapt diffusion models for remote sensing scenarios, we develop a class-agnostic image inversion module that can invert remote sensing instance slices into semantic space. Additionally, we introduce contrastive loss to semantically align the synthesized images with their corresponding classes. Experimental results show that our method has achieved an average performance improvement of 4.4% across multiple datasets and various approaches. Ablation experiments indicate that the elaborately designed inversion module can effectively enhance the performance of FSOD methods, and the semantic contrastive loss can further boost the performance.
comment: 6 pages, 2 figures
Spectral Super-Resolution Neural Operator with Atmospheric Radiative Transfer Prior
Spectral super-resolution (SSR) aims to reconstruct hyperspectral images (HSIs) from multispectral observations, with broad applications in remote sensing. Data-driven methods are widely used, but they often overlook physical principles, leading to unrealistic spectra, particularly in atmosphere-affected bands. To address this challenge, we propose the Spectral Super-Resolution Neural Operator (SSRNO), which incorporates atmospheric radiative transfer (ART) prior into the data-driven procedure, yielding more physically consistent predictions. The proposed SSRNO framework consists of three stages: upsampling, reconstruction, and refinement. In the upsampling stage, we leverage prior information to expand the input multispectral image, producing a physically plausible hyperspectral estimate. Subsequently, we utilize a neural operator in the reconstruction stage to learn a continuous mapping across the spectral domain. Finally, the refinement stage imposes a hard constraint on the output HSI to eliminate color distortion. The upsampling and refinement stages are implemented via the proposed guidance matrix projection (GMP) method, and the reconstruction neural operator adopts U-shaped spectral-aware convolution (SAC) layers to capture multi-scale features. Moreover, we theoretically demonstrate the optimality of the GMP method. With the neural operator and ART priors, SSRNO also achieves continuous spectral reconstruction and zero-shot extrapolation. Various experiments validate the effectiveness and generalization ability of the proposed approach.
TransLK-Net: Entangling Transformer and Large Kernel for Progressive and Collaborative Feature Encoding and Decoding in Medical Image Segmentation
Convolutional neural networks (CNNs) and vision transformers (ViTs) are widely employed for medical image segmentation, but they are still challenged by their intrinsic characteristics. CNNs are limited from capturing varying-scaled features and global contextual information due to the employment of fixed-sized kernels. In contrast, ViTs employ self-attention and MLP for global information modeling, but they lack mechanisms to learn spatial-wise local information. Additionally, self-attention leads the network to show high computational complexity. To tackle these limitations, we propose Progressively Entangled Transformer Large Kernel (PTLK) and Collaboratively Entangled Transformer Large Kernel (CTLK) modules to leverage the benefits of self-attention and large kernel convolutions and overcome shortcomings. Specifically, PTLK and CTLK modules employ the Multi-head Large Kernel to capture multi-scale local features and the Efficient Decomposed Self-attention to model global information efficiently. Subsequently, they employ the Attention Entanglement mechanism to enable local and global features to enhance and calibrate each other progressively and collaboratively. Additionally, an Attention-gated Channel MLP (AG-MLP) module is proposed to equip the standard MLP module with the capabilities of modeling spatial information. PTLK and CTLK modules are further incorporated as a Cross Entanglement Decoding (CED) block for efficient feature fusion and decoding. Finally, we propose a novel network for volumetric medical image segmentation that employs an encoder-decoder architecture, termed TransLK-Net. The encoder employs a hierarchical ViT architecture whose block is built by incorporating PTLK and CTLK with AG-MLP into a ViT block, and the decoder employs the CED block.
comment: 7 figures
INT-DTT+: Low-Complexity Data-Dependent Transforms for Video Coding
Discrete trigonometric transforms (DTTs), such as the DCT-2 and the DST-7, are widely used in video codecs for their balance between coding performance and computational efficiency. In contrast, data-dependent transforms, such as the Karhunen-Loève transform (KLT) and graph-based separable transforms (GBSTs), offer better energy compaction but lack symmetries that can be exploited to reduce computational complexity. This paper bridges this gap by introducing a general framework to design low-complexity data-dependent transforms. Our approach builds on DTT+, a family of GBSTs derived from rank-one updates of the DTT graphs, which can adapt to signal statistics while retaining a structure amenable to fast computation. We first propose a graph learning algorithm for DTT+ that estimates the rank-one updates for rows and column graphs jointly, capturing the statistical properties of the overall block. Then, we exploit the progressive structure of DTT+ to decompose the kernel into a base DTT and a structured Cauchy matrix. By leveraging low-complexity integer DTTs and sparsifying the Cauchy matrix, we construct an integer approximation to DTT+, termed INT-DTT+. This approximation significantly reduces both computational and memory complexities with respect to the separable KLT with minimal performance loss. We validate our approach in the context of mode-dependent transforms for the VVC standard, following a rate-distortion optimized transform (RDOT) design approach. Integrated into the explicit multiple transform selection (MTS) framework of VVC in a rate-distortion optimization setup, INT-DTT+ achieves more than 3% BD-rate savings over the VVC MTS baseline, with complexity comparable to the integer DCT-2 once the base DTT coefficients are available.
A Versatile Optical Frontend for Multicolor Fluorescence Imaging with Miniaturized Lensless Sensors
Lensless imaging enables exceptionally compact fluorescence sensors, advancing applications in in vivo imaging and low-cost, point-of-care diagnostics. These sensors require a filter to block the excitation light while passing fluorescent emissions. However, conventional thin-film interference filters are sensitive to angle of incidence (AOI), complicating their use in lensless systems. Here we thoroughly analyze and optimize a technique using a fiber optic plate (FOP) to absorb off-axis light that would bleed through the interference filter while improving image resolution. Through simulations, we show that the numerical aperture (NA) of the FOP drives inherent design tradeoffs: collection efficiency improves rapidly with a higher NA, but at the cost of resolution, increased device thickness, and fluorescence excitation efficiency. To illustrate this, we optimize two optical frontends with full-width at half maximums (FWHMs) of 8.3° and 45.7°. Implementing these designs, we show that angle-insensitivity requires filters on both sides of the FOP, due to scattering. In imaging experiments, the 520-μm-thick high-NA design is 59× more sensitive to fluorescence while only degrading resolution by 3.2×. Alternatively, the low-NA design is capable of three-color fluorescence imaging with 110-μm resolution at a 1-mm working distance. Overall, we demonstrate a versatile optical frontend that is adaptable to a range of applications using different fluorophores, illumination configurations, and lensless imaging techniques.
Generative MR Multitasking with complex-harmonic cardiac encoding: Bridging the gap between gated imaging and real-time imaging
Purpose: To develop a unified image reconstruction framework that bridges real-time and gated cardiac MRI, including quantitative MRI. Methods: We introduce Generative Multitasking, which learns an implicit neural temporal basis from sequence timings and an interpretable latent space for cardiac and respiratory motion. Cardiac motion is modeled as a complex harmonic, with phase encoding timing and a latent amplitude capturing beat-to-beat functional variability, linking cardiac phase-resolved ("gated-like") and time-resolved ("real-time-like") views. We implemented the framework using a conditional variational autoencoder (CVAE) and evaluated it for free-breathing, non-ECG-gated radial GRE in three settings: steady-state cine imaging, multicontrast T2prep/IR imaging, and dual-flip-angle T1/T2 mapping, compared with conventional Multitasking. Results: Generative Multitasking provided flexible cardiac motion representation, enabling reconstruction of archetypal cardiac phase-resolved cines (like gating) as well as time-resolved series that reveal beat-to-beat variability (like real-time imaging). Conditioning on the previous k-space angle and modifying this term at inference removed eddy-current artifacts without globally smoothing high temporal frequencies. For quantitative mapping, Generative Multitasking reduced intraseptal T1 and T2 coefficients of variation compared with conventional Multitasking (T1: 0.13 vs. 0.31; T2: 0.12 vs. 0.32; p<0.001), indicating higher SNR. Conclusion: Generative Multitasking uses a CVAE with complex harmonic cardiac coordinates to unify gated and real-time CMR within a single free-breathing, non-ECG-gated acquisition. It allows flexible cardiac motion representation, suppresses trajectory-dependent artifacts, and improves T1 and T2 mapping, suggesting a path toward cine, multicontrast, and quantitative imaging without separate gated and real-time scans.
comment: Submitted to Magnetic Resonance in Medicine; 21 pages, 7 figures
A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT
Pediatric liver tumors are one of the most common solid tumors in pediatrics, with differentiation of benign or malignant status and pathological classification critical for clinical treatment. While pathological examination is the gold standard, the invasive biopsy has notable limitations: the highly vascular pediatric liver and fragile tumor tissue raise complication risks such as bleeding; additionally, young children with poor compliance require anesthesia for biopsy, increasing medical costs or psychological trauma. Although many efforts have been made to utilize AI in clinical settings, most researchers have overlooked its importance in pediatric liver tumors. To establish a non-invasive examination procedure, we developed a multi-stage deep learning (DL) framework for automated pediatric liver tumor diagnosis using multi-phase contrast-enhanced CT. Two retrospective and prospective cohorts were enrolled. We established a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. We also trained a tumor detection model to extract ROIs, and then set a two-stage diagnosis pipeline with three backbones with ROI-masked images. Our tumor detection model has achieved high performance (mAP=0.871), and the first stage classification model between benign and malignant tumors reached an excellent performance (AUC=0.989). Final diagnosis models also exhibited robustness, including benign subtype classification (AUC=0.915) and malignant subtype classification (AUC=0.979). We also conducted multi-level comparative analyses, such as ablation studies on data and training pipelines, as well as Shapley-Value and CAM interpretability analyses. This framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible pediatric liver tumor diagnosis.
Not Quite Anything: Overcoming SAMs Limitations for 3D Medical Imaging
Foundation segmentation models such as SAM and SAM-2 perform well on natural images but struggle with brain MRIs where structures like the caudate and thalamus lack sharp boundaries and have low contrast. Rather than fine tune these models (for example MedSAM), we propose a compositional alternative where the foundation model output is treated as an additional input channel and passed alongside the MRI to highlight regions of interest. We generate SAM-2 prompts by using a lightweight 3D U-Net that was previously trained on MRI segmentation. The U-Net may have been trained on a different dataset, so its guesses are often imprecise but usually in the correct region. The edges of the resulting foundation model guesses are smoothed to improve alignment with the MRI. We also test prompt free segmentation using DINO attention maps in the same framework. This has-a architecture avoids modifying foundation weights and adapts to domain shift without retraining the foundation model. It reaches about 96 percent volume accuracy on basal ganglia segmentation, which is sufficient for our study of longitudinal volume change. The approach is fast, label efficient, and robust to out of distribution scans. We apply it to study inflammation linked changes in sudden onset pediatric OCD.
comment: Preprint; Paper accepted at AIAS 2025
Improvement of Spiking Neural Network with Bit Planes and Color Models
Spiking neural networks (SNNs) have emerged as a promising paradigm in computational neuroscience and artificial intelligence, offering advantages such as low energy consumption and a small memory footprint. However, their practical adoption is constrained by several challenges, prominent among them performance optimization. In this study, we present a novel approach to enhance the performance of SNNs on images through a new coding method that exploits bit plane representation. Our proposed technique is designed to improve the accuracy of SNNs without increasing model size. We also investigate the impact of color models on the proposed coding process. Through extensive experimental validation, we demonstrate the effectiveness of our coding strategy in achieving performance gains across multiple datasets. To the best of our knowledge, this is the first research that considers bit planes and color models in the context of SNNs. By leveraging the unique characteristics of bit planes, we hope to unlock new potential in SNN performance, paving the way for more efficient and effective SNN models in future research and applications.
comment: Accepted for publication at IEEE Access
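The bit plane representation mentioned in the abstract above can be illustrated with a minimal sketch. It only shows the decomposition itself. How the paper turns the planes into spike trains is not stated here, so treating each binary plane as one candidate input frame is an assumption, and the function names are hypothetical.

```python
def bit_planes(pixels, bits=8):
    """Split unsigned integer pixel values into binary planes, most significant bit first."""
    return [[(p >> b) & 1 for p in pixels] for b in range(bits - 1, -1, -1)]


def from_bit_planes(planes):
    """Recombine binary planes (most significant first) into integer pixel values."""
    bits = len(planes)
    return [
        sum(plane[i] << (bits - 1 - k) for k, plane in enumerate(planes))
        for i in range(len(planes[0]))
    ]


row = [0, 5, 128, 255]        # one row of a hypothetical 8-bit grayscale image
planes = bit_planes(row)      # 8 binary lists, each the same length as `row`
restored = from_bit_planes(planes)
```

Each plane is already binary, which matches the 0/1 nature of spikes, and the decomposition is lossless: recombining the eight planes returns the original row. For a color image the same split would be applied per channel of whichever color model is chosen.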
RN-SDEs: Limited-Angle CT Reconstruction with Residual Null-Space Diffusion Stochastic Differential Equations
Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at certain angles, resulting in distortion or artifacts in the reconstructed images. This is referred to as the Limited Angle Computed Tomography (LACT) reconstruction problem. To address this problem, we propose the use of Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), which are a variant of diffusion models that characterize the diffusion process with mean-reverting (MR) stochastic differential equations. To demonstrate the generalizability of RN-SDEs, we conducted experiments with two different LACT datasets, ChromSTEM and C4KC-KiTS. Through extensive experiments, we demonstrate that by leveraging learned MR-SDEs as a prior and emphasizing data consistency using Range-Null Space Decomposition (RNSD) based rectification, we can recover high-quality images from severely degraded ones and achieve state-of-the-art performance in most LACT tasks. Additionally, we present a quantitative comparison of RN-SDE with other networks, in terms of computational complexity and runtime efficiency, highlighting the superior effectiveness of our proposed approach.
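The range-null space rectification mentioned above can be sketched on a toy problem. Given measurements y = A x and a prior estimate x_prior, the rectified estimate is x = A^+ y + (I - A^+ A) x_prior, which agrees with the measurements exactly and takes only the unmeasured component from the prior. The sketch assumes A is a simple selection operator that keeps some entries of x, so that A^+ is its transpose. A real limited-angle CT operator is a projection transform, not a selection, and the function names are hypothetical.

```python
def apply_A(x, observed):
    """Toy forward operator: keep only the entries of x at the observed indices."""
    return [x[i] for i in observed]


def rnsd_rectify(x_prior, y, observed):
    """Compute x = A^+ y + (I - A^+ A) x_prior for the selection operator A."""
    x = list(x_prior)              # start from the prior estimate
    for value, idx in zip(y, observed):
        x[idx] = value             # range-space part: overwrite with the measured value
    return x                       # untouched entries are the null-space part of the prior


x_prior = [0.9, 2.2, 2.8, 4.1]     # e.g. a sample produced by a learned prior
observed = [0, 2]                  # indices that were actually measured
y = [1.0, 3.0]                     # the measurements

x_hat = rnsd_rectify(x_prior, y, observed)
```

After rectification, applying A to x_hat reproduces y exactly, so data consistency holds regardless of how good the prior is; the prior only fills in what the measurements cannot see.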
Automatic nodule identification and differentiation in ultrasound videos to facilitate per-nodule examination
Ultrasound is a vital diagnostic technique in health screening; because it is non-invasive, cost-effective, and radiation-free, it is widely applied in the diagnosis of nodules. However, it relies heavily on the expertise and clinical experience of the sonographer. In ultrasound images, a single nodule might present heterogeneous appearances in different cross-sectional views, which makes it hard to perform per-nodule examination. Sonographers usually discriminate different nodules by examining the nodule features and the surrounding structures such as glands and ducts, which is cumbersome and time-consuming. To address this problem, we collected hundreds of breast ultrasound videos and built a nodule re-identification system that consists of two parts: a deep-learning-based extractor that extracts feature vectors from the input video clips, and a real-time clustering algorithm that automatically groups feature vectors by nodule. The system obtains satisfactory results and is able to differentiate ultrasound videos. As far as we know, this is the first attempt to apply re-identification techniques in the ultrasound field.
comment: The authors wish to withdraw this manuscript as it requires major revisions that substantially change the methodology and conclusions. A significantly updated version of this work may be submitted elsewhere at a later date. Thank you for your understanding
Intraoperative 2D/3D Registration via Spherical Similarity Learning and Differentiable Levenberg-Marquardt Optimization WACV 2026
Intraoperative 2D/3D registration aligns preoperative 3D volumes with real-time 2D radiographs, enabling accurate localization of instruments and implants. A recent fully differentiable similarity learning framework approximates geodesic distances on SE(3), expanding the capture range of registration and mitigating the effects of substantial disturbances, but existing Euclidean approximations distort manifold structure and slow convergence. To address these limitations, we explore similarity learning in non-Euclidean spherical feature spaces to better capture and fit complex manifold structure. We extract feature embeddings using a CNN-Transformer encoder, project them into spherical space, and approximate their geodesic distances with Riemannian distances in the bi-invariant SO(4) space. This enables a more expressive and geometrically consistent deep similarity metric, enhancing the ability to distinguish subtle pose differences. During inference, we replace gradient descent with fully differentiable Levenberg-Marquardt optimization to accelerate convergence. Experiments on real and synthetic datasets show superior accuracy in both patient-specific and patient-agnostic scenarios.
comment: WACV 2026 Accepted
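For readers unfamiliar with the optimizer named in the abstract above, the textbook Levenberg-Marquardt update is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $\theta$ is the pose parameter vector, $r(\theta)$ the residual vector, $J$ its Jacobian with respect to $\theta$, and $\lambda > 0$ the damping factor.

```latex
\begin{align}
\left( J^{\top} J + \lambda I \right) \delta &= - J^{\top} r(\theta), \\
\theta &\leftarrow \theta + \delta .
\end{align}
```

A small $\lambda$ makes the step close to Gauss-Newton, while a large $\lambda$ makes it close to a short gradient-descent step, which is the usual explanation for faster convergence than plain gradient descent.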
Multimedia
Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification AAAI2026
Multi-modal object Re-IDentification (ReID) is devoted to retrieving specific objects through the exploitation of complementary multi-modal image information. Existing methods mainly concentrate on the fusion of multi-modal features, yet neglect background interference. Besides, current multi-modal fusion methods often focus on aligning modality pairs but struggle with multi-modal consistency alignment. To address these issues, we propose a novel selective interaction and global-local alignment framework called Signal for multi-modal object ReID. Specifically, we first propose a Selective Interaction Module (SIM) to select important patch tokens with intra-modal and inter-modal information. These important patch tokens engage in the interaction with class tokens, thereby yielding more discriminative features. Then, we propose a Global Alignment Module (GAM) to simultaneously align multi-modal features by minimizing the volume of 3D polyhedra in the gramian space. Meanwhile, we propose a Local Alignment Module (LAM) to align local features in a shift-aware manner. With these modules, our proposed framework can extract more discriminative features for object ReID. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100, MSVR310) validate the effectiveness of our method. The source code is available at https://github.com/010129/Signal.
comment: Accepted by AAAI2026. More modifications may be performed
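The Gramian-volume idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' GAM implementation: it assumes one feature vector per modality and uses the standard fact that the volume of the parallelotope spanned by three vectors equals the square root of the determinant of their Gram matrix. The function names are hypothetical.

```python
import math

def gram_matrix(vectors):
    # G[i][j] is the inner product of vectors i and j.
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def det3(m):
    # Determinant of a 3x3 matrix by cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def gram_volume(v1, v2, v3):
    # Volume spanned by three modality features (hypothetical helper).
    # Clamp at zero to guard against tiny negative values from rounding.
    g = gram_matrix([v1, v2, v3])
    return math.sqrt(max(det3(g), 0.0))
```

The volume shrinks to zero as the three vectors become linearly dependent, so minimizing it pulls the modality features toward a common direction.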
Decoupled Audio-Visual Dataset Distillation
Audio-Visual Dataset Distillation aims to compress large-scale datasets into compact subsets while preserving the performance of the original data. However, conventional Distribution Matching (DM) methods struggle to capture intrinsic cross-modal alignment. Subsequent studies have attempted to introduce cross-modal matching, but two major challenges remain: (i) independently and randomly initialized encoders lead to inconsistent modality mapping spaces, increasing training difficulty; and (ii) direct interactions between modalities tend to damage modality-specific (private) information, thereby degrading the quality of the distilled data. To address these challenges, we propose DAVDD, a pretraining-based decoupled audio-visual distillation framework. DAVDD leverages a diverse pretrained bank to obtain stable modality features and uses a lightweight decoupler bank to disentangle them into common and private representations. To effectively preserve cross-modal structure, we further introduce Common Intermodal Matching together with a Sample-Distribution Joint Alignment strategy, ensuring that shared representations are aligned both at the sample level and the global distribution level. Meanwhile, private representations are entirely isolated from cross-modal interaction, safeguarding modality-specific cues throughout distillation. Extensive experiments across multiple benchmarks show that DAVDD achieves state-of-the-art results under all IPC settings, demonstrating the effectiveness of decoupled representation learning for high-quality audio-visual dataset distillation. Code will be released.
Tracking and Segmenting Anything in Any Modality AAAI 2026
Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
comment: Accepted by AAAI 2026
Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks
Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.
D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos AAAI-26
Free-Viewpoint Video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representation remains a major challenge. Existing dynamic 3D Gaussian Splatting methods couple reconstruction with optimization-dependent compression and customized motion formats, limiting generalization and standardization. To address this, we propose D-FCGS, a novel Feedforward Compression framework for Dynamic Gaussian Splatting. Key innovations include: (1) a standardized Group-of-Frames (GoF) structure with I-P coding, leveraging sparse control points to extract inter-frame motion tensors; (2) a dual prior-aware entropy model that fuses hyperprior and spatial-temporal priors for accurate rate estimation; (3) a control-point-guided motion compensation mechanism and refinement network to enhance view-consistent fidelity. Trained on Gaussian frames derived from multi-view videos, D-FCGS generalizes across diverse scenes in a zero-shot fashion. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression compared to the baseline while preserving visual quality across viewpoints. This work advances feedforward compression of dynamic 3DGS, facilitating scalable FVV transmission and storage for immersive applications.
comment: AAAI-26 accepted, code: https://github.com/Mr-Zwkid/D-FCGS
CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech. Moreover, existing models also suffer from rigid single-order generation templates and weak semantic alignment, substantially limiting their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.
Computation and Language
Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems
Recent advances in Large Language Model Multi-Agent Systems enable scalable orchestration and retrieval of specialized, parallelized subagents, each equipped with hundreds or thousands of Model Context Protocol (MCP) servers and tools. However, existing agent, MCP, and retrieval methods typically match queries against a single agent description, obscuring fine-grained tool capabilities of each agent, resulting in suboptimal agent selection. We introduce Agent-as-a-Graph retrieval, a knowledge graph retrieval augmented generation approach that represents both tools and their parent agents as nodes and edges in a knowledge graph. During retrieval, i) relevant agents and tool nodes are first retrieved through vector search, ii) we apply a type-specific weighted reciprocal rank fusion (wRRF) for reranking tools and agents, and iii) parent agents are traversed in the knowledge graph for the final set of agents. We evaluate Agent-as-a-Graph on the LiveMCPBenchmark, achieving 14.9% and 14.6% improvements in Recall@5 and nDCG@5 over prior state-of-the-art retrievers, and 2.4% improvements in wRRF optimizations.
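The retrieval steps in the abstract above can be sketched roughly as follows. This is a simplified illustration, not the authors' code: the function name, the `parent_of` mapping, the weight values, the constant `k = 60`, and the choice to sum scores per parent agent are all assumptions made here for the example.

```python
def wrrf_agents(ranked_nodes, parent_of, weights, k=60, top_n=5):
    """Rank agents from a mixed list of agent and tool nodes.

    ranked_nodes: list of (node_id, node_type) in vector-search order,
                  with node_type either "agent" or "tool".
    parent_of:    maps each tool node id to its parent agent id.
    weights:      type-specific weights, e.g. {"agent": 1.0, "tool": 1.5}.
    """
    agent_scores = {}
    for rank, (node_id, node_type) in enumerate(ranked_nodes, start=1):
        # Weighted reciprocal rank contribution of this node.
        score = weights[node_type] / (k + rank)
        # Tool nodes are credited to their parent agent (graph traversal step).
        agent = node_id if node_type == "agent" else parent_of[node_id]
        agent_scores[agent] = agent_scores.get(agent, 0.0) + score
    # Sort by descending score, breaking ties by agent id for determinism.
    ranked = sorted(agent_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [agent for agent, _ in ranked[:top_n]]
```

With a higher tool weight, an agent whose individual tools match the query can outrank an agent that matches only through its coarse description.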
Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models to answer financial questions using external knowledge bases of U.S. SEC filings, earnings reports, and regulatory documents. However, existing work lacks systematic comparison of vector-based and non-vector RAG architectures for financial documents, and the empirical impact of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost remains unclear. We present the first systematic evaluation comparing vector-based agentic RAG using hybrid search and metadata filtering against hierarchical node-based systems that traverse document structure without embeddings. We evaluate two enhancement techniques applied to the vector-based architecture: i) cross-encoder reranking for retrieval precision, and ii) small-to-big chunk retrieval for context completeness. Across 1,200 SEC 10-K, 10-Q, and 8-K filings on a 150-question benchmark, we measure retrieval metrics (MRR, Recall@5), answer quality through LLM-as-a-judge pairwise comparisons, latency, and preprocessing costs. Vector-based agentic RAG achieves a 68% win rate over hierarchical node-based systems with comparable latency (5.2 compared to 5.98 seconds). Cross-encoder reranking achieves a 59% absolute improvement at optimal parameters (10, 5) for MRR@5. Small-to-big retrieval achieves a 65% win rate over baseline chunking with only 0.2 seconds additional latency. Our findings reveal that applying advanced RAG techniques to financial Q&A systems improves retrieval accuracy and answer quality, with cost-performance tradeoffs to be considered in production.
comment: 8 pages, 2 figures
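The two retrieval metrics named in the abstract above have standard definitions, sketched below. The function names and the toy document ids are hypothetical; the paper's own evaluation harness is not described in the listing.

```python
def reciprocal_rank(retrieved, relevant, k=5):
    # 1 / rank of the first relevant document within the top k, else 0.
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(retrieved, relevant, k=5):
    # Fraction of the relevant documents that appear in the top k.
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(runs, k=5):
    # runs: list of (retrieved, relevant) pairs, one per question.
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)
```

For example, a question whose only relevant chunk is returned in second place contributes 0.5 to MRR@5 and 1.0 to Recall@5.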
Vector Arithmetic in Concept and Token Subspaces NeurIPS 2025
In order to predict the next token, LLMs must represent semantic and surface-level information about the current word. Previous work identified two types of attention heads that disentangle this information: (i) Concept induction heads, which copy word meanings, and (ii) Token induction heads, which copy literal token representations (Feucht et al., 2025). We show that these heads can be used to identify subspaces of model activations that exhibit coherent semantic structure in Llama-2-7b. Specifically, when we transform hidden states using the attention weights of concept heads, we are able to more accurately perform parallelogram arithmetic (Mikolov et al., 2013) on the resulting hidden states, e.g., showing that "Athens" - "Greece" + "China" = "Beijing". This transformation allows for much higher nearest-neighbor accuracy (80%) than direct use of raw hidden states (47%). Analogously, we show that token heads allow for transformations that reveal surface-level word information in hidden states, allowing for operations like "coding" - "code" + "dance" = "dancing".
comment: 9 pages, 6 figures. NeurIPS 2025 Mechanistic Interpretability Workshop
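The parallelogram arithmetic referenced in the abstract above can be illustrated with a toy nearest-neighbor lookup. This sketch uses made-up two-dimensional vectors and omits the paper's key step of first transforming hidden states with concept-head or token-head attention weights; the embedding values and function names are assumptions for illustration only.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(emb, a, b, c):
    # Return the word nearest to emb[a] - emb[b] + emb[c],
    # excluding the three query words themselves.
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy vectors: first coordinate encodes the country, second "is a capital".
toy_embeddings = {
    "Athens": [1.0, 1.0],
    "Greece": [1.0, 0.0],
    "China": [3.0, 0.0],
    "Beijing": [3.0, 1.0],
    "Paris": [2.0, 1.0],
    "France": [2.0, 0.0],
}
```

Calling `analogy(toy_embeddings, "Athens", "Greece", "China")` selects "Beijing" in this toy setup; nearest-neighbor accuracy over many such queries is the kind of quantity the abstract reports.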
GeeSanBhava: Sentiment Tagged Sinhala Music Video Comment Data Set
This study introduces GeeSanBhava, a high-quality data set of Sinhala song comments extracted from YouTube and manually tagged using Russell's Valence-Arousal model by three independent human annotators. The human annotators achieve substantial inter-annotator agreement (Fleiss' kappa = 84.96%). The analysis revealed distinct emotional profiles for different songs, highlighting the importance of comment-based emotion mapping. The study also addressed the challenges of comparing comment-based and song-based emotions, mitigating biases inherent in user-generated content. A number of machine learning and deep learning models were pre-trained on a related large data set of Sinhala News comments in order to report zero-shot results on our Sinhala YouTube comment data set. An optimized Multi-Layer Perceptron model, after extensive hyperparameter tuning, achieved a ROC-AUC score of 0.887. The model is a three-layer MLP with a configuration of 256, 128, and 64 neurons. This research contributes a valuable annotated dataset and provides insights for future work in Sinhala Natural Language Processing and music emotion recognition.
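The agreement statistic quoted in the abstract above can be computed as sketched below. This is the standard Fleiss' kappa formula applied to a table of per-item category counts; the function name and the toy counts in the usage note are hypothetical and unrelated to the GeeSanBhava annotations.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least two).
    The result is undefined (division by zero) if all ratings fall in a
    single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1.0 - p_e)
```

For instance, with three raters and two categories, `fleiss_kappa([[3, 0], [0, 3]])` gives full agreement, while adding split items such as `[2, 1]` lowers the value.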
TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting SP
Accurate typhoon track forecasting is crucial for early system warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such as typhoon tracks. To address this challenge, we propose TyphoFormer, a novel framework that incorporates natural language descriptions as auxiliary prompts to improve typhoon trajectory forecasting. For each time step, we use a Large Language Model (LLM) to generate concise textual descriptions based on the numerical attributes recorded in the North Atlantic hurricane database. The language descriptions capture high-level meteorological semantics and are embedded as auxiliary special tokens prepended to the numerical time series input. By integrating both textual and sequential information within a unified Transformer encoder, TyphoFormer enables the model to leverage contextual cues that are otherwise inaccessible through numerical features alone. Extensive experiments on the HURDAT2 benchmark show that TyphoFormer consistently outperforms other state-of-the-art baseline methods, particularly under challenging scenarios involving nonlinear path shifts and limited historical observations.
comment: Accepted by ACM SIGSPATIAL 2025. Received SIGSPATIAL '25 Best Short Paper Award
Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts
Advanced reasoning in LLMs on challenging domains like mathematical reasoning can be tackled using reinforced fine-tuning (ReFT) based on verifiable rewards. In standard ReFT frameworks, a behavior model generates multiple completions with answers per problem, and each answer is then scored by a reward function. While such RL post-training methods demonstrate significant performance improvements across challenging reasoning domains, the computational cost of generating completions during training with multiple inference steps makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. The behavior model, configured with dynamic layer skipping per batch during training, decreases the inference cost compared to the standard ReFT frameworks. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes. Additionally, we explore three variants of bias mitigation to minimize the off-policyness in the gradient updates, which allows performance to match the baseline ReFT performance.
MGen: Millions of Naturally Occurring Generics in Context SC
MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset have long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of generic sentences in the dataset and report two insights: generics can be long sentences (averaging over 16 words), and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
comment: Presented at SCiL 2025
MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
Large language models (LLMs) are starting to complement traditional information seeking mechanisms such as web search. LLM-powered chatbots like ChatGPT are gaining prominence among the general public. AI chatbots are also increasingly producing content on social media platforms. However, LLMs are also prone to hallucinations, generating plausible yet factually incorrect or fabricated information. This becomes a critical problem when laypeople start seeking information about sensitive issues such as healthcare. Existing work on LLM hallucinations in the medical domain mainly focuses on testing the medical knowledge of LLMs through standardized medical exam questions, which are often well-defined and clear-cut with definitive answers. However, these approaches may not fully capture how these LLMs perform during real-world interactions with patients. This work conducts a pioneering study on hallucinations in LLM-generated responses to real-world healthcare queries from patients. We introduce MedHalu, a novel medical hallucination benchmark featuring diverse health-related topics and hallucinated responses from LLMs, with detailed annotation of the hallucination types and text spans. We also propose MedHaluDetect, a comprehensive framework for evaluating LLMs' abilities to detect hallucinations. Furthermore, we study the vulnerability to medical hallucinations among three groups -- medical experts, LLMs, and laypeople. Notably, LLMs significantly underperform human experts and, in some cases, even laypeople in detecting medical hallucinations. To improve hallucination detection, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, significantly improving hallucination detection for all LLMs, including a 6.3% macro-F1 improvement for GPT-4. Our code and dataset are available at https://netsys.surrey.ac.uk/datasets/medhalu/.
comment: Accepted at ICWSM2026. https://netsys.surrey.ac.uk/datasets/medhalu/
Human-Computer Interaction
Enhancing Large Language Models for Automated Homework Assessment in Undergraduate Circuit Analysis
This research full paper presents an enhancement pipeline for large language models (LLMs) in assessing homework for an undergraduate circuit analysis course, aiming to improve LLMs' capacity to provide personalized support to electrical engineering students. Existing evaluations have demonstrated that GPT-4o possesses promising capabilities in assessing student homework in this domain. Building on these findings, we enhance GPT-4o's performance through multi-step prompting, contextual data augmentation, and the incorporation of targeted hints. These strategies effectively address common errors observed in GPT-4o's responses when using simple prompts, leading to a substantial improvement in assessment accuracy. Specifically, the correct response rate for GPT-4o increases from 74.71% to 97.70% after applying the enhanced prompting and augmented data on entry-level circuit analysis topics. This work lays a foundation for the effective integration of LLMs into circuit analysis instruction and, more broadly, into engineering education.
comment: Accepted to 2025 Frontiers in Education (FIE) Conference
Typing Reinvented: Towards Hands-Free Input via sEMG
We explore surface electromyography (sEMG) as a non-invasive input modality for mapping muscle activity to keyboard inputs, targeting immersive typing in next-generation human-computer interaction (HCI). This is especially relevant for spatial computing and virtual reality (VR), where traditional keyboards are impractical. Using attention-based architectures, we significantly outperform the existing convolutional baselines, reducing online generic character error rate (CER) from 24.98% to 20.34% and offline personalized CER from 10.86% to 10.10%, while remaining fully causal. We further incorporate a lightweight decoding pipeline with language-model-based correction, demonstrating the feasibility of accurate, real-time muscle-driven text input for future wearable and spatial interfaces.
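The CER figures in the abstract above follow the usual character error rate definition: the Levenshtein edit distance between the decoded text and the reference, divided by the reference length. A minimal sketch is given below; the function names are hypothetical and the paper's exact evaluation script may differ (for example, in how it normalizes text).

```python
def edit_distance(ref, hyp):
    # Levenshtein distance using a rolling row of the dynamic-programming table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # Character error rate relative to the reference length.
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, decoding "hallo" when the user typed "hello" is one substitution over five reference characters.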
Towards Automating Data Access Permissions in AI Agents
As AI agents attempt to autonomously act on users' behalf, they raise transparency and control issues. We argue that permission-based access control is indispensable in providing meaningful control to the users, but conventional permission models are inadequate for the automated agentic execution paradigm. We therefore propose automated permission management for AI agents. Our key idea is to conduct a user study to identify the factors influencing users' permission decisions and to encode these factors into an ML-based permission management assistant capable of predicting users' future decisions. We find that participants' permission decisions are influenced by communication context but, importantly, individual preferences tend to remain consistent within contexts and align with those of other participants. Leveraging these insights, we develop a permission prediction model achieving 85.1% accuracy overall and 94.4% for high-confidence predictions. We find that even without using permission history, our model achieves an accuracy of 66.9%, and a slight increase in training samples (i.e., 1-4) can substantially increase the accuracy by 10.8%.
comment: Accepted by the IEEE Symposium on Security and Privacy (S&P) 2026
Three-Class Emotion Classification for Audiovisual Scenes Based on Ensemble Learning Scheme
Emotion recognition plays a pivotal role in enhancing human-computer interaction, particularly in movie recommendation systems where understanding emotional content is essential. While multimodal approaches combining audio and video have demonstrated effectiveness, their reliance on high-performance graphical computing limits deployment on resource-constrained devices such as personal computers or home audiovisual systems. To address this limitation, this study proposes a novel audio-only ensemble learning framework capable of classifying movie scenes into three emotional categories: Good, Neutral, and Bad. The model integrates ten support vector machines and six neural networks within a stacking ensemble architecture to enhance classification performance. A tailored data preprocessing pipeline, including feature extraction, outlier handling, and feature engineering, is designed to optimize emotional information from audio inputs. Experiments on a simulated dataset achieve 67% accuracy, while a real-world dataset collected from 15 diverse films yields an impressive 86% accuracy. These results underscore the potential of audio-based, lightweight emotion recognition methods for broader consumer-level applications, offering both computational efficiency and robust classification capabilities.
Exploring Multiview UI Layouts and Placement Strategies for Collaborative Sensemaking in Virtual Reality
Immersive technologies expand the potential for collaborative sense-making and visual analysis via head-worn displays (HWDs), offering customizable, high-resolution perspectives of a shared visualization space. In such an immersive environment, window/view management is crucial for collaborative sense-making tasks. However, the role of document types (graphs, images) and pair dynamics in collaborative layout formation has rarely been explored. We conducted a user study with 20 participants to explore how pairs of users organize multiview windows in remote immersive workspaces during tasks such as search, comparison, and classification. Findings show that users often arrange windows in a semi-circular layout for pair collaboration. Image+text documents reduce mental and temporal demand in comparison tasks, while graphs lower task load for classification. Conflicts in window selection arise mainly in complex comparisons, with frequent discussion and reorganization during difficult tasks. Based on these insights, we propose design guidelines for multiview systems that support VR collaboration and brainstorming.
comment: 12 pages, 9 figures
AnimAgents: Coordinating Multi-Stage Animation Pre-Production with Human-Multi-Agent Collaboration
Animation pre-production lays the foundation of an animated film by transforming initial concepts into a coherent blueprint across interdependent stages such as ideation, scripting, design, and storyboarding. While generative AI tools are increasingly adopted in this process, they remain isolated, requiring creators to juggle multiple systems without integrated workflow support. Our formative study with 12 professional creative directors and independent animators revealed key challenges in their current practice: creators must manually coordinate fragmented outputs, manage large volumes of information, and struggle to maintain continuity and creative control between stages. Based on these insights, we present AnimAgents, a human-multi-agent collaborative system that coordinates complex, multi-stage workflows through a core agent and specialized agents, supported by dedicated boards for the four major stages of pre-production. AnimAgents enables stage-aware orchestration, stage-specific output management, and element-level refinement, providing an end-to-end workflow tailored to professional practice. In a within-subjects summative study with 16 professional creators, AnimAgents significantly outperformed a strong single-agent baseline equipped with advanced parallel image generation in coordination, consistency, information management, and overall satisfaction (p < .01). A field deployment with 4 creators further demonstrated AnimAgents' effectiveness in real-world projects.
A superpersuasive autonomous policy debating system AAAI 2026
The capacity for highly complex, evidence-based, and strategically adaptive persuasion remains a formidable challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface-realized and synthesized to audio with OpenAI TTS, and then displayed as talking-head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human-AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI-human and AI-AI rounds. In preliminary evaluations against human-authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio, and talking-head video here: https://github.com/Hellisotherpeople/DeepDebater/tree/main
comment: Accepted to CLIP workshop at AAAI 2026
"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with VLMs
Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal products, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues, like blur and misframing of items, affect the accuracy of VLM-generated captions and whether resulting captions meet BLV people's information needs. Grounded in a survey with 86 BLV people, we systematically evaluate how image quality issues affect captions generated by VLMs. We show that the best model recognizes products in images with no quality issues with 98% accuracy, but drops to 75% accuracy overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people's experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.
comment: Paper under review
Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models NeurIPS 2025
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on the object's geometry and functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
comment: Accepted to NeurIPS 2025, Conference on Neural Information Processing Systems, Creative AI Track
Perceptogram: Reconstructing Visual Percepts and Presumptive Electrode Preference from EEG
Visual neural decoding from EEG has improved significantly due to diffusion models that can reconstruct high-quality images from decoded latents. While recent works have focused on relatively complex architectures to achieve good reconstruction performance from EEG, less attention has been paid to the source of this information. We present a unified framework that not only enables image reconstruction from EEG using a simple linear decoder, but also isolates interpretable EEG feature maps that support visual perception. Unlike prior approaches that rely on deep, opaque models, our method leverages the inherent structure of CLIP embeddings to keep the mapping linear. We show that training a simple linear decoder from EEG to CLIP latent space, followed by a frozen pre-trained diffusion model, is sufficient to decode images with state-of-the-art reconstruction performance. Beyond reconstruction, Perceptogram enables the visualization of presumptive electrode preference and EEG patterns, revealing interpretable EEG feature maps that correspond to distinct visual attributes, such as semantic class, texture, and hue. We thus use our framework, Perceptogram, to probe EEG signals at various levels of the visual information hierarchy.
Toward Adaptive Categories: Dimensional Governance for Agentic AI
As AI systems evolve from static tools to dynamic agents, traditional categorical governance frameworks -- based on fixed risk tiers, levels of autonomy, or human oversight models -- are increasingly insufficient on their own. Systems built on foundation models, self-supervised learning, and multi-agent architectures increasingly blur the boundaries that categories were designed to police. In this Perspective, we make the case for dimensional governance: a framework that tracks how decision authority, process autonomy, and accountability (the 3As) distribute dynamically across human-AI relationships. A critical advantage of this approach is its ability to explicitly monitor system movement toward and across key governance thresholds, enabling preemptive adjustments before risks materialize. This dimensional approach provides the necessary foundation for more adaptive categorization, enabling thresholds and classifications that can evolve with emerging capabilities. While categories remain essential for decision-making, building them upon dimensional foundations allows for context-specific adaptability and stakeholder-responsive governance that static approaches cannot achieve. We outline key dimensions, critical trust thresholds, and practical examples illustrating where rigid categorical frameworks fail -- and where a dimensional mindset could offer a more resilient and future-proof path forward for both governance and innovation at the frontier of artificial intelligence.
comment: 12 pages core text, 15 pages including references, 2 figures
DreamGarden: A Designer Assistant for Growing Games from a Single Prompt
Coding assistants are increasingly leveraged in game design, both generating code and making high-level plans. To what degree can these tools align with developer workflows, and what new modes of human-computer interaction can emerge from their use? We present DreamGarden, an AI system capable of assisting with the development of diverse game environments in Unreal Engine. At the core of our method is an LLM-driven planner, capable of breaking down a single, high-level prompt -- a dream, memory, or imagined scenario provided by a human user -- into a hierarchical action plan, which is then distributed across specialized submodules facilitating concrete implementation. This system is presented to the user as a garden of plans and actions, both growing independently and responding to user intervention via seed prompts, pruning, and feedback. Through a user study, we explore design implications of this system, charting courses for future work in semi-autonomous assistants and open-ended simulation design.
comment: 30 pages + appendix, 11 figures, published at CHI 2025
VoxCity: A Seamless Framework for Open Geospatial Data Integration, Grid-Based Semantic 3D City Model Generation, and Urban Environment Simulation
Three-dimensional urban environment simulation is a powerful tool for informed urban planning. However, the intensive manual effort required to prepare input 3D city models has hindered its widespread adoption. To address this challenge, we present VoxCity, an open-source Python package that provides a one-stop solution for grid-based 3D city model generation and urban environment simulation for cities worldwide. VoxCity's 'generator' subpackage automatically downloads building heights, tree canopy heights, land cover, and terrain elevation within a specified target area, and voxelizes buildings, trees, land cover, and terrain to generate an integrated voxel city model. The 'simulator' subpackage enables users to conduct environmental simulations, including solar radiation and view index analyses. Users can export the generated models using several file formats compatible with external software, such as ENVI-met (INX), Blender, and Rhino (OBJ). We generated 3D city models for eight global cities, and demonstrated the calculation of solar irradiance, sky view index, and green view index. We also showcased microclimate simulation and 3D rendering visualization with ENVI-met and Rhino, respectively, via the file export function. Additionally, we reviewed openly available geospatial data to create guidelines to help users choose appropriate data sources depending on their target areas and purposes. VoxCity can significantly reduce the effort and time required for 3D city model preparation and promote the utilization of urban environment simulations. This contributes to more informed urban and architectural design that considers environmental impacts, and in turn, fosters sustainable and livable cities. VoxCity is released openly at https://github.com/kunifujiwara/VoxCity.
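To make the voxelization step in the VoxCity entry above concrete, here is a minimal sketch of turning a 2D grid of building heights into a 3D occupancy grid. It does not use or reproduce the VoxCity API: the function name `voxelize_heights`, the nested-list layout, and the rule that a voxel is filled when its lower face lies below the building height are all assumptions made for illustration.

```python
import math

def voxelize_heights(height_grid, voxel_size):
    """Convert a 2D grid of heights (metres) into occupancy vox[row][col][z].

    Assumption: a voxel at level z is occupied when its lower face,
    z * voxel_size, lies strictly below the cell's height.
    """
    n_levels = max(math.ceil(h / voxel_size) for row in height_grid for h in row)
    return [
        [[1 if z * voxel_size < h else 0 for z in range(n_levels)] for h in row]
        for row in height_grid
    ]

# Toy 2x2 area: empty lot, 5 m building, 10 m building, 2.5 m building.
heights = [[0.0, 5.0], [10.0, 2.5]]
vox = voxelize_heights(heights, voxel_size=5.0)
```

With a 5 m voxel, the 2.5 m building still fills its ground-level voxel, which shows how grid resolution rounds heights up under this rule.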
Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI
Millions of users now design personalized LLM-based chatbots that shape their daily interactions, yet they can only roughly anticipate how their design choices will manifest as behaviors in deployment. This opacity is consequential: seemingly innocuous prompts can trigger excessive sycophancy, toxicity, or other undesirable traits, degrading utility and raising safety concerns. To address this issue, we introduce an interface that enables neural transparency by exposing language model internals during chatbot design. Our approach extracts behavioral trait vectors (empathy, toxicity, sycophancy, etc.) by computing differences in neural activations between contrastive system prompts that elicit opposing behaviors. We predict chatbot behaviors by projecting the system prompt's final token activations onto these trait vectors, normalizing for cross-trait comparability, and visualizing results via an interactive sunburst diagram. To evaluate this approach, we conducted an online user study using Prolific to compare our neural transparency interface against a baseline chatbot interface without any form of transparency. Our analyses suggest that users systematically miscalibrated AI behavior: participants misjudged trait activations for eleven of fifteen analyzable traits, motivating the need for transparency tools in everyday human-AI interaction. While our interface did not change design iteration patterns, it significantly increased user trust and was enthusiastically received. Qualitative analysis revealed nuanced user experiences with the visualization, suggesting interface and interaction improvements for future work. This work offers a path for how mechanistic interpretability can be operationalized for non-technical users, establishing a foundation for safer, more aligned human-AI interactions.
comment: SK and AB are co-first authors
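The trait-vector idea in the Neural Transparency entry above (difference of activations between contrastive prompts, then projection of a prompt's final-token activation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: activations are plain Python lists, the trait vector is the unit-normalized difference of mean activations, and the helper names `mean_vector`, `trait_vector`, and `trait_score` are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def trait_vector(pos_acts, neg_acts):
    """Unit-normalized difference of mean activations (assumed construction)."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def trait_score(activation, trait):
    """Scalar projection of an activation onto a unit trait vector."""
    return sum(a * t for a, t in zip(activation, trait))

# Toy activations for prompts that elicit vs. suppress a trait.
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
sycophancy = trait_vector(pos, neg)
score = trait_score([0.5, 4.0, -1.0], sycophancy)
```

Normalizing each trait vector to unit length is one simple way to keep scores comparable across traits; the abstract does not specify which normalization the interface uses.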
Computer Vision and Pattern Recognition
Native 3D Editing with Full Attention
Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.
EvDiff: High Quality Video with an Event Camera
As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.
comment: Website at https://web.stanford.edu/~markendo/projects/downscaling_intelligence
An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI
The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.
comment: 17 pages, 7 figures
Radar2Shape: 3D Shape Reconstruction from High-Frequency Radar using Multiresolution Signed Distance Functions
Determining the shape of 3D objects from high-frequency radar signals is analytically complex but critical for commercial and aerospace applications. Previous deep learning methods have been applied to radar modeling; however, they often fail to represent arbitrary shapes or have difficulty with real-world radar signals which are collected over limited viewing angles. Existing methods in optical 3D reconstruction can generate arbitrary shapes from limited camera views, but struggle when they naively treat the radar signal as a camera view. In this work, we present Radar2Shape, a denoising diffusion model that handles a partially observable radar signal for 3D reconstruction by correlating its frequencies with multiresolution shape features. Our method consists of a two-stage approach: first, Radar2Shape learns a regularized latent space with hierarchical resolutions of shape features, and second, it diffuses into this latent space by conditioning on the frequencies of the radar signal in an analogous coarse-to-fine manner. We demonstrate that Radar2Shape can successfully reconstruct arbitrary 3D shapes even from partially-observed radar signals, and we show robust generalization to two different simulation methods and real-world data. Additionally, we release two synthetic benchmark datasets to encourage future research in the high-frequency radar domain so that models like Radar2Shape can safely be adapted into real-world radar systems.
Counterfactual World Models via Digital Twin-conditioned Video Diffusion
World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object was removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.
GPR-OdomNet: Difference and Similarity-Driven Odometry Estimation Network for Ground Penetrating Radar-Based Localization
When performing robot/vehicle localization using ground penetrating radar (GPR) to handle adverse weather and environmental conditions, existing techniques often struggle to accurately estimate distances when processing B-scan images with minor distinctions. This study introduces a new neural network-based odometry method that leverages the similarity and difference features of GPR B-scan images for precise estimation of the Euclidean distances traveled between the B-scan images. The new custom neural network extracts multi-scale features from B-scan images taken at consecutive moments and then determines the Euclidean distance traveled by analyzing the similarities and differences between these features. To evaluate our method, an ablation study and comparison experiments have been conducted using the publicly available CMU-GPR dataset. The experimental results show that our method consistently outperforms state-of-the-art counterparts in all tests. Specifically, our method achieves an overall weighted root mean square error (RMSE) of 0.449 m across all datasets, a 10.2% reduction in RMSE compared to the best state-of-the-art method.
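For readers unfamiliar with the metric reported in the GPR-OdomNet entry above, the sketch below computes a per-dataset RMSE and one possible "overall weighted RMSE". The weighting scheme is an assumption: here each dataset's squared error is weighted by its sample count, which is equivalent to pooling all samples. The abstract does not state which weighting the authors use, and the function names are hypothetical.

```python
import math

def rmse(pred, true):
    """Root mean square error between predicted and true distances."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def weighted_rmse(per_set):
    """Combine (rmse, n_samples) pairs, weighting squared errors by sample count.

    Assumption: this pooled form is one common reading of 'weighted RMSE'.
    """
    total = sum(n for _, n in per_set)
    return math.sqrt(sum(n * r * r for r, n in per_set) / total)

single = rmse([1.0, 2.0], [1.0, 4.0])
overall = weighted_rmse([(1.0, 1), (2.0, 3)])
```

Weighting squared errors rather than averaging the RMSE values directly keeps the result identical to the RMSE over the concatenated test sets.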
Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift BMVC 2025
Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximizing the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.
comment: Accepted at BMVC 2025
Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition
We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth infers a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
comment: website: https://sketchverify.github.io/
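The test-time sampling and verification loop described in the SketchVerify entry above can be sketched generically as below. This is only a control-flow illustration: `propose` and `score` stand in for the motion planner and the vision-language verifier, and the stopping threshold, round limit, and the idea of passing the current best plan back to the proposer are assumptions, not details from the paper.

```python
def plan_with_verification(propose, score, n_candidates=4, max_rounds=3, threshold=0.8):
    """Sample candidate plans, keep the best-scoring one, stop once it is good enough."""
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        # Hypothetical proposer: may refine around the current best plan.
        for plan in propose(n_candidates, best_plan):
            s = score(plan)  # Hypothetical verifier score in [0, 1].
            if s > best_score:
                best_plan, best_score = plan, s
        if best_score >= threshold:
            break
    return best_plan, best_score

# Toy stand-ins: a "plan" is a single number and the ideal plan is 0.5.
def toy_propose(n, prev):
    center = 0.0 if prev is None else prev
    return [center + 0.1 * (i + 1) for i in range(n)]

def toy_score(plan):
    return 1.0 - abs(plan - 0.5)

plan, plan_score = plan_with_verification(toy_propose, toy_score)
```

In this toy run the first round already yields a plan above the threshold, so the loop exits without further proposals; in the setting the abstract describes, each `score` call would evaluate a cheap rendered sketch rather than a full diffusion sample.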
MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our code is available at https://github.com/itsnotacie/MMT-ARD.
comment: 10 pages
REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing
Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.
comment: Code and data available at https://github.com/be-chen/REMSA
SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation
Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
comment: 23 pages, 6 tables, 9 figures
Self-Supervised Learning by Curvature Alignment
Self-supervised learning (SSL) has recently advanced through non-contrastive methods that couple an invariance term with variance, covariance, or redundancy-reduction penalties. While such objectives shape first- and second-order statistics of the representation, they largely ignore the local geometry of the underlying data manifold. In this paper, we introduce CurvSSL, a curvature-regularized self-supervised learning framework, and its RKHS extension, kernel CurvSSL. Our approach retains a standard two-view encoder-projector architecture with a Barlow Twins-style redundancy-reduction loss on projected features, but augments it with a curvature-based regularizer. Each embedding is treated as a vertex whose $k$ nearest neighbors define a discrete curvature score via cosine interactions on the unit hypersphere; in the kernel variant, curvature is computed from a normalized local Gram matrix in an RKHS. These scores are aligned and decorrelated across augmentations by a Barlow-style loss on a curvature-derived matrix, encouraging both view invariance and consistency of local manifold bending. Experiments on MNIST and CIFAR-10 datasets with a ResNet-18 backbone show that curvature-regularized SSL yields competitive or improved linear evaluation performance compared to Barlow Twins and VICReg. Our results indicate that explicitly shaping local geometry is a simple and effective complement to purely statistical SSL regularizers.
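The CurvSSL entry above builds on a Barlow Twins-style redundancy-reduction loss. As a reference point, here is a small pure-Python sketch of that standard loss: standardize each feature across the batch, form the cross-correlation matrix between the two views, push its diagonal toward 1 and its off-diagonal entries toward 0. The curvature regularizer itself is not reproduced; the value of `lam` and the function names are assumptions for illustration.

```python
import math

def standardize(batch):
    """Zero-mean, unit-variance per feature dimension across the batch."""
    n = len(batch)
    out_cols = []
    for col in zip(*batch):
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        out_cols.append([(x - mean) / (std + 1e-12) for x in col])
    return [list(row) for row in zip(*out_cols)]

def barlow_loss(z1, z2, lam=5e-3):
    """Sum of (1 - C_ii)^2 plus lam * C_ij^2 over the cross-correlation matrix C."""
    a, b = standardize(z1), standardize(z2)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c = sum(a[k][i] * b[k][j] for k in range(n)) / n
            loss += (1.0 - c) ** 2 if i == j else lam * c ** 2
    return loss

view = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
swapped = [[row[1], row[0]] for row in view]  # features permuted across views
aligned_loss = barlow_loss(view, view)
swapped_loss = barlow_loss(view, swapped)
```

Identical views give a loss near zero, while permuting the feature dimensions moves all correlation off the diagonal and is penalized. According to the abstract, CurvSSL applies a loss of this style to a curvature-derived matrix as well; how that matrix is built is not shown here.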
Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers
Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:020
Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required? NeurIPS
Vision Transformers ($\text{ViTs}$) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block - channel-wise comparisons lead to a quadratic growth in attention, resulting in excessive $\text{FLOPs}$ and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: "Is it necessary to model all channel interactions?" Inspired by the philosophy of Sparse Mixture-of-Experts ($\text{MoE}$), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in $\text{ViTs}$, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets - JUMP-CP and So2Sat - demonstrate that $\text{MoE-ViT}$ achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.
comment: This has been accepted at the NeurIPS AI4Science Workshop 2025
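The routing idea in the MoE-ViT abstract can be illustrated with a generic top-k router. This is a minimal sketch under assumptions: the abstract does not describe the router's form, so `route_channels`, the softmax renormalization, and the cost model in `attention_pairs` (only selected channel tokens attend to each other) are all hypothetical.

```python
import math

def route_channels(router_logits, k):
    """Pick the k highest-scoring channels for one patch.

    Returns (channel indices, softmax weights renormalized over the selection).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda c: router_logits[c], reverse=True)
    top = order[:k]
    peak = max(router_logits[c] for c in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[c] - peak) for c in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def attention_pairs(num_patches, num_channels):
    """Query-key pairs when every (patch, channel) token attends to every other token."""
    tokens = num_patches * num_channels
    return tokens * tokens

# Hypothetical router scores for a 4-channel patch; keep the best 2 channels.
selected, weights = route_channels([0.1, 2.0, -1.0, 1.5], k=2)

# Rough cost picture: 8 channels versus 2 routed channels on a 14x14 patch grid.
dense_cost = attention_pairs(196, 8)
sparse_cost = attention_pairs(196, 2)
```

Under this simplified cost model, keeping 2 of 8 channels cuts the number of attention pairs by a factor of (8/2)^2 = 16, which is the quadratic effect the abstract refers to.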
MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment AAAI 2026
Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.
comment: AAAI 2026
Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks
Face verification is a significant component of identity authentication in various applications including online banking and secure access to personal devices. Most existing face image datasets suffer from notable biases related to race, gender, and other demographic characteristics, limiting the effectiveness and fairness of face verification systems. In response to these challenges, we propose a comprehensive methodology that integrates advanced generative models to create diverse, high-quality synthetic face images. This methodology emphasizes the representation of a diverse range of facial traits, ensuring adherence to characteristics permissible in identity card photographs. Furthermore, we introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities, designed as a benchmark for future research in face verification. Our analysis reveals that existing verification models exhibit biases toward certain genders and races, and notably, applying identity style modifications negatively impacts model performance. By tackling the inherent inequities in existing datasets, this work not only enriches the discussion on diversity and ethics in artificial intelligence but also lays the foundation for developing more inclusive and reliable face verification technologies.
MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.
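The group-relative normalization at the heart of Group Relative Policy Optimization can be sketched as follows. The reward values are hypothetical (imagined here as Dice scores of trajectories sampled for one image pair); the abstract does not state how MorphSeek defines its rewards.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its own sampling group.

    Trajectories scoring above the group mean get a positive advantage,
    those below get a negative one; no learned value function is needed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for three trajectories sampled from the same input.
advantages = group_relative_advantages([0.70, 0.74, 0.78])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the baseline is the group mean, sampling several trajectories per input is what makes the advantage estimate meaningful.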
IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation
While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance under dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the "collision rate" and "warning rate" metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance, and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environments.
Non-Parametric Probabilistic Robustness: A Conservative Metric with Optimized Perturbation Distributions
Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating the recent proposal of probabilistic robustness (PR) as a complementary alternative to adversarial robustness (AR). However, existing PR formulations assume a fixed and known perturbation distribution, an unrealistic expectation in practice. To address this limitation, we propose non-parametric probabilistic robustness (NPPR), a more practical PR metric that does not rely on any predefined perturbation distribution. Following the non-parametric paradigm in statistical modeling, NPPR learns an optimized perturbation distribution directly from data, enabling conservative PR evaluation under distributional uncertainty. We further develop an NPPR estimator based on a Gaussian Mixture Model (GMM) with Multilayer Perceptron (MLP) heads and bicubic up-sampling, covering various input-dependent and input-independent perturbation scenarios. Theoretical analyses establish the relationships among AR, PR, and NPPR. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across ResNet18/50, WideResNet50, and VGG16 validate NPPR as a more practical robustness metric, showing up to 40\% more conservative (lower) PR estimates than those obtained by assuming the common perturbation distributions used in state-of-the-art methods.
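The difference between PR under a fixed distribution and a conservative estimate over several distributions can be shown with a toy Monte Carlo sketch. This is not the NPPR estimator: the abstract's learned GMM with MLP heads is replaced here by a minimum over a small hand-picked family of samplers, and the one-dimensional classifier, input, and distributions are all hypothetical.

```python
import random

def probabilistic_robustness(classify, x, sample_perturbation, n_samples=2000, seed=0):
    """Monte Carlo estimate of P[classify(x + delta) == classify(x)] for delta ~ sampler."""
    rng = random.Random(seed)
    label = classify(x)
    hits = sum(classify(x + sample_perturbation(rng)) == label
               for _ in range(n_samples))
    return hits / n_samples

def conservative_pr(classify, x, candidate_samplers, **kwargs):
    """Worst-case (lowest) PR over a family of candidate perturbation distributions."""
    return min(probabilistic_robustness(classify, x, s, **kwargs)
               for s in candidate_samplers)

# Toy setup: a threshold classifier on the real line and an input at 0.5.
classify = lambda v: int(v > 0.0)
x = 0.5
uniform = lambda rng: rng.uniform(-1.0, 1.0)   # the "assumed" fixed distribution
shifted = lambda rng: rng.gauss(-0.4, 0.3)     # mass pushed toward the decision boundary
benign = lambda rng: rng.gauss(0.4, 0.3)       # mass pushed away from the boundary

pr_fixed = probabilistic_robustness(classify, x, uniform)
pr_worst = conservative_pr(classify, x, [uniform, shifted, benign])
```

The conservative value can only be lower than or equal to the fixed-distribution value, which matches the direction of the gap reported in the abstract.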
METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model
Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To overcome these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon them, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate in six real-world tasks. Experimental results also highlight its superior generalization and robustness to out-of-distribution scenarios. These findings emphasize METIS as a promising step toward a generalist model for dexterous manipulation.
SVRecon: Sparse Voxel Rasterization for Surface Reconstruction
We extend the recently proposed sparse voxel rasterization paradigm to the task of high-fidelity surface reconstruction by integrating a Signed Distance Function (SDF); we call the resulting method SVRecon. Unlike 3D Gaussians, sparse voxels are spatially disentangled from their neighbors and have sharp boundaries, which makes them prone to local minima during optimization. Although SDF values provide a naturally smooth and continuous geometric field, preserving this smoothness across independently parameterized sparse voxels is nontrivial. To address this challenge, we promote coherent and smooth voxel-wise structure through (1) robust geometric initialization using a visual geometry model and (2) a spatial smoothness loss that enforces coherent relationships across parent-child and sibling voxel groups. Extensive experiments across various benchmarks show that our method achieves strong reconstruction accuracy while converging consistently fast. The code will be made public.
ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP
Despite its remarkable success in zero-shot image-text matching, CLIP remains highly vulnerable to adversarial perturbations on images. As adversarial fine-tuning is prohibitively costly, recent works explore various test-time defense strategies; however, these approaches still exhibit limited robustness. In this work, we revisit this problem and propose a simple yet effective strategy: Augmentation-based Test-time Adversarial Correction (ATAC). Our method operates directly in the embedding space of CLIP, calculating augmentation-induced drift vectors to infer a semantic recovery direction and correcting the embedding based on the angular consistency of these latent drifts. Across a wide range of benchmarks, ATAC consistently achieves remarkably high robustness, surpassing that of previous state-of-the-art methods by nearly 50\% on average, all while requiring minimal computational overhead. Furthermore, ATAC retains state-of-the-art robustness in unconventional and extreme settings and even achieves nontrivial robustness against adaptive attacks. Our results demonstrate that ATAC is an efficient method in a novel paradigm for test-time adversarial defenses in the embedding space of CLIP.
comment: 16 pages
SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation
Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75\% reduction in memory footprint, 124\% faster inference, and a 5.9\% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84\% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be released as open source.
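The "diverse shape set" of superquadrics comes from two exponents in the standard inside-outside function. The sketch below uses the classical textbook parameterization; whether SuperQuadricOcc uses exactly this form is an assumption, and the function name is hypothetical.

```python
def superquadric_inside_outside(point, scales, e1, e2):
    """Classical superquadric inside-outside function in the primitive's local frame.

    Returns F < 1 for points inside, F == 1 on the surface, F > 1 outside.
    scales = (a1, a2, a3) are the axis half-lengths; e1, e2 control the shape
    (1.0 gives an ellipsoid, values near 0 give box-like shapes).
    """
    x, y, z = (abs(p / a) for p, a in zip(point, scales))
    xy_term = (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + z ** (2.0 / e1)

probe = (0.8, 0.8, 0.8)
# Same unit scales, different exponents: sphere versus rounded box.
f_sphere = superquadric_inside_outside(probe, (1.0, 1.0, 1.0), e1=1.0, e2=1.0)
f_box = superquadric_inside_outside(probe, (1.0, 1.0, 1.0), e1=0.2, e2=0.2)
```

The probe point lies outside the unit sphere but inside the box-like primitive, so a single superquadric can cover a corner region that would otherwise need several Gaussians.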
UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification
Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.
DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture
Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues -- a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
comment: Project page: https://github.com/SkyShunsuke/DSeq-JEPA
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems, including single-lens and metalens designs, is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatially varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These results suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler.
comment: All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler
Label-Efficient Skeleton-based Recognition with Stable-Invertible Graph Convolutional Networks
Skeleton-based action recognition is a hotspot in image processing. A key challenge of this task lies in its dependence on large, manually labeled datasets whose acquisition is costly and time-consuming. This paper devises a novel, label-efficient method for skeleton-based action recognition using graph convolutional networks (GCNs). The contribution of the proposed method resides in learning a novel acquisition function -- scoring the most informative subsets for labeling -- as the optimum of an objective function mixing data representativity, diversity and uncertainty. We also extend this approach by learning the most informative subsets using an invertible GCN which allows mapping data from ambient to latent spaces where the inherent distribution of the data is more easily captured. Extensive experiments, conducted on two challenging skeleton-based recognition datasets, show that our label-frugal GCNs are effective and outperform related work.
Loomis Painter: Reconstructing the Painting Process
Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
Refracting Reality: Generating Images with Realistic Transparent Objects
Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image -- a panorama centered at the object -- using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
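The geometric core of the warping step is the vector form of Snell's law, which maps an incoming ray direction to its refracted direction at a surface. The sketch below covers only that formula; how the paper integrates it into the generation trajectory is not described in enough detail here to reproduce, and the function name is hypothetical.

```python
import math

def refract(direction, normal, n1, n2):
    """Refract a unit ray direction at a surface using Snell's law.

    direction: unit vector of the incoming ray.
    normal: unit surface normal pointing toward the side the ray comes from.
    n1, n2: refractive indices of the incident and transmitting media.
    Returns the refracted unit direction, or None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)  # squared cosine of the refracted angle
    if k < 0.0:
        return None  # no transmitted ray exists
    scale = eta * cos_i - math.sqrt(k)
    return tuple(eta * d + scale * n for d, n in zip(direction, normal))

s = math.sin(math.radians(45.0))
up = (0.0, 0.0, 1.0)
# Air to glass at 45 degrees: the ray bends toward the normal.
into_glass = refract((s, 0.0, -s), up, n1=1.0, n2=1.5)
# Glass to air at 45 degrees exceeds the critical angle (about 41.8 degrees).
out_of_glass = refract((s, 0.0, -s), up, n1=1.5, n2=1.0)
```

The refracted direction determines which background pixel a pixel inside the object should agree with, which is the constraint the abstract describes.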
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on the robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, the current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach to feed text embeddings directly into the LLM decoder to mitigate the high abstraction of the information in text by Q-former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves the confirmation and action planning by integrating VideoLLaMA3.
comment: Accepted to ASRU 2025
NoPe-NeRF++: Local-to-Global Optimization of NeRF with No Pose Prior
In this paper, we introduce NoPe-NeRF++, a novel local-to-global optimization algorithm for training Neural Radiance Fields (NeRF) without requiring pose priors. Existing methods, particularly NoPe-NeRF, which focus solely on the local relationships within images, often struggle to recover accurate camera poses in complex scenarios. To overcome these challenges, our approach begins with a relative pose initialization with explicit feature matching, followed by a local joint optimization to enhance the pose estimation for training a more robust NeRF representation. This method significantly improves the quality of initial poses. Additionally, we introduce a global optimization phase that incorporates geometric consistency constraints through bundle adjustment, which integrates feature trajectories to further refine poses and collectively boost the quality of NeRF. Notably, our method is the first work that seamlessly combines the local and global cues with NeRF, and outperforms state-of-the-art methods in both pose estimation accuracy and novel view synthesis. Extensive evaluations on benchmark datasets demonstrate our superior performance and robustness, even in challenging scenes, thus validating our design choices.
MuM: Multi-View Masked Image Modeling for 3D Vision
Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using a pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.
BiFingerPose: Bimodal Finger Pose Estimation for Touch Devices
Finger pose offers promising opportunities to expand the human-computer interaction capabilities of touchscreen devices. Existing finger pose estimation algorithms that can be implemented in portable devices predominantly rely on capacitive images, which are currently limited to estimating pitch and yaw angles and exhibit reduced accuracy when processing large-angle inputs (especially angles greater than 45 degrees). In this paper, we propose BiFingerPose, a novel bimodal finger pose estimation algorithm capable of simultaneously and accurately predicting comprehensive finger pose information. A bimodal input is explored, including a capacitive image and a fingerprint patch obtained from the touchscreen with an under-screen fingerprint sensor. Our approach leads to reliable estimation of roll angle, which is not achievable using only a single modality. In addition, the prediction performance of other pose parameters has also been greatly improved. The evaluation of a 12-person user study on continuous and discrete interaction tasks further validated the advantages of our approach. Specifically, BiFingerPose outperforms previous SOTA methods with over 21% improvement in prediction performance, 2.5 times higher task completion efficiency, and 23% better user operation accuracy, demonstrating its practical superiority. Finally, we delineate the application space of finger pose with respect to enhancing authentication security and improving interactive experiences, and develop corresponding prototypes to showcase the interaction potential. Our code will be available at https://github.com/XiongjunGuan/DualFingerPose.
MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning
Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks, specifically chemical bond classification and atom localization, contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight's relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model's performance on stereomolecular structures. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely used. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
Leveraging CVAE for Joint Configuration Estimation of Multifingered Grippers from Point Cloud Data
This paper presents an efficient approach for determining the joint configuration of a multifingered gripper solely from the point cloud data of its poly-articulated chain, as generated by visual sensors, simulations or even generative neural networks. Well-known inverse kinematics (IK) techniques can provide mathematically exact solutions (when they exist) for joint configuration determination based solely on the fingertip pose, but often require post-hoc decision-making by considering the positions of all intermediate phalanges in the gripper's fingers, or rely on algorithms to numerically approximate solutions for more complex kinematics. In contrast, our method leverages machine learning to implicitly overcome these challenges. This is achieved through a Conditional Variational Auto-Encoder (CVAE), which takes point cloud data of key structural elements as input and reconstructs the corresponding joint configurations. We validate our approach on the MultiDex grasping dataset using the Allegro Hand, operating within 0.05 milliseconds and achieving accuracy comparable to state-of-the-art methods. This highlights the effectiveness of our pipeline for joint configuration estimation within the broader context of AI-driven techniques for grasp planning.
Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing
Training autonomous driving and navigation systems requires large and diverse point cloud datasets that capture complex edge case scenarios from various dynamic urban settings. Acquiring such diverse scenarios from real-world point cloud data, especially for critical edge cases, is challenging, which restricts system generalization and robustness. Current methods rely on simulating point cloud data within handcrafted 3D virtual environments, which is time-consuming, computationally expensive, and often fails to fully capture the complexity of real-world scenes. To address some of these issues, this research proposes a novel approach that edits real-world LiDAR scans using semantic mask-based guidance to generate novel synthetic LiDAR point clouds. We incorporate range image projection and semantic mask conditioning to achieve diffusion-based generation. Point clouds are transformed to 2D range view images, which are used as an intermediate representation to enable semantic editing using convex hull-based semantic masks. These masks guide the generation process by providing information on the dimensions, orientations, and locations of objects in the real environment, ensuring geometric consistency and realism. This approach demonstrates high-quality LiDAR point cloud generation, capable of producing complex edge cases and dynamic scenes, as validated on the KITTI-360 dataset. This offers a cost-effective and scalable solution for generating diverse LiDAR data, a step toward improving the robustness of autonomous driving systems.
comment: 8 pages, 9 figures
A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback WACV'26
Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
comment: Accepted to WACV'26
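The pseudo-relevance feedback idea described in this abstract (refining the query embedding from top-ranked results) can be illustrated with a minimal sketch. This is not the authors' implementation: the Rocchio-style blending, the function names, and the `k` and `alpha` parameters are assumptions chosen for illustration, and the embeddings are stand-ins for VLM outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def pseudo_relevance_feedback(query, gallery, k=5, alpha=0.7):
    """Blend the query embedding with the mean of its top-k retrieved items.

    Hypothetical Rocchio-style update: alpha weights the original query,
    (1 - alpha) weights the centroid of the top-k gallery embeddings.
    """
    q = l2_normalize(np.asarray(query, dtype=float))
    g = l2_normalize(np.asarray(gallery, dtype=float))
    scores = g @ q                       # cosine similarity to every gallery item
    top_k = np.argsort(-scores)[:k]      # indices of the k best-scoring items
    refined = alpha * q + (1.0 - alpha) * g[top_k].mean(axis=0)
    return l2_normalize(refined)

def mrr_at_k(ranked_ids, relevant_id, k=5):
    """Reciprocal rank of the relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0
```

Averaging `mrr_at_k` over all queries gives an MRR@5 figure of the kind the abstract reports; setting `alpha=1.0` recovers retrieval with no feedback.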
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats NeurIPS 2025
Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.
comment: Accepted to NeurIPS 2025, Project Page: https://github.com/SooLab/AllPath
Blind Deconvolution for Color Images Using Normalized Quaternion Kernels
In this work, we address the challenging problem of blind deconvolution for color images. Existing methods often convert color images to grayscale or process each color channel separately, which overlooks the relationships between color channels. To handle this issue, we formulate a novel quaternion fidelity term designed specifically for color image blind deconvolution. This fidelity term leverages the properties of the quaternion convolution kernel, which consists of four kernels: one that functions similarly to a non-negative convolution kernel to capture the overall blur, and three additional unconstrained convolution kernels, corresponding to the red, green and blue channels respectively, that model their unknown interdependencies. In order to preserve image intensity, we propose to use the normalized quaternion kernel in the blind deconvolution process. Extensive experiments on real datasets of blurred color images show that the proposed method effectively removes artifacts and significantly improves the deblurring effect, demonstrating its potential as a powerful tool for color image deconvolution.
Equivariant-Aware Structured Pruning for Efficient Edge Deployment: A Comprehensive Framework with Adaptive Fine-Tuning
This paper presents a novel framework combining group equivariant convolutional neural networks (G-CNNs) with equivariant-aware structured pruning to produce compact, transformation-invariant models for resource-constrained environments. Equivariance to rotations is achieved through the C4 cyclic group via the e2cnn library, enabling consistent performance under geometric transformations while reducing computational overhead. Our approach introduces structured pruning that preserves equivariant properties by analyzing e2cnn layer structure and applying neuron-level pruning to fully connected components. To mitigate accuracy degradation, we implement adaptive fine-tuning that automatically triggers when the accuracy drop exceeds 2%, using early stopping and learning rate scheduling for efficient recovery. The framework includes dynamic INT8 quantization and a comprehensive pipeline encompassing training, knowledge distillation, structured pruning, fine-tuning, and quantization. We evaluate our method on satellite imagery (EuroSAT) and standard benchmarks (CIFAR-10, Rotated MNIST), demonstrating effectiveness across diverse domains. Experimental results show 29.3% parameter reduction with significant accuracy recovery, demonstrating that structured pruning of equivariant networks achieves substantial compression while maintaining geometric robustness. Our pipeline provides a reproducible framework for optimizing equivariant models, bridging the gap between group-theoretic network design and practical deployment constraints, with particular relevance to satellite imagery analysis and geometric vision tasks.
comment: 8 pages, 5 tables, 1 figure. Accepted at IEEE EdgeCom 2025 (11th IEEE International Conference on Edge Computing and Scalable Cloud)
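The adaptive fine-tuning rule in this abstract (trigger when the accuracy drop exceeds 2%, then recover with early stopping) could look roughly like the sketch below. It is an illustration only: the abstract does not say whether the 2% is absolute or relative, so an absolute drop is assumed here, and the function names, the `patience` parameter, and the `evaluate_epoch` callback are hypothetical. Learning rate scheduling is omitted.

```python
def needs_fine_tuning(baseline_acc, pruned_acc, max_drop=0.02):
    """Return True when pruning cost more than max_drop absolute accuracy.

    Assumption: accuracies are fractions in [0, 1] and the 2% threshold
    is an absolute difference, not a relative one.
    """
    return (baseline_acc - pruned_acc) > max_drop

def fine_tune_with_early_stopping(evaluate_epoch, max_epochs=20, patience=3):
    """Run fine-tuning epochs until validation accuracy stops improving.

    evaluate_epoch(epoch) is a hypothetical callback that trains for one
    epoch and returns the validation accuracy. Training stops after
    `patience` consecutive epochs without a new best accuracy.
    """
    best_acc, best_epoch, stale = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        acc = evaluate_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_acc, best_epoch
```

A pipeline would call `needs_fine_tuning` after each pruning step and only pay the fine-tuning cost when it returns True.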
Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables
The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLM models for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.
comment: Accepted as Spotlight Talk at EurIPS 2025 Workshop on AI For Tabular Data
TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making NeurIPS 2025
In daily life, people often move through spaces to find objects that meet their needs, posing a key challenge in embodied AI. Traditional Demand-Driven Navigation (DDN) handles one need at a time but does not reflect the complexity of real-world tasks involving multiple needs and personal choices. To bridge this gap, we introduce Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN), a new benchmark for long-horizon navigation involving multiple sub-demands with explicit task preferences. To solve TP-MDDN, we propose AWMSystem, an autonomous decision-making system composed of three key modules: BreakLLM (instruction decomposition), LocateLLM (goal selection), and StatusMLLM (task monitoring). For spatial memory, we design MASMap, which combines 3D point cloud accumulation with 2D semantic mapping for accurate and efficient environmental understanding. Our Dual-Tempo action generation framework integrates zero-shot planning with policy-based fine control, and is further supported by an Adaptive Error Corrector that handles failure cases in real time. Experiments demonstrate that our approach outperforms state-of-the-art baselines in both perception accuracy and navigation robustness.
comment: Accepted at NeurIPS 2025
QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
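The abstract does not specify the form of its contractive scene representation. As a purely illustrative stand-in, the sketch below uses the well-known mip-NeRF 360 style contraction, which leaves points inside an inner radius untouched and squashes everything beyond it into a bounded shell; the function name and the `inner_radius` parameter are assumptions, and the actual QueryOcc mapping may differ.

```python
import numpy as np

def contract(points, inner_radius=1.0):
    """Map unbounded 3D points into a ball of radius 2 (in inner_radius units).

    Points with norm <= inner_radius are only rescaled (near-field detail is
    preserved); farther points are compressed smoothly so that infinity maps
    to the surface of the radius-2 ball.
    """
    points = np.asarray(points, dtype=float)
    norms = np.linalg.norm(points, axis=-1, keepdims=True)
    r = np.maximum(norms, 1e-12) / inner_radius   # normalized distance, never zero
    scaled = points / inner_radius
    outside = (2.0 - 1.0 / r) * (scaled / r)      # direction times compressed radius
    return np.where(r <= 1.0, scaled, outside)
```

Because the output is bounded regardless of how far the input lies, a fixed-size grid or feature volume over the contracted space covers arbitrarily distant regions, which is one way to keep memory constant for long-range queries.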
Dual-domain Adaptation Networks for Realistic Image Super-resolution
Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. Pre-trained SR models from large-scale synthetic datasets offer valuable prior knowledge, which can improve generalization, speed up training, and reduce the need for extensive real-world data in realistic SR tasks. In this paper, we introduce a novel approach, Dual-domain Adaptation Networks, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets. To achieve this target, we first set up a spatial-domain adaptation strategy through selectively updating parameters of pre-trained models and employing the low-rank adaptation technique to adjust frozen parameters. Recognizing that image super-resolution involves recovering high-frequency components, we further integrate a frequency domain adaptation branch into the adapted model, which combines the spectral data of the input and the spatial-domain backbone's intermediate features to infer HR frequency maps, enhancing the SR result. Experimental evaluations on public realistic image SR benchmarks, including RealSR, D2CRealSR, and DRealSR, demonstrate the superiority of our proposed method over existing state-of-the-art models. Codes are available at: https://github.com/dummerchen/DAN.
FisheyeGaussianLift: BEV Feature Lifting for Surround-View Fisheye Camera Perception
Accurate BEV semantic segmentation from fisheye imagery remains challenging due to extreme non-linear distortion, occlusion, and depth ambiguity inherent to wide-angle projections. We present a distortion-aware BEV segmentation framework that directly processes multi-camera high-resolution fisheye images, utilizing calibrated geometric unprojection and per-pixel depth distribution estimation. Each image pixel is lifted into 3D space via Gaussian parameterization, predicting spatial means and anisotropic covariances to explicitly model geometric uncertainty. The projected 3D Gaussians are fused into a BEV representation via differentiable splatting, producing continuous, uncertainty-aware semantic maps without requiring undistortion or perspective rectification. Extensive experiments demonstrate strong segmentation performance on complex parking and urban driving scenarios, achieving IoU scores of 87.75% for drivable regions and 57.26% for vehicles under severe fisheye distortion and diverse environmental conditions.
comment: 8 pages, 3 figures, published in IMVIP 2025 conference
Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.
SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
Recent advances in dense 3D reconstruction enable the accurate capture of local geometry; however, integrating them into SLAM is challenging due to drift and redundant point maps, which limit efficiency and downstream tasks, such as novel view synthesis. To address these issues, we propose SING3R-SLAM, a globally consistent and compact Gaussian-based dense RGB SLAM framework. The key idea is to combine locally consistent 3D reconstructions with a unified global Gaussian representation that jointly refines scene geometry and camera poses, enabling efficient and versatile 3D mapping for multiple downstream applications. SING3R-SLAM first builds locally consistent submaps through our lightweight tracking and reconstruction module, and then progressively aligns and fuses them into a global Gaussian map that enforces cross-view geometric consistency. This global map, in turn, provides feedback to correct local drift and enhance the robustness of tracking. Extensive experiments demonstrate that SING3R-SLAM achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering, resulting in over 12% improvement in tracking and producing finer, more detailed geometry, all while maintaining a compact and memory-efficient global representation on real-world datasets.
Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning
In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation: learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM's zero-shot priors to preserve strong performance on unseen medical datasets. In experiments across nine medical segmentation datasets under a continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released on https://github.com/azzzzyo/Continual-Alignment-for-SAM.
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
Vision-language-action (VLA) models show potential for general robotic tasks, but still struggle with spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments verify the superiority of our method across different robotic manipulation tasks.
Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism
LLM-driven agents, particularly those using general frameworks like ReAct or human-inspired role-playing, often struggle in specialized domains that necessitate rigorously structured workflows. Fields such as remote sensing, requiring specialized tools (e.g., correction, spectral indices calculation), and multi-step procedures (e.g., numerous intermediate products and optional steps), significantly challenge generalized approaches. To address this gap, we introduce a novel agent design framework centered on a Hierarchical Task Abstraction Mechanism (HTAM). Specifically, HTAM moves beyond emulating social roles, instead structuring multi-agent systems into a logical hierarchy that mirrors the intrinsic task-dependency graph of a given domain. This task-centric architecture thus enforces procedural correctness and decomposes complex problems into sequential layers, where each layer's sub-agents operate on the outputs of the preceding layers. We instantiate this framework as EarthAgent, a multi-agent system tailored for complex geospatial analysis. To evaluate such complex planning capabilities, we build GeoPlan-bench, a comprehensive benchmark of realistic, multi-step geospatial planning tasks. It is accompanied by a suite of carefully designed metrics to evaluate tool selection, path similarity, and logical completeness. Experiments show that EarthAgent substantially outperforms a range of established single- and multi-agent systems. Our work demonstrates that aligning agent architecture with a domain's intrinsic task structure is a critical step toward building robust and reliable specialized autonomous systems.
comment: Page: https://earth-insights.github.io/EarthAgent
Real Noise Decoupling for Hyperspectral Image Denoising
Hyperspectral image (HSI) denoising is a crucial step in enhancing the quality of HSIs. Noise modeling methods can fit noise distributions to generate synthetic HSIs to train denoising networks. However, the noise in captured HSIs is usually complex and difficult to model accurately, which significantly limits the effectiveness of these approaches. In this paper, we propose a multi-stage noise-decoupling framework that decomposes complex noise into explicitly modeled and implicitly modeled components. This decoupling reduces the complexity of noise and enhances the learnability of HSI denoising methods when applied to real paired data. Specifically, for explicitly modeled noise, we utilize an existing noise model to generate paired data for pre-training a denoising network, equipping it with prior knowledge to handle the explicitly modeled noise effectively. For implicitly modeled noise, we introduce a high-frequency wavelet guided network. Leveraging the prior knowledge from the pre-trained module, this network adaptively extracts high-frequency features to target and remove the implicitly modeled noise from real paired HSIs. Furthermore, to effectively eliminate all noise components and mitigate error accumulation across stages, a multi-stage learning strategy, comprising separate pre-training and joint fine-tuning, is employed to optimize the entire framework. Extensive experiments on public and our captured datasets demonstrate that our proposed framework outperforms state-of-the-art methods, effectively handling complex real-world noise and significantly enhancing HSI quality.
PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention
We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods suffer from suboptimal camera motion injection strategies; such suboptimal designs not only limit camera control precision but also result in generated videos that fail to preserve fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module. It integrates two distinct forms of control signals: the 6-DoF camera poses and the 2D rendered video frames. By fusing them into a unified representation within a shared feature space, our model can extract underlying motion cues, which enhances both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/
Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition
Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains challenging due to visual noise and the scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary multimodal cues effectively. To overcome these limitations, we first introduce INTSD, a large-scale dataset comprising street-level night-time images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for night-time sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Second, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net are publicly available for research.
Investigating self-supervised representations for audio-visual deepfake detection
Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.
FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle
Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce $\textbf{FireScope-Bench}$, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose $\textbf{FireScope}$, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, $\textbf{FireScope}$ achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that $\textbf{FireScope-Bench}$ has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.
Exploring the added value of pretherapeutic MR descriptors in predicting breast cancer pathologic complete response to neoadjuvant chemotherapy
Objectives: To evaluate the association between pretreatment MRI descriptors and breast cancer (BC) pathological complete response (pCR) to neoadjuvant chemotherapy (NAC). Materials & Methods: Patients with BC treated by NAC with a breast MRI between 2016 and 2020 were included in this retrospective observational single-center study. MR studies were described using the standardized BI-RADS and breast edema score on T2-weighted MRI. Univariable and multivariable logistic regression analyses were performed to assess the association of variables with pCR according to residual cancer burden. Random forest classifiers were trained to predict pCR on a random split including 70% of the database and were validated on the remaining cases. Results: Among 129 BC, 59 (46%) achieved pCR after NAC (luminal (n=7/37, 19%), triple negative (TN) (n=30/55, 55%), HER2+ (n=22/37, 59%)). Clinical and biological items associated with pCR were BC subtype (p<0.001), T stage 0/I/II (p=0.008), higher Ki67 (p=0.005) and higher tumor-infiltrating lymphocytes levels (p=0.016). Univariate analysis showed that the following MRI features were significantly associated with pCR: oval or round shape (p=0.047), unifocality (p=0.026), non-spiculated margins (p=0.018), no associated non-mass enhancement (NME) (p=0.024) and a lower MRI size (p=0.031). Unifocality and non-spiculated margins remained independently associated with pCR at multivariable analysis. Adding significant MRI features to clinicobiological variables in random forest classifiers significantly increased sensitivity (0.67 versus 0.62), specificity (0.69 versus 0.67) and precision (0.71 versus 0.67) for pCR prediction. Conclusion: Non-spiculated margins and unifocality are independently associated with pCR and can increase model performance in predicting BC response to NAC.
Clinical Relevance Statement: A multimodal approach integrating pretreatment MRI features with clinicobiological predictors, including TILs, could be employed to develop machine learning models for identifying patients at risk of non-response. This may enable consideration of alternative therapeutic strategies to optimize treatment outcomes.
UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network WACV 2026
The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused inference-blackbox setting. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.
comment: Project page: https://dotrannhattuong.github.io/UIStyler, Accepted to WACV 2026
DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving AAAI 2026
Unlike discriminative approaches in autonomous driving that predict a fixed set of candidate trajectories of the ego vehicle, generative methods, such as diffusion models, learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, since these methods typically rely on denoising human-crafted trajectory anchors or random noise, there remains significant room for improvement. In this paper, we propose DiffRefiner, a novel two-stage trajectory prediction framework. The first stage uses a transformer-based Proposal Decoder to generate coarse trajectory predictions by regressing from sensor inputs using predefined trajectory anchors. The second stage applies a Diffusion Refiner that iteratively denoises and refines these initial predictions. In this way, we enhance the performance of diffusion-based planning by incorporating a discriminative trajectory proposal module, which provides strong guidance for the generative refinement process. Furthermore, we design a fine-grained denoising decoder to enhance scene compliance, enabling more accurate trajectory prediction through enhanced alignment with the surrounding environment. Experimental results demonstrate that DiffRefiner achieves state-of-the-art performance, attaining 87.4 EPDMS on NAVSIM v2, and 87.1 DS along with 71.4 SR on Bench2Drive, thereby setting new records on both public benchmarks. The effectiveness of each component is validated via ablation studies as well.
comment: Accepted to AAAI 2026
A lightweight detector for real-time detection of remote sensing images
Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.
Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation
Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, albeit at the cost of slightly more false positives. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.
comment: 5 pages, 2 figures, 2 tables
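The under-weighting of small lesions described in the entry above can be seen in a toy example. The sketch below is not CC-DiceCE or blob loss; it only contrasts a global Dice score with a simplified per-lesion recall on a 2D binary mask. The helper names (`label_components`, `global_dice`, `per_lesion_recall`) and the choice of 4-connectivity are assumptions made for illustration.

```python
from collections import deque

import numpy as np


def label_components(mask):
    """Label 4-connected components of a 2D boolean mask with a BFS flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # pixel already belongs to a labelled lesion
        count += 1
        labels[start] = count
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                if inside and mask[nr, nc] and not labels[nr, nc]:
                    labels[nr, nc] = count
                    queue.append((nr, nc))
    return labels, count


def global_dice(pred, target, eps=1e-8):
    """Standard Dice over the whole image; large lesions dominate the score."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)


def per_lesion_recall(pred, target):
    """Average, over ground-truth lesions, of the fraction of each lesion covered.

    Every lesion counts equally regardless of its size (illustrative metric only).
    """
    labels, count = label_components(target)
    if count == 0:
        return float("nan")
    recalls = [pred[labels == k].mean() for k in range(1, count + 1)]
    return float(np.mean(recalls))


# Toy case: one 36-pixel lesion that is found, one 1-pixel lesion that is missed.
target = np.zeros((10, 10), dtype=bool)
target[0:6, 0:6] = True
target[8, 8] = True
pred = np.zeros((10, 10), dtype=bool)
pred[0:6, 0:6] = True

print(global_dice(pred, target), per_lesion_recall(pred, target))
```

In this toy case the global Dice stays close to 1 although one of the two lesions is missed entirely, while the per-lesion average drops to one half, which is the kind of signal an instance-wise objective is meant to expose.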
One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution
Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.
A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs
Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations. First, we address the fundamental problem of quantization-induced performance degradation. We propose a Dynamic Range-Aware Quantization (DRAQ) method that uses statistically-calibrated activation clipping and a novel weight regularization scheme to counteract the effects of extreme data outliers and large dynamic ranges, successfully creating a high-fidelity 8-bit integer model. Second, building on this robust foundation, we introduce two hardware-aware optimization techniques tailored for FPGAs. A progressive mixed-precision search algorithm exploits FPGA flexibility to assign optimal, non-uniform bit-widths to each layer, minimizing complexity while preserving performance. Concurrently, a channel pruning method, adapted to work with the Generalized Divisive Normalization (GDN) layers common in LIC, removes model redundancy by eliminating inactive channels. Our comprehensive experiments show that the foundational DRAQ method reduces the BD-rate overhead of a GDN-based model from $30\%$ to $6.3\%$. The subsequent hardware-aware optimizations further reduce computational complexity by over $20\%$ with negligible impact on RD performance, yielding a final model that is both state-of-the-art in efficiency and superior in quality to existing FPGA-based LIC implementations.
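The effect of clipping outliers before 8-bit quantization, which the entry above motivates, can be sketched generically. This is not the DRAQ method: the percentile-based calibration, the symmetric int8 range, and the function names `quantize_int8` and `dequantize` are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_int8(x, clip_percentile=99.9):
    """Symmetric int8 quantization with a percentile-calibrated clipping range.

    Values whose magnitude exceeds the chosen percentile saturate at +/-127,
    so a few extreme outliers do not stretch the quantization step.
    """
    clip = np.percentile(np.abs(x), clip_percentile)
    scale = clip / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to floating point."""
    return q.astype(np.float64) * scale


# 1000 well-behaved activations plus a single extreme outlier.
x = np.concatenate([np.linspace(-1.0, 1.0, 1000), [100.0]])
q, scale = quantize_int8(x)
inlier_error = np.max(np.abs(dequantize(q[:-1], scale) - x[:-1]))
print(scale, inlier_error)
```

With clipping, the step size is set by the bulk of the values, so the inliers keep a rounding error of at most half a step; scaling by the raw maximum instead would make the step roughly a hundred times coarser for this input.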
Off the Planckian Locus: Using 2D Chromaticity to Improve In-Camera Color
Traditional in-camera colorimetric mapping relies on correlated color temperature (CCT)-based interpolation between pre-calibrated transforms optimized for Planckian illuminants such as CIE A and D65. However, modern lighting technologies such as LEDs can deviate substantially from the Planckian locus, exposing the limitations of relying on conventional one-dimensional CCT for illumination characterization. This paper demonstrates that transitioning from 1D CCT (on the Planckian locus) to a 2D chromaticity space (off the Planckian locus) improves colorimetric accuracy across various mapping approaches. In addition, we replace conventional CCT interpolation with a lightweight multi-layer perceptron (MLP) that leverages 2D chromaticity features for robust colorimetric mapping under non-Planckian illuminants. A lightbox-based calibration procedure incorporating representative LED sources is used to train our MLP. Validated across diverse LED lighting, our method reduces angular reproduction error by 22% on average in LED-lit scenes, maintains backward compatibility with traditional illuminants, accommodates multi-illuminant scenes, and supports real-time in-camera deployment with negligible additional computational cost.
comment: Project page: https://cst-mlp.github.io
OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation
The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib's PSFs, which is assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.
comment: The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2
PEGS: Physics-Event Enhanced Large Spatiotemporal Motion Reconstruction via 3D Gaussian Splatting
Reconstruction of rigid motion over large spatiotemporal scales remains a challenging task due to limitations in modeling paradigms, severe motion blur, and insufficient physical consistency. In this work, we propose PEGS, a framework that integrates Physical priors with Event stream enhancement within a 3D Gaussian Splatting pipeline to perform deblurred target-focused modeling and motion recovery. We introduce a cohesive triple-level supervision scheme that enforces physical plausibility via an acceleration constraint, leverages event streams for high-temporal resolution guidance, and employs a Kalman regularizer to fuse multi-source observations. Furthermore, we design a motion-aware simulated annealing strategy that adaptively schedules the training process based on real-time kinematic states. We also contribute the first RGB-Event paired dataset targeting natural, fast rigid motion across diverse scenarios. Experiments show PEGS's superior performance in reconstructing motion over large spatiotemporal scales compared to mainstream dynamic methods.
ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better
Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a $2.3\%$ improvement on MathVista within MIMO-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.
comment: 16 pages
Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs ACM MM 2024
Visual emotion recognition (VER) is a longstanding field that has garnered increasing attention with the advancement of deep neural networks. Although recent studies have achieved notable improvements by leveraging the knowledge embedded within pre-trained visual models, the lack of direct association between factual-level features and emotional categories, called the "affective gap", limits the applicability of pre-training knowledge for VER tasks. In contrast, the explicit emotional expression and high information density in textual modality eliminate the "affective gap". Therefore, we propose borrowing the knowledge from the pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data, and propose Partitioned Adaptive Contrastive Learning (PACL) to leverage these connections. Specifically, we manage to separate different types of samples and devise distinct contrastive learning strategies for each type. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples. Through comprehensive experiments, we demonstrate that bridging the "affective gap" significantly improves the performance of various pre-trained visual models in downstream emotion-related tasks. Our code is released at https://github.com/wdqqdw/PACL.
comment: Accepted by ACM MM 2024
Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models
Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.
SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting
Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address these limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. Then we compress 3D Gaussians into planar Gaussians to facilitate accurate estimation of normal and depth. The planar Gaussians are optimized in a coarse-to-fine manner through depth smooth regularization and few-shot diffusion. Moreover, we introduce a part segmentation probability for each Gaussian primitive and update them by back-projecting part segmentation masks of renderings. Extensive experimental results demonstrate that our method achieves higher-fidelity part-level surface reconstruction on both synthetic and real-world data than existing methods. Code will be made publicly available.
comment: 10 pages, 7 figures
Spanning Tree Autoregressive Visual Generation
We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation for bidirectional context either suffer from a decline in performance or compromise the flexibility in sequence order choice at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled in a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, allowing us to efficiently construct a spanning tree whose traversal order ensures that the connected partial observation of the image appears as a prefix in the sequence through rejection sampling. Through the tailored yet structured randomized strategy compared to random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance without any significant changes to the model architecture widely adopted in the language AR modeling.
comment: Preprint; Under review
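The idea of ordering image patches by traversing a random spanning tree of the patch lattice, mentioned in the entry above, can be illustrated with a small sketch. It samples a uniform spanning tree with Wilson's algorithm (a standard technique swapped in here, not confirmed as the authors' sampler) and reads the patches off in breadth-first order. The rejection-sampling step and any center-bias prior from the abstract are not implemented, and all function names are hypothetical.

```python
import random
from collections import deque


def neighbors(cell, height, width):
    """Yield the 4-connected lattice neighbours of a patch position."""
    r, c = cell
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < height and 0 <= nc < width:
            yield (nr, nc)


def uniform_spanning_tree(height, width, root, rng):
    """Wilson's algorithm: loop-erased random walks yield a uniform spanning tree.

    Returns a dict mapping every cell to its parent (the root maps to None).
    """
    parent = {root: None}
    for start in ((r, c) for r in range(height) for c in range(width)):
        if start in parent:
            continue
        # Random walk until the tree is hit, remembering only the last exit
        # taken from each cell; this implicitly erases loops.
        last_exit = {}
        cell = start
        while cell not in parent:
            last_exit[cell] = rng.choice(list(neighbors(cell, height, width)))
            cell = last_exit[cell]
        # Retrace the loop-erased path and attach it to the tree.
        cell = start
        while cell not in parent:
            parent[cell] = last_exit[cell]
            cell = last_exit[cell]
    return parent


def bfs_order(parent, root):
    """Breadth-first traversal of the tree; every prefix is a connected region."""
    children = {}
    for cell, p in parent.items():
        if p is not None:
            children.setdefault(p, []).append(cell)
    order, queue = [], deque([root])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        queue.extend(children.get(cell, []))
    return order


rng = random.Random(0)
tree = uniform_spanning_tree(4, 4, root=(1, 1), rng=rng)
print(bfs_order(tree, (1, 1)))
```

Because each patch appears after its tree parent, any prefix of the resulting sequence covers a connected set of patches around the root, which is the property that makes such orders compatible with completing a partially observed image.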
Diversity Has Always Been There in Your Visual Autoregressive Models
Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only negligible impact on performance. Our code will be publicly released at https://github.com/wangtong627/DiverseVAR.
ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion WACV 2026
Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
comment: 16 pages, 12 figures, 7 tables; Accepted by WACV 2026
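Spherical linear interpolation, which the entry above uses as fallback guidance, has a standard closed form. The sketch below applies it to generic 1D vectors; what ReBrain actually interpolates (slices, latents, or something else) is not stated in the abstract, so the inputs here are purely illustrative and the function name `slerp` is an assumption.

```python
import numpy as np


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between vectors a and b at fraction t.

    Uses slerp(a, b, t) = sin((1 - t) * theta) / sin(theta) * a
                        + sin(t * theta) / sin(theta) * b,
    where theta is the angle between a and b. Falls back to ordinary linear
    interpolation when sin(theta) is close to zero (nearly parallel or
    nearly opposite vectors).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.sin(theta) < eps:
        return (1.0 - t) * a + t * b
    w_a = np.sin((1.0 - t) * theta) / np.sin(theta)
    w_b = np.sin(t * theta) / np.sin(theta)
    return w_a * a + w_b * b


# Halfway between two orthogonal unit vectors stays on the unit circle.
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Unlike a straight linear blend, the interpolated vector keeps unit norm when both endpoints have unit norm, which is the usual reason for preferring slerp when interpolating normalized features or noise vectors.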
REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Project Site: https://sites.google.com/view/reartgs2/home.
comment: 10 pages, 7 figures
RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion
Recent point cloud completion models, including transformer-based, denoising-based, and other state-of-the-art approaches, generate globally plausible shapes from partial inputs but often leave local geometric inconsistencies. We propose RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder. The autoencoder encodes completions into compact global feature vectors (GFVs), which are selectively adjusted by an RL agent to improve geometric fidelity. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output, retaining the better reconstruction. When ground truth is available, both Chamfer Distance and geometric consistency metrics guide refinement. Training is performed separately per category, since the unsupervised and dynamic nature of RL makes convergence across highly diverse categories challenging. Nevertheless, the framework can be extended to multi-category refinement in future work. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonably under their training-style cropping, they struggle in random cropping scenarios. In contrast, RL-AD-Net consistently delivers improvements across both settings, highlighting the effectiveness of RL-guided ensemble refinement. The approach is lightweight, modular, and model-agnostic, making it applicable to a wide range of completion networks without requiring retraining.
OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding AAAI 2026
LVLMs have been shown to perform excellently in image-level tasks such as VQA and captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output a fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model's tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks, and the experimental results demonstrate that the proposed method outperforms previous methods.
comment: AAAI 2026
PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning
Analyzing whole-slide images (WSIs) requires an iterative, evidence-driven reasoning process that parallels how pathologists dynamically zoom, refocus, and self-correct while collecting the evidence. However, existing computational pipelines often lack this explicit reasoning trajectory, resulting in inherently opaque and unjustifiable predictions. To bridge this gap, we present PathAgent, a training-free, large language model (LLM)-based agent framework that emulates the reflective, stepwise analytical approach of human experts. PathAgent can autonomously explore WSI, iteratively and precisely locating significant micro-regions using the Navigator module, extracting morphology visual cues using the Perceptor, and integrating these findings into the continuously evolving natural language trajectories in the Executor. The entire sequence of observations and decisions forms an explicit chain-of-thought, yielding fully interpretable predictions. Evaluated across five challenging datasets, PathAgent exhibits strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question-answering tasks. Moreover, a collaborative evaluation with human pathologists confirms PathAgent's promise as a transparent and clinically grounded diagnostic assistant.
comment: 11 pages, 6 figures
RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation
In this paper, we propose RoomPlanner, the first fully automatic 3D room generation framework for painlessly creating realistic indoor scenes with only short text as input. Without any manual layout design or panoramic image guidance, our framework can generate explicit layout criteria for rational spatial placement. We begin by introducing a hierarchical structure of language-driven agent planners that can automatically parse short and ambiguous prompts into detailed scene descriptions. These descriptions include raw spatial and semantic attributes for each object and the background, which are then used to initialize 3D point clouds. To position objects within bounded environments, we implement two arrangement constraints that iteratively optimize spatial arrangements, ensuring a collision-free and accessible layout solution. In the final rendering stage, we propose a novel AnyReach Sampling strategy for camera trajectory, along with the Interval Timestep Flow Sampling (ITFS) strategy, to efficiently optimize the coarse 3D Gaussian scene representation. These approaches help reduce the total generation time to under 30 minutes. Extensive experiments demonstrate that our method can produce geometrically rational 3D indoor scenes, surpassing prior approaches in both rendering speed and visual quality while preserving editability. The code will be available soon.
RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis AAAI 2026
We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision
comment: Accepted to AAAI 2026 (Oral)
MedImageInsight for Thoracic Cavity Health Classification from Chest X-rays
Chest radiography remains one of the most widely used imaging modalities for thoracic diagnosis, yet increasing imaging volumes and radiologist workload continue to challenge timely interpretation. In this work, we investigate the use of MedImageInsight, a medical imaging foundational model, for automated binary classification of chest X-rays into Normal and Abnormal categories. Two approaches were evaluated: (1) fine-tuning MedImageInsight for end-to-end classification, and (2) employing the model as a feature extractor for a transfer learning pipeline using traditional machine learning classifiers. Experiments were conducted using a combination of the ChestX-ray14 dataset and real-world clinical data sourced from partner hospitals. The fine-tuned classifier achieved the highest performance, with an ROC-AUC of 0.888 and superior calibration compared to the transfer learning models, demonstrating performance comparable to established architectures such as CheXNet. These results highlight the effectiveness of foundational medical imaging models in reducing task-specific training requirements while maintaining diagnostic reliability. The system is designed for integration into web-based and hospital PACS workflows to support triage and reduce radiologist burden. Future work will extend the model to multi-label pathology classification to provide preliminary diagnostic interpretation in clinical environments.
comment: 9 pages, 5 figures and 3 tables
Do Vision-Language Models Understand Visual Persuasiveness? NeurIPS 2025
Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet, whether these models truly grasp visual persuasion, that is, how visual cues shape human attitudes and decisions, remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs' core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.
comment: 8 pages (except for reference and appendix), 5 figures, 7 tables, to be published in NeurIPS 2025 Workshop: VLM4RWD
Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation
The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
comment: Accepted at EurIPS 2025 workshop "Rethinking AI: Efficiency, Frugality, and Sustainability"
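As a rough illustration of the kind of scaling-law fit described in the abstract above, the sketch below fits a power law E = a * FLOPs^b by ordinary least squares in log-log space. The abstract does not give the paper's functional form or fitting procedure, so the single-term power law, the function names, and any numbers fed into it are assumptions made for illustration, not the authors' method or data.

```python
import math


def fit_power_law(flops, energy):
    """Fit energy = a * flops**b via least squares on log-transformed data.

    Hypothetical helper: assumes strictly positive inputs and at least two
    distinct FLOP counts. Returns the pair (a, b).
    """
    xs = [math.log(f) for f in flops]
    ys = [math.log(e) for e in energy]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space (the exponent)
    a = math.exp(mean_y - b * mean_x)    # intercept mapped back to linear scale
    return a, b


def predict_energy(a, b, flops):
    """Predict energy for a new FLOP count under the fitted power law."""
    return a * flops ** b
```

In use, one would pass measured (FLOPs, joules) pairs for a given GPU and then call `predict_energy` for an unseen configuration; an exponent close to 1 would be consistent with the compute-bound behaviour the abstract reports.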
Parameter-Free Neural Lens Blur Rendering for High-Fidelity Composites
Consistent and natural camera lens blur is important for seamlessly blending 3D virtual objects into photographed real scenes. Since lens blur typically varies with scene depth, the placement of virtual objects and their corresponding blur levels significantly affect the visual fidelity of mixed reality compositions. Existing pipelines often rely on camera parameters (e.g., focal length, focus distance, aperture size) and scene depth to compute the circle of confusion (CoC) for realistic lens blur rendering. However, such information is often unavailable to ordinary users, limiting the accessibility and generalizability of these methods. In this work, we propose a novel compositing approach that directly estimates the CoC map from RGB images, bypassing the need for scene depth or camera metadata. The CoC values for virtual objects are inferred through a linear relationship between the signed CoC map and depth, and realistic lens blur is rendered using a neural reblurring network. Our method provides a flexible and practical solution for real-world applications. Experimental results demonstrate that our method achieves high-fidelity compositing with realistic defocus effects, outperforming state-of-the-art techniques in both qualitative and quantitative evaluations.
comment: Accepted by ISMAR 2025 with oral presentation. 10 pages, 11 figures
FLUID: Training-Free Face De-identification via Latent Identity Substitution
We present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a training-free framework that directly substitutes identity in the latent space of pretrained diffusion models. Inspired by substitution mechanisms in chemistry, we reinterpret identity editing as semantic displacement in the latent h-space of a pretrained unconditional diffusion model. Our framework discovers identity-editing directions through optimization guided by novel reagent losses, which supervise for attribute preservation and identity suppression. We further propose both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experimental results on CelebA-HQ and FFHQ demonstrate that FLUID achieves a superior trade-off between identity suppression and attribute preservation, outperforming state-of-the-art de-identification methods in both qualitative and quantitative metrics.
Vision Language Models are Confused Tourists
Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions
Reliable visual perception under adverse weather conditions, such as rain, haze, snow, or a mixture of them, is desirable yet challenging for autonomous driving and outdoor robots. In this paper, we propose a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that restores images from different degradation levels under various weather conditions. MVLR couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB). The VLM performs chain-of-thought inference to encode weather degradation priors and the IMB stores continuous latent representations of degradation patterns. The VLM-generated priors query the IMB to retrieve fine-grained degradation prototypes. These prototypes are then adaptively fused with multi-scale visual features via dynamic cross-attention mechanisms, enhancing restoration accuracy while maintaining computational efficiency. Extensive experiments on four severe-weather benchmarks show that MVLR surpasses single-branch and Mixture-of-Experts baselines in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These results indicate that MVLR offers a practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.
DepthFocus: Controllable Depth Estimation for See-Through Scenes
Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.
comment: 8 pages, 6 figures, 5 tables
DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction
Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods--including those trained on multimodal image-text data--while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
comment: 8 pages
RadioKMoE: Knowledge-Guided Radiomap Estimation with Kolmogorov-Arnold Networks and Mixture-of-Experts
Radiomap serves as a vital tool for wireless network management and deployment by providing powerful spatial knowledge of signal propagation and coverage. However, increasingly complex radio propagation behavior and surrounding environments pose strong challenges for radiomap estimation (RME). In this work, we propose a knowledge-guided RME framework that integrates Kolmogorov-Arnold Networks (KAN) with Mixture-of-Experts (MoE), namely RadioKMoE. Specifically, we design a KAN module to predict an initial coarse coverage map, leveraging KAN's strength in approximating physics models and global radio propagation patterns. The initial coarse map, together with environmental information, drives our MoE network for precise radiomap estimation. Unlike conventional deep learning models, the MoE module comprises expert networks specializing in distinct radiomap patterns to improve local details while preserving global consistency. Experimental results in both multi- and single-band RME demonstrate the enhanced accuracy and robustness of the proposed RadioKMoE in radiomap estimation.
A Diversity-optimized Deep Ensemble Approach for Accurate Plant Leaf Disease Detection
Plant diseases pose a significant threat to global agriculture, causing over $220 billion in annual economic losses and jeopardizing food security. The timely and accurate detection of these diseases from plant leaf images is critical to mitigating their adverse effects. Deep neural network Ensembles (Deep Ensembles) have emerged as a powerful approach to enhancing prediction accuracy by leveraging the strengths of diverse Deep Neural Networks (DNNs). However, selecting high-performing ensemble member models is challenging due to the inherent difficulty in measuring ensemble diversity. In this paper, we introduce the Synergistic Diversity (SQ) framework to enhance plant disease detection accuracy. First, we conduct a comprehensive analysis of the limitations of existing ensemble diversity metrics (denoted as Q metrics), which often fail to identify optimal ensemble teams. Second, we present the SQ metric, a novel measure that captures the synergy between ensemble members and consistently aligns with ensemble accuracy. Third, we validate our SQ approach through extensive experiments on a plant leaf image dataset, which demonstrates that our SQ metric substantially improves ensemble selection and enhances detection accuracy. Our findings pave the way for a more reliable and efficient image-based plant disease detection.
Gradient-Driven Natural Selection for Compact 3D Gaussian Splatting
3DGS employs a large number of Gaussian primitives to fit scenes, resulting in substantial storage and computational overhead. Existing pruning methods rely on manually designed criteria or introduce additional learnable parameters, yielding suboptimal results. To address this, we propose a natural-selection-inspired pruning framework that models survival pressure as a regularization gradient field applied to opacity, allowing the optimization gradients, driven by the goal of maximizing rendering quality, to autonomously determine which Gaussians to retain or prune. This process is fully learnable and requires no human intervention. We further introduce an opacity decay technique with a finite opacity prior, which accelerates the selection process without compromising pruning effectiveness. Compared to 3DGS, our method achieves over 0.6 dB PSNR gain under 15% budgets, establishing state-of-the-art performance for compact 3DGS. Project page: https://xiaobin2001.github.io/GNS-web.
ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.
YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection
This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
comment: 1 figure, 1 table
MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features
Depression is a prevalent global mental health disorder, characterised by persistent low mood and anhedonia. However, it remains underdiagnosed because current diagnostic methods depend heavily on subjective clinical assessments. To enable objective detection, we introduce a gold standard dataset of 103 clinically assessed participants collected through a tripartite data approach which uniquely integrated eye tracking data with audio and video to give a comprehensive representation of depressive symptoms. Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary classification, the model achieved a sensitivity of 0.96 and F2 score of 0.94. For the 3-class classification task, the proposed method achieved a sensitivity of 0.79 and specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.
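The F2 score reported in the abstract above is the F-beta measure with beta = 2, which weights recall (sensitivity) more heavily than precision. A minimal sketch of the standard formula follows; the function name and the example values in the usage note are illustrative and are not taken from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).

    With beta = 2 (the F2 score), recall counts four times as much as
    precision in the weighted harmonic mean. Returns 0.0 when both inputs
    are zero to avoid a division by zero.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

For example, a hypothetical classifier with precision 1.0 and recall 0.5 scores about 0.56 on F2 but about 0.67 on F1, which shows why F2 suits screening settings where missed cases are costlier than false alarms.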
Topology Aware Neural Interpolation of Scalar Fields
This paper presents a neural scheme for the topology-aware interpolation of time-varying scalar fields. Given a time-varying sequence of persistence diagrams, along with a sparse temporal sampling of the corresponding scalar fields, denoted as keyframes, our interpolation approach aims at "inverting" the non-keyframe diagrams to produce plausible estimations of the corresponding, missing data. For this, we rely on a neural architecture which learns the relation from a time value to the corresponding scalar field, based on the keyframe examples, and reliably extends this relation to the non-keyframe time steps. We show how augmenting this architecture with specific topological losses exploiting the input diagrams improves both the geometrical and topological reconstruction of the non-keyframe time steps. At query time, given an input time value for which an interpolation is desired, our approach instantaneously produces an output, via a single propagation of the time input through the network. Experiments interpolating 2D and 3D time-varying datasets show our approach's superiority, both in terms of data and topological fitting, with regard to reference interpolation schemes. Our implementation is available at this GitHub link: https://github.com/MohamedKISSI/Topology-Aware-Neural-Interpolation-of-Scalar-Fields.git.
Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction
Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
Automated Interpretable 2D Video Extraction from 3D Echocardiography
Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at https://github.com/echonet/3d-echo.
comment: 12 pages, 5 figures
TrackGS: Optimizing COLMAP-Free 3D Gaussian Splatting with Global Track Constraints
We present TrackGS, a novel method to integrate global feature tracks with 3D Gaussian Splatting (3DGS) for COLMAP-free novel view synthesis. While 3DGS delivers impressive rendering quality, its reliance on accurate precomputed camera parameters remains a significant limitation. Existing COLMAP-free approaches depend on local constraints that fail in complex scenarios. Our key innovation lies in leveraging feature tracks to establish global geometric constraints, enabling simultaneous optimization of camera parameters and 3D Gaussians. Specifically, we: (1) introduce track-constrained Gaussians that serve as geometric anchors, (2) propose novel 2D and 3D track losses to enforce multi-view consistency, and (3) derive differentiable formulations for camera intrinsics optimization. Extensive experiments on challenging real-world and synthetic datasets demonstrate state-of-the-art performance, with much lower pose error than previous methods while maintaining superior rendering quality. Our approach eliminates the need for COLMAP preprocessing, making 3DGS more accessible for practical applications.
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by +20.5 and +5.6 points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.
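The inference-time "hard gate" in step (iii) of the abstract above can be pictured as a plain Top-n selection over per-token relevance scores. The sketch below is a simplified illustration under stated assumptions: the scores are taken as given (in QTSplus they come from cross-attention), the budget `n` is passed in rather than predicted, and the training-time straight-through estimator is not shown. The function name `select_top_n` is hypothetical.

```python
def select_top_n(scores, n):
    """Return indices of the n highest-scoring tokens, in original (temporal) order.

    Assumptions for illustration: `scores` holds one relevance score per
    visual token, and `n` is clamped to the range [0, len(scores)].
    """
    n = max(0, min(n, len(scores)))
    # Rank token indices by score, highest first; ties keep the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept indices so the surviving tokens preserve temporal order.
    return sorted(ranked[:n])
```

Returning indices in ascending order keeps the retained tokens in their original sequence, which matters when a later stage relies on temporal position.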
A Training-Free Style-Personalization via SVD-Based Feature Decomposition
We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
comment: 21 pages, 14 figures
A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection
Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxel-based representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxel-wise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformer-based and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets. Our code will be made publicly available.
comment: Under review
Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model
Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8$\times$ fewer FLOPs (floating-point operations), 6.8$\times$ lower GPU memory consumption, and 14$\times$ faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.
comment: Accepted by Computers in Biology and Medicine (CIBM)
HazeMatching: Dehazing Light Microscopy Images with Guided Conditional Flow Matching
Fluorescence microscopy is a major driver of scientific progress in the life sciences. Although high-end confocal microscopes are capable of filtering out-of-focus light, cheaper and more accessible microscopy modalities, such as widefield microscopy, cannot, which consequently leads to hazy image data. Computational dehazing aims to combine the best of both worlds, leading to cheap microscopy but crisp-looking images. The perception-distortion trade-off tells us that we can optimize either for data fidelity, e.g. low MSE or high PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID. Existing methods either prioritize fidelity at the expense of realism, or produce perceptually convincing results that lack quantitative accuracy. In this work, we propose HazeMatching, a novel iterative method for dehazing light microscopy images, which effectively balances these objectives. Our goal was to find a balanced trade-off between the fidelity of the dehazing results and the realism of individual predictions (samples). We achieve this by adapting the conditional flow matching framework, guiding the generative process with a hazy observation in the conditional velocity field. We evaluate HazeMatching on 5 datasets, covering both synthetic and real data, assessing both distortion and perceptual quality. Our method is compared against 11 baselines, achieving a consistent balance between fidelity and realism on average. Additionally, with calibration analysis, we show that HazeMatching produces well-calibrated predictions. Note that our method does not need an explicit degradation operator to exist, making it easily applicable to real microscopy data. All data used for training and evaluation and our code will be publicly available under a permissive license.
comment: 4 figures, 8 pages + refs, 45 pages total (including supplement), 28 supplementary figures
ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
comment: 5 pages, 4 figures
Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition NeurIPS 2025
Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature -- intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets -- KTH, Penn Action, HAA500, and UCF-101 -- demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
comment: NeurIPS 2025 Spotlight paper. Project page: https://jong980812.github.io/DANCE/
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
We present OpenDriveVLA, a Vision Language Action model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions by leveraging multimodal inputs, including 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent environment ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question answering tasks. Qualitative analyses further illustrate its capability to follow high-level driving commands and generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
comment: Code will be released upon acceptance
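The cost difference between quadratic and linear attention that motivates LinVideo comes from reordering one matrix product. The sketch below is a generic kernelized linear attention, not the LinVideo method: the feature map `phi` (elu(x) + 1) and both function names are assumptions chosen for illustration. It shows that the two orderings give the same output while the second never forms the n-by-n similarity matrix.

```python
import math

def phi(x):
    # elu(x) + 1 keeps features positive; a common choice, assumed here.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def kernel_attention_quadratic(Q, K, V):
    # Builds all n*n similarities explicitly: O(n^2 d).
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

def kernel_attention_linear(Q, K, V):
    # Summarises keys and values once, then reuses the summary: O(n d^2).
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * d                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for c in range(dv):
                S[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / denom
                    for c in range(dv)])
    return out

Q = [[0.5, -1.0], [1.0, 2.0]]
K = [[1.0, 0.0], [-0.5, 0.5], [0.2, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
quad = kernel_attention_quadratic(Q, K, V)
lin = kernel_attention_linear(Q, K, V)
```

Note that this kernelized form is not numerically equal to softmax attention, which is one reason swapping it into a pretrained model needs a recovery procedure such as the post-training described in the abstract.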
One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
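The pixel-shuffle heads mentioned above rely on a simple rearrangement: r*r channels at each location are unfolded into an r-by-r spatial patch. The sketch below shows that rearrangement on nested lists; it is a generic illustration, not LUA code, and it assumes the ordering in which the r*r sub-pixel offsets are stored as consecutive channels.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):          # row offset inside the r x r patch
            for j in range(r):      # column offset inside the r x r patch
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out

latent = [[[float(k)]] for k in range(4)]  # 4 channels, 1x1 spatial
up = pixel_shuffle(latent, 2)
print(up)  # [[[0.0, 1.0], [2.0, 3.0]]]
```

Because the operation only moves values, a backbone can be shared across scales and only the head that produces the C*r*r channels needs to be scale-specific.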
MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
Personalized object detection aims to adapt a general-purpose detector to recognize user-specific instances from only a few examples. Lightweight models often struggle in this setting due to their weak semantic priors, while large vision-language models (VLMs) offer strong object-level understanding but are too computationally demanding for real-time or on-device applications. We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a distillation framework that transfers multimodal region-level knowledge from a frozen VLM teacher into a lightweight vision-only detector. MOCHA extracts fused visual and textual teacher's embeddings and uses them to guide student training through a dual-objective loss that enforces accurate local alignment and global relational consistency across regions. This process enables efficient transfer of semantics without the need for teacher modifications or textual input at inference. MOCHA consistently outperforms prior baselines across four personalized detection benchmarks under strict few-shot regimes, yielding a +10.1 average improvement, with minimal inference cost.
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code and model are available at https://github.com/Tanveer81/DocSLM.git.
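One way to picture the streaming abstention idea is a per-segment entropy check. The abstract does not specify how the calibrator works, so the following is only a sketch under assumptions: `answer_stream`, the `(answer, probs)` segment format, and the fixed `max_entropy` threshold are all hypothetical.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def answer_stream(segments, max_entropy):
    """segments: iterable of (answer, probs) pairs, one per document segment.

    Returns the answers whose predictive entropy is at most max_entropy;
    the model abstains on every other segment.
    """
    kept = []
    for answer, probs in segments:
        if entropy(probs) <= max_entropy:
            kept.append(answer)
    return kept

segments = [
    ("42", [0.97, 0.01, 0.01, 0.01]),     # confident
    ("unsure", [0.25, 0.25, 0.25, 0.25]), # maximally uncertain
    ("Paris", [0.9, 0.1]),                # fairly confident
]
print(answer_stream(segments, max_entropy=0.5))  # ['42', 'Paris']
```

Since segments are consumed one at a time, memory use does not grow with document length, which matches the motivation given for sequential processing.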
Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition
With the proliferation of social media posts in recent years, the need to detect sentiments in multimodal (image-text) content has grown rapidly. Since posts are user-generated, the image and text from the same post can express different or even contradictory sentiments, leading to potential sentiment discrepancy. However, existing works mainly adopt a single-branch fusion structure that primarily captures the consistent sentiment between image and text. Ignoring discrepant sentiment, or modeling it only implicitly, results in compromised unimodal encoding and limited performance. In this paper, we propose a semantics Completion and Decomposition (CoDe) network to resolve the above issue. In the semantics completion module, we complement image and text representations with the semantics of the in-image text, helping bridge the sentiment gap. In the semantics decomposition module, we decompose image and text representations with exclusive projection and contrastive learning, thereby explicitly capturing the discrepant sentiment between modalities. Finally, we fuse image and text representations by cross-attention and combine them with the learned discrepant sentiment for final classification. Extensive experiments on four datasets demonstrate the superiority of CoDe and the effectiveness of each proposed module.
comment: Accepted by Pattern Recognition
HPPP: Halpern-type Preconditioned Proximal Point Algorithms and Applications to Image Restoration
The degenerate preconditioned proximal point (PPP) method has recently provided a unified and flexible framework for designing and analyzing operator-splitting algorithms such as Douglas-Rachford (DR). However, the degenerate PPP method exhibits only weak convergence in the infinite-dimensional Hilbert space and lacks accelerated variants. To address these issues, we propose a Halpern-type PPP (HPPP) algorithm, which leverages the strong convergence and acceleration properties of Halpern's iteration method. Moreover, we propose a novel algorithm for image restoration by combining HPPP with denoiser priors such as the Plug-and-Play (PnP) prior, which can be viewed as an accelerated PnP method. Finally, numerical experiments including several toy examples and image restoration validate the effectiveness of our proposed algorithms.
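Halpern's iteration, on which HPPP builds, anchors every step to the starting point: x_{k+1} = lam_k * x_0 + (1 - lam_k) * T(x_k). The toy sketch below applies it to a plain 90-degree rotation rather than to a preconditioned proximal point operator, and the schedule lam_k = 1/(k+2) is a common choice assumed here, not taken from the abstract. The rotation is nonexpansive with the origin as its only fixed point, so the unanchored iteration circles forever while the anchored one contracts.

```python
import math

def rotate90(x):
    # A nonexpansive map whose only fixed point is the origin.
    return (-x[1], x[0])

def picard(T, x0, steps):
    # Plain fixed-point iteration x_{k+1} = T(x_k).
    x = x0
    for _ in range(steps):
        x = T(x)
    return x

def halpern(T, x0, steps):
    # Anchored iteration x_{k+1} = lam_k * x0 + (1 - lam_k) * T(x_k).
    x = x0
    for k in range(steps):
        lam = 1.0 / (k + 2)  # anchoring weight; assumed schedule
        tx = T(x)
        x = tuple(lam * a + (1.0 - lam) * b for a, b in zip(x0, tx))
    return x

x0 = (1.0, 0.0)
print(math.hypot(*picard(rotate90, x0, 1000)))   # stays at distance 1
print(math.hypot(*halpern(rotate90, x0, 1000)))  # shrinks towards 0
```

The anchored iterate approaches the fixed point at a rate of roughly 1/k in this example, which illustrates why Halpern-type schemes are attractive as accelerated variants.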
CLIMB-3D: Continual Learning for Imbalanced 3D Instance Segmentation BMVC 2025
While 3D instance segmentation (3DIS) has advanced significantly, most existing methods assume that all object classes are known in advance and uniformly distributed. However, this assumption is unrealistic in dynamic, real-world environments where new classes emerge gradually and exhibit natural imbalance. Although some approaches address the emergence of new classes, they often overlook class imbalance, which leads to suboptimal performance, particularly on rare categories. To tackle this, we propose CLIMB-3D, a unified framework for CLass-incremental IMBalance-aware 3DIS. Building upon established exemplar replay (ER) strategies, we show that ER alone is insufficient to achieve robust performance under memory constraints. To mitigate this, we introduce a novel pseudo-label generator (PLG) that extends supervision to previously learned categories by leveraging predictions from a frozen model trained on prior tasks. Despite its promise, PLG tends to be biased towards frequent classes. Therefore, we propose a class-balanced re-weighting (CBR) scheme that estimates object frequencies from pseudo-labels and dynamically adjusts training bias, without requiring access to past data. We design and evaluate three incremental scenarios for 3DIS on the challenging ScanNet200 dataset and additionally validate our method for semantic segmentation on ScanNetV2. Our approach achieves state-of-the-art results, surpassing prior work by up to 16.76% mAP for instance segmentation and approximately 30% mIoU for semantic segmentation, demonstrating strong generalisation across both frequent and rare classes. Code is available at: https://github.com/vgthengane/CLIMB3D
comment: Accepted at BMVC 2025
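The CLIMB-3D abstract describes re-weighting classes using frequencies estimated from pseudo-labels but does not give the formula. As a stand-in, the sketch below uses plain inverse-frequency weights normalised to mean 1; the function name `class_balanced_weights` and the `smoothing` term are assumptions for illustration only.

```python
from collections import Counter

def class_balanced_weights(pseudo_labels, smoothing=1.0):
    """Map each class seen in the pseudo-labels to a loss weight.

    Rare classes get larger weights; weights are normalised to mean 1 so
    the overall loss scale stays comparable to unweighted training.
    """
    counts = Counter(pseudo_labels)
    raw = {c: 1.0 / (n + smoothing) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

labels = ["chair"] * 8 + ["lamp"] * 2
print(class_balanced_weights(labels))  # the rare class "lamp" gets the larger weight
```

Because the counts come from the frozen model's predictions, no stored samples from earlier tasks are needed, which is the property the abstract emphasises.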
CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation BMVC 2025
Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices. Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multilayer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD), with gains of up to 10% mIoU, especially when fine-tuning on very small amounts of data, showing the effectiveness of our simple yet powerful KD strategy.
comment: Accepted to BMVC 2025
The Cooperative Network Architecture: Learning Structured Networks as Representation of Sensory Patterns
We introduce the Cooperative Network Architecture (CNA), a model that represents sensory signals using structured, recurrently connected networks of neurons, termed "nets." Nets are dynamically assembled from overlapping net fragments, which are learned based on statistical regularities in sensory input. This architecture offers robustness to noise, deformation, and generalization to out-of-distribution data, addressing challenges in current vision systems from a novel perspective. We demonstrate that net fragments can be learned without supervision and flexibly recombined to encode novel patterns, enabling figure completion and resilience to noise. Our findings establish CNA as a promising paradigm for developing neural representations that integrate local feature processing with global structure formation, providing a foundation for future research on invariant object recognition.
comment: Accepted at Neural Computation
VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame quality but rely heavily on the visual modality alone. This makes them difficult to adapt to text-related tasks and often leads to retrieval results deviating from core semantic content. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach combining Video Search and Subtitle Match to fuse complementary visual and textual information for precise localization. Experiments on LongVideoBench and VideoMME demonstrate that VSI achieves state-of-the-art accuracy in keyframe retrieval while delivering breakthrough performance in text-related tasks and exhibiting strong generalization across other tasks.
comment: 9 pages, 3 figures
Attention Via Convolutional Nearest Neighbors
The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and aggregation; convolution selects neighbors by spatial proximity, while attention selects by feature similarity, revealing they exist on a continuous spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. Crucially, ConvNN serves as a drop-in replacement for convolutional and attention layers, enabling systematic exploration of the intermediate spectrum between these two extremes. We validate the framework's coherence on CIFAR-10 and CIFAR-100 classification tasks across two complementary architectures: (1) Hybrid branching in VGG improves accuracy on both CIFAR datasets by combining spatial-proximity and feature-similarity selection; and (2) ConvNN in ViT outperforms standard attention and other attention variants on both datasets. Extensive ablations on $k$ values and architectural variants reveal that interpolating along this spectrum provides regularization benefits by balancing local and global receptive fields. Our work provides a unifying framework that dissolves the apparent distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.
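The unifying view above, that convolution and attention differ mainly in how neighbors are selected, can be shown on a one-dimensional toy signal. This sketch is not the ConvNN layer: scalar features, absolute-difference distances, and uniform averaging are simplifications assumed for illustration, and `knn_aggregate` is a hypothetical name.

```python
def knn_aggregate(features, k, mode):
    """Average each token with its k nearest neighbours.

    mode="spatial": nearest by index distance (convolution-like window)
    mode="feature": nearest by feature distance (attention-like selection)
    """
    n = len(features)
    out = []
    for i in range(n):
        if mode == "spatial":
            dist = lambda j: abs(i - j)
        elif mode == "feature":
            dist = lambda j: abs(features[i] - features[j])
        else:
            raise ValueError(mode)
        # sorted() is stable, so ties fall back to index order.
        neighbours = sorted(range(n), key=dist)[:k]
        out.append(sum(features[j] for j in neighbours) / k)
    return out

signal = [0.0, 10.0, 0.0, 10.0, 0.0, 10.0]
print(knn_aggregate(signal, 3, "spatial"))  # mixes adjacent values
print(knn_aggregate(signal, 3, "feature"))  # [0.0, 10.0, 0.0, 10.0, 0.0, 10.0]
```

On this alternating signal, spatial selection blurs neighboring values together, whereas feature selection groups like with like and returns the signal unchanged; intermediate selection rules would sit between these two behaviors.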
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining notably as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting spatial tasks heavily rely on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
EatGAN: An Edge-Attention Guided Generative Adversarial Network for Single Image Super-Resolution
Single-image super-resolution (SISR) is an important task in image processing, aiming to enhance the resolution of imaging systems. Recently, SISR has made a significant leap and achieved promising results with deep learning. GAN-based models stand out among deep learning models because of their excellent perceptual quality. However, it is rather difficult for them to reconstruct realistic high-frequency details and achieve stable training. To solve these issues, we introduce an Edge-Attention guided Generative Adversarial Network (EatGAN), the first GAN-based SISR model that simultaneously leverages edge priors both explicitly and implicitly inside the generator. EatGAN comprises (i) a Normalized Edge Attention (NEA) mechanism based on channel-affine and spatial gating that transforms the edge prior into lightweight, learnable modulation parameters; (ii) an edge-guided hybrid residual block that injects and fuses these parameters multiple times, progressively enforcing structural consistency across scales; and (iii) a composite generator objective combining pixel, perceptual, edge-gradient, and adversarial terms. Experiments show consistent state-of-the-art results across distortion-oriented and perception-oriented benchmarks. Notably, our model achieves 40.87 dB and 0.073 (LPIPS) on Manga109, which indicates that reframing image priors from passive guidance into a controllable modulation primitive for generators can chart a practical path toward trustworthy, high-fidelity super-resolution.
comment: 17 pages (8 pages of main text + 3 pages of reference + 6 pages of supplementary material)
TDSNNs: Competitive Topographic Deep Spiking Neural Networks for Visual Cortex Modeling AAAI 2026
The primate visual cortex exhibits topographic organization, where functionally similar neurons are spatially clustered, a structure widely believed to enhance neural processing efficiency. While prior works have demonstrated that conventional deep ANNs can develop topographic representations, these models largely neglect crucial temporal dynamics. This oversight often leads to significant performance degradation in tasks like object recognition and compromises their biological fidelity. To address this, we leverage spiking neural networks (SNNs), which inherently capture spike-based temporal dynamics and offer enhanced biological plausibility. We propose a novel Spatio-Temporal Constraints (STC) loss function for topographic deep spiking neural networks (TDSNNs), successfully replicating the hierarchical spatial functional organization observed in the primate visual cortex from low-level sensory input to high-level abstract representations. Our results show that STC effectively generates representative topographic features across simulated visual cortical areas. While introducing topography typically leads to significant performance degradation in ANNs, our spiking architecture exhibits a remarkably small performance drop (no drop in ImageNet top-1 accuracy, compared to a 3% drop observed in TopoNet, which is the best-performing topographic ANN so far) and outperforms topographic ANNs in brain-likeness. We also reveal that topographic organization facilitates efficient and stable temporal information processing via the spike mechanism in TDSNNs, contributing to model robustness. These findings suggest that TDSNNs offer a compelling balance between computational performance and brain-like features, providing not only a framework for interpreting neuroscience phenomena but also novel insights for designing more efficient and robust deep learning models.
comment: AAAI 2026 (Oral)
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Visual grouping -- operationalized through tasks such as instance segmentation, visual grounding, and object detection -- enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by +24-36% -- achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.
comment: Project website: https://github.com/weikaih04/Synthetic-Detection-Segmentation-Grounding-Data
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
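The dual-level criterion described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that action-level importance is smoothed with an exponential moving average and that the kept set is the union of the top-scoring tokens at each level; the function names, the smoothing factor `beta`, and the budgets `k_sem` and `k_act` are hypothetical and are not taken from the paper.

```python
def ema_update(prev, current, beta=0.8):
    """Exponentially smooth per-token action attention across timesteps (assumed scheme)."""
    return [beta * p + (1.0 - beta) * c for p, c in zip(prev, current)]


def top_k(scores, k):
    """Return the indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def select_tokens(semantic_scores, action_scores, k_sem, k_act):
    """Keep tokens that rank highly for either semantic or action importance."""
    return sorted(top_k(semantic_scores, k_sem) | top_k(action_scores, k_act))


# Hypothetical attention scores for four visual tokens.
semantic = [0.9, 0.1, 0.5, 0.2]
action = ema_update([0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0], beta=0.5)
kept = select_tokens(semantic, action, k_sem=1, k_act=1)
```

In this toy example a token that is semantically unimportant can still be kept because the smoothed action attention ranks it first, which is the failure case of semantic-only pruning that the abstract points out.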
Model Inversion Attack Against Deep Hashing
Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.
HDCompression: Hybrid-Diffusion Image Compression for Ultra-Low Bitrates PRICAI 2025
Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latent, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving index map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms the previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced robust compression performance at ultra-low bitrates.
comment: Accepted by PRICAI 2025 (Oral Presentation)
Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
Embedding models are a cornerstone of modern AI. Driven by Multimodal Large Language Models (MLLMs), they have made great progress in architecture and data curation, while the holistic paradigm is still limited to SSC, i.e., single input, singular embedding, contrastive supervision, which collapses rich, multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM capabilities. In this paper, we design a Parallel Decoupling Framework (PDF) for multimodal embedding learning, by utilizing the proprietary steerability of MLLMs, i.e., their ability to flexibly generate clearly differentiated responses under explicit instructions. Concretely, PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for one input, then relies on these paths to obtain parallel embeddings. To promote full parallel diversity, we employ Mutual Information Minimization (MIM) as an explicit constraint, coupled with per-path contrastive supervision to maintain semantic alignment. These dual objectives force PDF to yield robust semantic coverage and a generalizable embedding space. Ultimately, the resulting embedding space is accessible at inference via a single forward pass, incurring negligible computational overhead. We instantiate PDF on multiple MLLM backbones and demonstrate its effectiveness on the MMEB benchmark. Significant gains are consistently achieved across various resolutions and model sizes, e.g., boosting the VLM2Vec-LLaVA-1.6-LR model by a remarkable +8.9% (7B), and the VLM2Vec-Qwen2VL models by +4.2% (2B) and +3.1% (7B). In terms of efficiency, our 2B model surpasses its baseline by +2.6% using only half the computational budget.
POMA-3D: The Point Map Way to 3D Scene Understanding
In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
comment: 11 pages, 6 tables, 5 figures
RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond
The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.
REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints
Articulated objects are prevalent in human life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling realistic surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of arbitrary two states of articulated objects, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate that our approach achieves high-quality textured surface reconstruction for given states, and enables high-fidelity surface generation for unseen states. Project site: https://sites.google.com/view/reartgs/home.
comment: 11 pages, 6 figures
Loss-Oriented Ranking for Automated Visual Prompting in LVLMs
Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV that learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we develop an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experiments indicate that AutoV enhances the performance of various LVLMs across multiple image understanding tasks. For instance, LLaVA-OV with AutoV achieves a 10.2% accuracy gain on VizWiz, and AutoV boosts Qwen2.5-VL by 3.8% on MMMU, highlighting its potential as an optimal visual prompting method.
comment: 17 pages
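The loss-oriented labeling step in this abstract can be sketched as follows. The sketch assumes that a lower prediction loss marks a better visual prompt and that each candidate has already been scored by a pre-trained LVLM; the prompt names and loss values are made-up placeholders, and `rank_prompts` is a hypothetical helper rather than the authors' code.

```python
def rank_prompts(losses):
    """Order candidate visual prompts from lowest to highest prediction loss."""
    return sorted(losses, key=losses.get)


# Hypothetical per-prompt losses for one (image, question) pair.
losses = {"attention_heatmap": 1.2, "bounding_box": 0.7, "no_prompt": 1.5}
ranking = rank_prompts(losses)
best_prompt = ranking[0]
```

The resulting ordering is the kind of supervision signal the abstract describes: a selector can be trained to reproduce it from the image and the textual query alone.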
Composed Object Retrieval: Object-level Retrieval via Composed Expressions
Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research. We will publicly release both the dataset and the model at https://github.com/wangtong627/COR.
Performance of Conformal Prediction in Capturing Aleatoric Uncertainty WACV 2026
Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with a high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature suggests that prediction set size can upper-bound aleatoric uncertainty or that prediction sets are larger for difficult instances and smaller for easy ones, but a validation of this attribute of conformal predictors is missing. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We perform this by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide a higher coverage of the true classes, their capability in capturing aleatoric uncertainty and generating sets that align with human annotations remains limited.
comment: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2026
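For readers unfamiliar with how prediction sets arise, the following is a minimal sketch of standard split conformal prediction for classification, using the nonconformity score 1 - p(true class). The abstract does not name the three approaches it evaluates, so this is only one common variant; the calibration probabilities and labels below are invented for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set (score = 1 - p_true)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank; beyond n, every class must be included.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    return scores[rank - 1] if rank <= n else float("inf")


def prediction_set(probs, threshold):
    """All classes whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical softmax outputs and true labels for four calibration instances.
cal_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]
cal_labels = [0, 0, 1, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

ambiguous_set = prediction_set([0.5, 0.5], threshold)    # both classes plausible
confident_set = prediction_set([0.95, 0.05], threshold)  # one class dominates
```

The set size (`len(ambiguous_set)` versus `len(confident_set)`) is the quantity the study correlates with the number of distinct human labels per instance.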
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning ICCV 2025
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLM reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we develop a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKG construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at https://github.com/Wings-Of-Disaster/VaLiK.
comment: 14 pages, 7 figures, 6 tables; Accepted by ICCV 2025
SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting SP
Lightweight building surface models are crucial for digital cities, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website: https://lzh282140127-cell.github.io/SF-Recon-project/
comment: This paper has been submitted to the 2026 ISPRS Congress
Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes
With the rapid advancement of intelligent transportation systems, text-driven image generation and editing techniques have demonstrated significant potential in providing rich, controllable visual scene data for applications such as traffic monitoring and autonomous driving. However, several challenges remain, including insufficient semantic richness of generated traffic elements, limited camera viewpoints, low visual fidelity of synthesized images, and poor alignment between textual descriptions and generated content. To address these issues, we propose a unified text-driven framework for both image generation and editing, leveraging a controllable mask mechanism to seamlessly integrate the two tasks. Furthermore, we incorporate both vehicle-side and roadside multi-view data to enhance the geometric diversity of traffic scenes. Our training strategy follows a two-stage paradigm: first, we perform conceptual learning using large-scale coarse-grained text-image data; then, we fine-tune with fine-grained descriptive data to enhance text-image alignment and detail quality. Additionally, we introduce a mask-region-weighted loss that dynamically emphasizes small yet critical regions during training, thereby substantially enhancing the generation fidelity of small-scale traffic elements. Extensive experiments demonstrate that our method achieves leading performance in text-based image generation and editing within traffic scenes.
Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder
Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
Open-Set Domain Generalization through Spectral-Spatial Uncertainty Disentanglement for Hyperspectral Image Classification
Open-set domain generalization (OSDG) tackles the dual challenge of recognizing unknown classes while simultaneously striving to generalize across unseen domains without using target data during training. In this article, an OSDG framework for hyperspectral image classification is proposed, centered on a new Spectral-Spatial Uncertainty Disentanglement mechanism. It has been designed to address the domain shift influencing the spectral, spatial, and combined feature extraction pathways using evidential deep learning, after which the most reliable pathway for each sample is adaptively selected. The proposed framework is further integrated with frequency-domain feature extraction for domain-invariant representation learning, dual-channel residual networks for spectral-spatial feature extraction, and evidential deep learning based uncertainty quantification. Experiments conducted on three cross-scene hyperspectral datasets show that performance comparable to state-of-the-art domain adaptation methods can be achieved despite no access to target data, while high unknown-class rejection and known-class accuracy levels are maintained. The implementation will be available at github.com/amir-khb/UGOSDG upon acceptance.
Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution
The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously assess cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively in an open-set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
comment: 17 pages, 6 figures, 6 tables
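The second stage described in this abstract, a k-NN classifier over learned embeddings, can be sketched as below. The sketch assumes cosine similarity and majority voting, neither of which is specified in the abstract; the two-dimensional embeddings and the labels are toy placeholders standing in for the output of the contrastively trained encoder.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def knn_predict(query, support, labels, k=3):
    """Majority vote among the k support embeddings most similar to the query."""
    ranked = sorted(range(len(support)), key=lambda i: cosine(query, support[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical few-shot support set: embeddings with their class labels.
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["real", "real", "synthetic", "synthetic"]
prediction = knn_predict([0.8, 0.2], support, labels, k=3)
```

Because only the support set changes when a new generator appears, this stage can be updated by adding a few labeled embeddings, without retraining the encoder.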
BeyondFacial: Identity-Preserving Personalized Generation Beyond Facial Close-ups
Identity-Preserving Personalized Generation (IPPG) has advanced film production and artistic creation, yet existing approaches overemphasize facial regions, resulting in outputs dominated by facial close-ups. These methods suffer from weak visual narrativity and poor semantic consistency under complex text prompts, with the core limitation rooted in identity (ID) feature embeddings undermining the semantic expressiveness of generative models. To address these issues, this paper presents an IPPG method that breaks the constraint of facial close-ups, achieving synergistic optimization of identity fidelity and scene semantic creation. Specifically, we design a Dual-Line Inference (DLI) pipeline with identity-semantic separation, resolving the representation conflict between ID and semantics inherent in traditional single-path architectures. Further, we propose an Identity Adaptive Fusion (IdAF) strategy that defers ID-semantic fusion to the noise prediction stage, integrating adaptive attention fusion and noise decision masking to avoid ID embedding interference on semantics without manual masking. Finally, an Identity Aggregation Prepending (IdAP) module is introduced to aggregate ID information and replace random initializations, further enhancing identity preservation. Experimental results validate that our method achieves stable and effective performance in IPPG tasks beyond facial close-ups, enabling efficient generation without manual masking or fine-tuning. As a plug-and-play component, it can be rapidly deployed in existing IPPG frameworks, addressing the over-reliance on facial close-ups, facilitating film-level character-scene creation, and providing richer personalized generation capabilities for related domains.
comment: 16 pages, 16 figures
Mask2IV: Interaction-Centric Video Generation via Mask Trajectories AAAI 2026
Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.
comment: AAAI 2026. Project page: https://reagan1311.github.io/mask2iv
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
comment: 9 Pages, 6 Figures, 4 Tables
A statistical method for crack pre-detection in 3D concrete images
In practical applications, effectively segmenting cracks in large-scale computed tomography (CT) images holds significant importance for understanding the structural integrity of materials. Classical image-processing techniques and modern deep-learning models both face substantial computational challenges when applied directly to large, high-resolution volumes. This paper introduces a statistical framework for crack pre-localization, whose purpose is not to replace or compete with segmentation networks, but to identify, with controlled error rates, the regions of a 3D CT image that are most likely to contain cracks. The method combines a simple Hessian-based filter, geometric descriptors computed on a regular spatial partition, and a spatial multiple testing procedure to detect anomalous regions while relying only on minimal calibration data, rather than large annotated datasets. Experiments on semi-synthetic and real 3D CT scans demonstrate that the proposed approach reliably highlights regions likely to contain cracks while preserving linear computational complexity. By restricting subsequent high-resolution segmentation to these localized regions, deep-learning models can be trained and operated more efficiently, reducing both training runtime and resource consumption. The framework thus offers a practical and interpretable preprocessing step for large-scale CT inspection pipelines.
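To illustrate what testing many regions "with controlled error rates" can look like, here is a minimal sketch. It uses the standard Benjamini-Hochberg procedure as a stand-in, which is an assumption: the abstract does not name its spatial multiple testing procedure, and Benjamini-Hochberg ignores spatial structure. The per-region p-values and the function name `benjamini_hochberg` are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of regions flagged at false discovery rate alpha.

    Standard Benjamini-Hochberg step-up procedure, used here only as a
    stand-in for the (unspecified) spatial procedure in the abstract.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # k * alpha / m for rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank that passes
        rejected[order[: k + 1]] = True            # reject everything up to it
    return rejected

# Hypothetical p-values, one per cell of a regular spatial partition.
region_p = [0.001, 0.8, 0.02, 0.5, 0.003, 0.9]
flagged = benjamini_hochberg(region_p, alpha=0.05)
```

Only the flagged cells would then be passed to a downstream segmentation model, which is the pre-localization role the abstract describes.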
Draft and Refine with Visual Experts
While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems. Code is available at https://github.com/EavnJeong/Draft-and-Refine-with-Visual-Experts.
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR.
comment: Github Repo: https://github.com/PaddlePaddle/PaddleOCR
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
comment: Survey; 7 figures, 3 tables, 44 pages
Automated Muscle and Fat Segmentation in Computed Tomography for Comprehensive Body Composition Analysis
Body composition assessment using CT images can potentially be used for a number of clinical applications, including the prognostication of cardiovascular outcomes, evaluation of metabolic health, monitoring of disease progression, assessment of nutritional status, prediction of treatment response in oncology, and risk stratification for surgical and critical care outcomes. While multiple groups have developed in-house segmentation tools for this analysis, there are very few publicly available tools that could be consistently used across different applications. To mitigate this gap, we present a publicly accessible, end-to-end segmentation and feature calculation model specifically for CT body composition analysis. Our model performs segmentation of skeletal muscle, subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT) across the chest, abdomen, and pelvis area in axial CT images. It also provides various body composition metrics, including muscle density, visceral-to-subcutaneous fat (VAT/SAT) ratio, muscle area/volume, and skeletal muscle index (SMI), supporting both 2D and 3D assessments. To evaluate the model, the segmentation was applied to both internal and external datasets, with body composition metrics analyzed across different age, sex, and race groups. The model achieved high Dice coefficients on both internal and external datasets, exceeding 89% for skeletal muscle, SAT, and VAT segmentation. The model outperforms the benchmark by 2.10% on skeletal muscle and 8.6% on SAT compared to the manual annotations given by the publicly available dataset. Body composition metrics show mean relative absolute errors (MRAEs) under 10% for all measures. Our model with weights is publicly available at https://github.com/mazurowski-lab/CT-Muscle-and-Fat-Segmentation.git.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:026
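The Dice coefficient reported in the abstract above measures the overlap between a predicted mask and a reference mask as 2|A ∩ B| / (|A| + |B|). A minimal sketch follows; the function name and the toy masks are hypothetical, and this is not taken from the authors' repository.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Toy 2x3 masks: 3 predicted pixels, 3 reference pixels, 2 shared.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

For multi-class output such as muscle, SAT, and VAT, the same function would be applied once per class on the corresponding binary mask.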
Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis
We propose GRGS, a generalizable and relightable 3D Gaussian framework for high-fidelity human novel view synthesis under diverse lighting conditions. Unlike existing methods that rely on per-character optimization or ignore physical constraints, GRGS adopts a feed-forward, fully supervised strategy projecting geometry, material, and illumination cues from multi-view 2D observations into 3D Gaussian representations. To recover accurate geometry under diverse lighting conditions, we introduce a Lighting-robust Geometry Refinement (LGR) module trained on synthetically relit data to predict precise depth and surface normals. Based on the high-quality geometry, a Physically Grounded Neural Rendering (PGNR) module is further proposed to integrate neural prediction with physics-based shading, supporting editable relighting with shadows and indirect illumination. Moreover, we design a 2D-to-3D projection training scheme leveraging differentiable supervision from ambient occlusion, direct, and indirect lighting maps, alleviating the computational cost of ray tracing. Extensive experiments demonstrate that GRGS achieves superior visual quality, geometric consistency, and generalization across characters and lighting conditions.
comment: Project Webpage: https://sypj-98.github.io/grgs/
Image and Video Processing
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems, including single-lens and metalens designs, is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatially varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These results suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler.
comment: All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler
OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation
The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib's PSFs, which is assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2.
comment: The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/OmniLens2
MedImageInsight for Thoracic Cavity Health Classification from Chest X-rays
Chest radiography remains one of the most widely used imaging modalities for thoracic diagnosis, yet increasing imaging volumes and radiologist workload continue to challenge timely interpretation. In this work, we investigate the use of MedImageInsight, a medical imaging foundational model, for automated binary classification of chest X-rays into Normal and Abnormal categories. Two approaches were evaluated: (1) fine-tuning MedImageInsight for end-to-end classification, and (2) employing the model as a feature extractor for a transfer learning pipeline using traditional machine learning classifiers. Experiments were conducted using a combination of the ChestX-ray14 dataset and real-world clinical data sourced from partner hospitals. The fine-tuned classifier achieved the highest performance, with an ROC-AUC of 0.888 and superior calibration compared to the transfer learning models, demonstrating performance comparable to established architectures such as CheXNet. These results highlight the effectiveness of foundational medical imaging models in reducing task-specific training requirements while maintaining diagnostic reliability. The system is designed for integration into web-based and hospital PACS workflows to support triage and reduce radiologist burden. Future work will extend the model to multi-label pathology classification to provide preliminary diagnostic interpretation in clinical environments.
comment: 9 pages, 5 figures and 3 tables
Parameter-Free Neural Lens Blur Rendering for High-Fidelity Composites
Consistent and natural camera lens blur is important for seamlessly blending 3D virtual objects into photographed real scenes. Since lens blur typically varies with scene depth, the placement of virtual objects and their corresponding blur levels significantly affect the visual fidelity of mixed reality compositions. Existing pipelines often rely on camera parameters (e.g., focal length, focus distance, aperture size) and scene depth to compute the circle of confusion (CoC) for realistic lens blur rendering. However, such information is often unavailable to ordinary users, limiting the accessibility and generalizability of these methods. In this work, we propose a novel compositing approach that directly estimates the CoC map from RGB images, bypassing the need for scene depth or camera metadata. The CoC values for virtual objects are inferred through a linear relationship between their signed CoC map and depth, and realistic lens blur is rendered using a neural reblurring network. Our method provides a flexible and practical solution for real-world applications. Experimental results demonstrate that our method achieves high-fidelity compositing with realistic defocus effects, outperforming state-of-the-art techniques in both qualitative and quantitative evaluations.
comment: Accepted by ISMAR 2025 with oral presentation. 10 pages, 11 figures
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of their deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
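The idea of perturbing the initial noise of a deterministic sampler and scoring candidates with a softmax over distances can be sketched in a toy setting. Everything below is an illustrative assumption rather than the paper's formulation: the toy ODE, the squared Euclidean distance, the temperature, the reward, and the names `toy_ode_sample` and `softmax_distance_policy` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_ode_sample(noise, steps=10):
    """Stand-in for a deterministic flow sampler: Euler steps of dx/dt = 1 - x."""
    x = noise.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * (1.0 - x)
    return x

def softmax_distance_policy(anchor, candidates, temperature=1.0):
    """Probability of each candidate from a softmax over negative squared distances."""
    d2 = np.sum((candidates - anchor) ** 2, axis=1)
    logits = -d2 / temperature
    logits = logits - logits.max()        # stabilize the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()

# Candidate set: neighbors of one initial noise vector, pushed through the ODE.
base_noise = rng.standard_normal(4)
neighbors = base_noise + 0.1 * rng.standard_normal((6, 4))
candidates = np.stack([toy_ode_sample(n) for n in neighbors])
anchor = toy_ode_sample(base_noise)

probs = softmax_distance_policy(anchor, candidates, temperature=0.05)

# Hypothetical reward (closeness to the all-ones vector) and group-relative advantages.
rewards = -np.sum((candidates - 1.0) ** 2, axis=1)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Policy-gradient style surrogate: advantage-weighted log-probabilities.
surrogate = float(np.sum(advantages * np.log(probs)))
```

In this toy version, sampling stays fully deterministic once the initial noise is fixed, which is the property the abstract highlights over SDE-based variants.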
Adaptive Receiver-Side Scheduling for Smooth Interactive Delivery
Interactive applications such as cloud gaming, XR streaming, and real-time inference depend on data objects arriving at a steady cadence. In practice, network delay variation and recovery dynamics at the receiver distort this cadence even when transports deliver all packets correctly, which produces visible jitter, stalls, and unstable playback. We present a lightweight receiver-side scheduling approach that regularizes release timing after recovery. The scheduler maintains an adaptive estimate of effective path delay and adjusts release times asymmetrically, responding quickly to late arrivals and only gradually to early ones. This upper-envelope behavior keeps release aligned with recent delay peaks and maintains smooth playback with minimal added latency. The scheduler runs entirely on the receiver clock and requires no feedback or synchronization. As a concrete example, we integrate receiver-side scheduling into the BitRipple Tunnel (BRT) overlay, an application-layer software system that forwards traffic without altering the underlying transport protocol. Within BRT, the scheduler functions as an independent module that regulates delivery timing for forwarded objects. Evaluating BRT with receiver-side scheduling on a cloud-gaming workload shows that the scheduler removes virtually all large jitter excursions and yields tightly clustered release intervals that improve visible smoothness. Broader latency improvements arise from the behavior of the full BRT overlay. Receiver-side scheduling can also be integrated modularly into transport stacks such as TCP, QUIC, WebRTC, UDP, or RTP, which are natural deployment points for future work.
comment: 25 pages, 6 figures, 1 table
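The asymmetric, upper-envelope delay tracking described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BRT implementation: the class name, the `rise` and `decay` gains, and the presence of a per-object media timestamp are hypothetical. The estimator only uses differences between the receiver clock and the media timestamp, so any constant clock offset is absorbed into the estimate and no synchronization is needed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EnvelopeScheduler:
    rise: float = 1.0     # gain for late arrivals (track delay peaks quickly)
    decay: float = 0.1    # gain for early arrivals (relax only gradually)
    delay_est: Optional[float] = None

    def release_time(self, media_ts: float, arrival_ts: float) -> float:
        """Return the receiver-clock time at which to release this object."""
        observed = arrival_ts - media_ts          # delay plus unknown clock offset
        if self.delay_est is None:
            self.delay_est = observed             # initialize on first object
        elif observed > self.delay_est:
            self.delay_est += self.rise * (observed - self.delay_est)
        else:
            self.delay_est += self.decay * (observed - self.delay_est)
        # Never release an object before it has actually arrived.
        return max(arrival_ts, media_ts + self.delay_est)

# Objects generated every 10 ms; the third one arrives late (times in ms).
scheduler = EnvelopeScheduler()
arrivals = [(0.0, 105.0), (10.0, 112.0), (20.0, 140.0), (30.0, 133.0)]
releases = [scheduler.release_time(m, a) for m, a in arrivals]
```

With `rise` set to 1.0 the estimate jumps to each new delay peak immediately, while the small `decay` lets it drift back down slowly, so release times stay aligned with recent peaks.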
Avoiding Quality Saturation in UGC Compression Using Denoised References
Video-sharing platforms must re-encode large volumes of noisy user-generated content (UGC) to meet streaming demands. However, conventional codecs, which aim to minimize the mean squared error (MSE) between the compressed and input videos, can cause quality saturation (QS) when applied to UGC, i.e., increasing the bitrate preserves input artifacts without improving visual quality. A direct approach to solve this problem is to detect QS by repeatedly evaluating a non-reference metric (NRM) on videos compressed with multiple codec parameters, which is inefficient. In this paper, we re-frame UGC compression and QS detection through the lens of noisy source coding theory: rather than using an NRM, we compute the MSE with respect to the denoised UGC, which serves as an alternative reference (D-MSE). Unlike MSE measured between the UGC input and the compressed UGC, D-MSE saturates at non-zero values as bitrates increase, a phenomenon we term distortion saturation (DS). Since D-MSE can be computed at the block level in the transform domain, we can efficiently detect D-MSE without coding and decoding with various parameters. We propose two methods for DS detection: distortion saturation detection (DSD), which relies on an input-dependent threshold derived from the D-MSE of the input UGC, and rate-distortion saturation detection (RDSD), which estimates the Lagrangian at the saturation point using a low-complexity compression method. Both methods work as a pre-processing step that can help standard-compliant codecs avoid QS in UGC compression. Experiments with AVC show that preventing encoding in the saturation region, i.e., avoiding encoding at QPs that result in QS according to our methods, achieves BD-rate savings of 8%-20% across multiple different NRMs, compared to a naïve baseline that encodes at the given input QP while ignoring QS.
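The saturation effect can be reproduced on a toy 1D signal. The sketch below is an illustration under assumptions, not the paper's block-level transform-domain method: a moving-average filter stands in for the denoiser, and uniform scalar quantization stands in for the codec. As the quantization step shrinks (higher bitrate), the ordinary MSE against the noisy input goes to zero, while D-MSE settles at the distance between the noisy input and its denoised version.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean + 0.2 * rng.standard_normal(t.size)    # stand-in for noisy UGC

# Stand-in denoiser: 9-tap moving average (an assumption, any denoiser works).
denoised = np.convolve(noisy, np.ones(9) / 9, mode="same")

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def quantize(x, step):
    """Uniform scalar quantization as a crude proxy for lossy coding."""
    return step * np.round(x / step)

# D-MSE of the uncompressed input: the level at which D-MSE saturates.
saturation_floor = mse(denoised, noisy)

results = {}
for step in (1.0, 0.5, 0.1, 0.01, 0.001):             # smaller step = higher rate
    coded = quantize(noisy, step)
    results[step] = (mse(noisy, coded), mse(denoised, coded))  # (MSE, D-MSE)
```

Comparing the two entries of `results` across steps shows the input-referenced MSE vanishing while D-MSE flattens near `saturation_floor`, which is the kind of input-dependent level a saturation detector can threshold against.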
Robust Detection of Retinal Neovascularization in Widefield Optical Coherence Tomography
Retinal neovascularization (RNV) is a vision-threatening development in diabetic retinopathy (DR). Vision loss associated with RNV is preventable with timely intervention, making RNV clinical screening and monitoring a priority. Optical coherence tomography (OCT) angiography (OCTA) provides high-resolution imaging and high-sensitivity detection of RNV lesions. With recent commercial devices introducing widefield OCTA imaging to the clinic, the technology stands to improve early detection of RNV pathology. However, to meet clinical requirements, these imaging capabilities must be combined with effective RNV detection and quantification, and existing algorithms for OCTA images are optimized for conventional, i.e., narrow, fields of view. Here, we present a novel approach for RNV diagnosis and staging on widefield OCT/OCTA. Unlike conventional methods dependent on multi-layer retinal segmentation, our model reframes RNV identification as a direct binary localization task. Our fully automated approach was trained and validated on 589 widefield scans (17x17-mm to 26x21-mm) collected from multiple devices at multiple clinics. Our method achieved a device-dependent area under curve (AUC) ranging from 0.96 to 0.99 for RNV diagnosis, and mean intersection over union (IOU) ranging from 0.76 to 0.88 for segmentation. We also demonstrate our method's ability to monitor lesion growth longitudinally. Our results indicate that deep learning-based analysis for widefield OCTA images could offer a valuable means for improving RNV screening and management.
comment: 17 pages, 11 figures. Submitted to Optica. Corresponding author: Yali Jia. Affiliations: ((1) Casey Eye Institute, Oregon Health & Science University, USA (2) Department of Ophthalmology, Aichi Medical University, Japan (3) Department of Biomedical Engineering, Oregon Health & Science University, USA (4) Department of Ophthalmology, Mackay Memorial Hospital, Taiwan)
HazeMatching: Dehazing Light Microscopy Images with Guided Conditional Flow Matching
Fluorescence microscopy is a major driver of scientific progress in the life sciences. Although high-end confocal microscopes are capable of filtering out-of-focus light, cheaper and more accessible microscopy modalities, such as widefield microscopy, cannot, which consequently leads to hazy image data. Computational dehazing tries to combine the best of both worlds, leading to cheap microscopy but crisp-looking images. The perception-distortion trade-off tells us that we can optimize either for data fidelity, e.g. low MSE or high PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID. Existing methods either prioritize fidelity at the expense of realism, or produce perceptually convincing results that lack quantitative accuracy. In this work, we propose HazeMatching, a novel iterative method for dehazing light microscopy images, which effectively balances these objectives. Our goal was to find a balanced trade-off between the fidelity of the dehazing results and the realism of individual predictions (samples). We achieve this by adapting the conditional flow matching framework, guiding the generative process with a hazy observation in the conditional velocity field. We evaluate HazeMatching on 5 datasets, covering both synthetic and real data, assessing both distortion and perceptual quality. Our method is compared against 11 baselines, achieving a consistent balance between fidelity and realism on average. Additionally, with calibration analysis, we show that HazeMatching produces well-calibrated predictions. Note that our method does not need an explicit degradation operator to exist, making it easily applicable to real microscopy data. All data used for training and evaluation and our code will be publicly available under a permissive license.
comment: 4 figures, 8 pages + refs, 45 pages total (including supplement), 28 supplementary figures
Full-Wave Modeling of Transcranial Ultrasound using Volume-Surface Integral Equations and CT-Derived Heterogeneous Skull Data
Transcranial ultrasound therapy uses focused acoustic energy to induce therapeutic bioeffects in the brain. Ultrasound must be transmitted through the skull, which is highly attenuating and heterogeneous, causing beam distortion, reducing focal pressure, and shifting the target location. Computational models are frequently used to predict beam aberration, assess cranial heating, and correct the phase of ultrasound transducers. These models often rely on computed tomography (CT) images to build patient-specific geometries and estimate skull acoustic properties. However, the coarse voxel resolution of CT limits accuracy for differential equation solvers at ultrasound frequencies. This paper presents an efficient numerical method based on volume-surface integral equations to model full-wave acoustic propagation through heterogeneous skull bone. We show that our approach effectively simulates transcranial ultrasound, even when using the original CT voxels as the computational mesh, where the 0.5 mm voxel length is relatively coarse compared to the shortest wavelength of 3 mm. The method is validated against a high-resolution boundary element model using an averaged skull representation. Simulations using a CT-based skull model and a bowl transducer reveal significant beam distortion of 7.8 mm attributed to the skull's heterogeneous acoustical properties.
Deep Learning Analysis of Prenatal Ultrasound for Identification of Ventriculomegaly
The proposed study aimed to develop a deep learning model capable of detecting ventriculomegaly on prenatal ultrasound images. Ventriculomegaly is a prenatal condition characterized by dilated cerebral ventricles of the fetal brain and is important to diagnose early, as it can be associated with an increased risk for fetal aneuploidies and/or underlying genetic syndromes. An Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), recently developed by our group, was fine-tuned for a binary classification task to distinguish fetal brain ultrasound images as either normal or showing ventriculomegaly. The USF-MAE incorporates a Vision Transformer encoder pretrained on more than 370,000 ultrasound images from the OpenUS-46 corpus. For this study, the pretrained encoder was adapted and fine-tuned on a curated dataset of fetal brain ultrasound images to optimize its performance for ventriculomegaly detection. Model evaluation was conducted using 5-fold cross-validation and an independent test cohort, and performance was quantified using accuracy, precision, recall, specificity, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed USF-MAE model reached an F1-score of 91.76% on the 5-fold cross-validation and 91.78% on the independent test set, exceeding the baseline models by 19.37% and 16.15% (VGG-19), 2.31% and 2.56% (ResNet-50), and 5.03% and 11.93% (ViT-B/16), respectively. The model also showed a high mean test precision of 94.47% and an accuracy of 97.24%. The Eigen-CAM (Eigen Class Activation Map) heatmaps showed that the model focused on the ventricle area when diagnosing ventriculomegaly, supporting the explainability and clinical plausibility of its predictions.
comment: 13 pages, 7 figures, 3 tables
Smooth Total variation Regularization for Interference Detection and Elimination (STRIDE) for MRI
MRI is increasingly expected to function near electronic devices that emit potentially dynamic electromagnetic interference (EMI). To accommodate this, we propose the STRIDE method, which improves on previous external-sensor-based EMI removal methods by exploiting inherent MR image smoothness in its total variation. STRIDE measures data from both EMI detectors and primary MR imaging coils, transforms this data into the image domain, and for each column of the resulting image array, combines and subtracts data from the EMI detectors in a way that optimizes for total-variation smoothness. Performance was tested on phantom and in-vivo datasets with a 0.5T scanner. STRIDE resulted in visually better EMI removal, higher temporal SNR, larger EMI removal percentage, and lower RMSE than standard implementations. STRIDE is a robust technique that leverages inherent MR image properties to provide improved EMI removal performance over standard algorithms, particularly for time-varying noise sources.
Multimedia
MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core
Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to automatically derive musical features, creating a complete, coherent melodic score solely from the lyrics. The MusicAIR framework facilitates music generation from lyrics, text, and images. The generated score adheres to established principles of music theory, lyrical structure, and rhythmic conventions. We developed Generate AI Music (GenAIM), a web tool using MusicAIR for lyric-to-song, text-to-music, and image-to-music generation. In our experiments, we evaluated AI-generated music scores produced by the system using both standard music metrics and innovative analysis that compares these compositions with original works. The system achieves an average key confidence of 85%, outperforming human composers at 79%, and aligns closely with established music theory standards, demonstrating its ability to generate diverse, human-like compositions. As a co-pilot tool, GenAIM can serve as a reliable music composition assistant and a possible educational composition tutor while simultaneously lowering the entry barrier for all aspiring musicians, which is innovative and significantly contributes to AI for music generation.
comment: Accepted by IEEE Big Data 2025
RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis AAAI 2026
We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision
comment: Accepted to AAAI 2026 (Oral)
Avoiding Quality Saturation in UGC Compression Using Denoised References
Video-sharing platforms must re-encode large volumes of noisy user-generated content (UGC) to meet streaming demands. However, conventional codecs, which aim to minimize the mean squared error (MSE) between the compressed and input videos, can cause quality saturation (QS) when applied to UGC, i.e., increasing the bitrate preserves input artifacts without improving visual quality. A direct approach to solve this problem is to detect QS by repeatedly evaluating a non-reference metric (NRM) on videos compressed with multiple codec parameters, which is inefficient. In this paper, we re-frame UGC compression and QS detection through the lens of noisy source coding theory: rather than using an NRM, we compute the MSE with respect to the denoised UGC, which serves as an alternative reference (D-MSE). Unlike MSE measured between the UGC input and the compressed UGC, D-MSE saturates at non-zero values as bitrates increase, a phenomenon we term distortion saturation (DS). Since D-MSE can be computed at the block level in the transform domain, we can efficiently detect D-MSE without coding and decoding with various parameters. We propose two methods for DS detection: distortion saturation detection (DSD), which relies on an input-dependent threshold derived from the D-MSE of the input UGC, and rate-distortion saturation detection (RDSD), which estimates the Lagrangian at the saturation point using a low-complexity compression method. Both methods work as a pre-processing step that can help standard-compliant codecs avoid QS in UGC compression. Experiments with AVC show that preventing encoding in the saturation region, i.e., avoiding encoding at QPs that result in QS according to our methods, achieves BD-rate savings of 8%-20% across multiple different NRMs, compared to a naïve baseline that encodes at the given input QP while ignoring QS.
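The saturation effect described above can be reproduced on a toy signal. In this sketch a uniform scalar quantizer stands in for the codec, and the clean signal stands in for an ideal denoiser output; both are simplifying assumptions for illustration, not the paper's AVC pipeline.

```python
import numpy as np


def quantize(x, step):
    """Uniform scalar quantizer, used here as a stand-in for a codec."""
    return step * np.round(x / step)


def mse(a, b):
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
clean = 50.0 * np.sin(np.linspace(0.0, 8.0 * np.pi, 4096))  # assumed denoised reference
noisy = clean + rng.normal(0.0, 5.0, clean.shape)           # noisy UGC input

# Smaller step = finer quantization = higher bitrate.
steps = [32, 16, 8, 4, 2, 1, 0.5]
rows = []
for s in steps:
    compressed = quantize(noisy, s)
    rows.append((s, mse(noisy, compressed), mse(clean, compressed)))

for s, mse_vs_input, d_mse in rows:
    print(f"step={s:>4}: MSE vs input={mse_vs_input:8.3f}  D-MSE={d_mse:8.3f}")
```

As the step shrinks, the MSE against the noisy input keeps falling toward zero, while the D-MSE levels off near the noise variance (25 in this setup): extra bits only reproduce the noise more faithfully.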
PrismSSL: One Interface, Many Modalities; A Single-Interface Library for Multimodal Self-Supervised Learning
We present PrismSSL, a Python library that unifies state-of-the-art self-supervised learning (SSL) methods across audio, vision, graphs, and cross-modal settings in a single, modular codebase. The goal of the demo is to show how researchers and practitioners can: (i) install, configure, and run pretext training with a few lines of code; (ii) reproduce compact benchmarks; and (iii) extend the framework with new modalities or methods through clean trainer and dataset abstractions. PrismSSL is packaged on PyPI, released under the MIT license, integrates tightly with HuggingFace Transformers, and provides quality-of-life features such as distributed training in PyTorch, Optuna-based hyperparameter search, LoRA fine-tuning for Transformer backbones, animated embedding visualizations for sanity checks, Weights & Biases logging, and colorful, structured terminal logs for improved usability and clarity. In addition, PrismSSL offers a graphical dashboard - built with Flask and standard web technologies - that enables users to configure and launch training pipelines with minimal coding. The artifact (code and data recipes) will be publicly available and reproducible.
Breaking the Curse of Knowledge: Towards Effective Multimodal Recommendation using Knowledge Soft Integration
A critical challenge in contemporary recommendation systems lies in effectively leveraging multimodal content to enhance recommendation personalization. Although various solutions have been proposed, most fail to account for discrepancies between knowledge extracted through isolated feature extraction and its application in recommendation tasks. Specifically, multimodal feature extraction does not incorporate task-specific prior knowledge, while downstream recommendation tasks typically use these features as auxiliary information. This misalignment often introduces biases in model fitting and degrades performance, a phenomenon we refer to as the curse of knowledge. To address this challenge, we propose a knowledge soft integration framework designed to balance the utilization of multimodal features with the biases they may introduce. The framework, named Knowledge Soft Integration (KSI), comprises two key components: the Structure Efficient Injection (SEI) module and the Semantic Soft Integration (SSI) module. The SEI module employs a Refined Graph Neural Network (RGNN) to model inter-modal correlations among items while introducing a regularization term to minimize redundancy in user and item representations. In parallel, the SSI module utilizes a self-supervised retrieval task to implicitly integrate multimodal semantic knowledge, thereby enhancing the semantic distinctiveness of item representations. We conduct comprehensive experiments on three benchmark datasets, demonstrating KSI's effectiveness. Furthermore, these results underscore the ability of the SEI and SSI modules to reduce representation redundancy and mitigate the curse of knowledge in multimodal recommendation systems.
comment: Accepted to IEEE Transactions on Multimedia (TMM)
Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition
With the proliferation of social media posts in recent years, the need to detect sentiments in multimodal (image-text) content has grown rapidly. Since posts are user-generated, the image and text from the same post can express different or even contradictory sentiments, leading to potential sentiment discrepancy. However, existing works mainly adopt a single-branch fusion structure that primarily captures the consistent sentiment between image and text. Ignoring discrepant sentiment, or modeling it only implicitly, results in compromised unimodal encoding and limited performance. In this paper, we propose a semantics Completion and Decomposition (CoDe) network to resolve the above issue. In the semantics completion module, we complement image and text representations with the semantics of the in-image text, helping bridge the sentiment gap. In the semantics decomposition module, we decompose image and text representations with exclusive projection and contrastive learning, thereby explicitly capturing the discrepant sentiment between modalities. Finally, we fuse image and text representations by cross-attention and combine them with the learned discrepant sentiment for final classification. Extensive experiments on four datasets demonstrate the superiority of CoDe and the effectiveness of each proposed module.
comment: Accepted by Pattern Recognition
HDCompression: Hybrid-Diffusion Image Compression for Ultra-Low Bitrates PRICAI 2025
Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latent, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving index map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms the previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced robust compression performance at ultra-low bitrates.
comment: Accepted by PRICAI 2025 (Oral Presentation)
REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints
Articulated objects are prevalent in human life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling realistic surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of two arbitrary states of an articulated object, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate that our approach achieves high-quality textured surface reconstruction for given states and enables high-fidelity surface generation for unseen states. Project site: https://sites.google.com/view/reartgs/home.
comment: 11 pages, 6 figures
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multimodal models that can exploit both pieces of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multimodal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multimodal video forgery detection. We use the transformer-based SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
Computation and Language
Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards
Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR's scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in settings where only outcomes are verifiable.
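To make the two self-supervised tasks above concrete, here is one way such process-level rewards could be scored. The abstract does not specify the reward functions, so both definitions below (exact match after whitespace and case normalization for "masked-then-fill", and pairwise order agreement for "step reordering") are hypothetical stand-ins, as are the function names.

```python
from itertools import combinations


def fill_reward(predicted, reference):
    """Hypothetical masked-then-fill reward: 1.0 on normalized exact match."""
    def normalize(s):
        return " ".join(s.split()).lower()
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def reorder_reward(predicted_order, true_order):
    """Hypothetical step-reordering reward.

    Returns the fraction of step pairs whose relative order in the
    prediction agrees with the reference order (1.0 means fully correct).
    """
    if sorted(predicted_order) != sorted(true_order):
        return 0.0  # prediction must be a permutation of the reference steps
    position = {step: i for i, step in enumerate(predicted_order)}
    pairs = list(combinations(true_order, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)
```

A graded reordering reward of this kind gives partial credit when most steps are in place, which provides a denser signal than an all-or-nothing check on the final answer.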
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
comment: website: https://sketchverify.github.io/
SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation
Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
comment: 23 pages, 6 tables, 9 figures
Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training
Multiple-choice question answering (MCQA) has been a popular format for evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a different rewriting and verification scheme to each type. When applying ReVeL to RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
comment: Project url: https://flageval-baai.github.io/ReVeL/
PUCP-Metrix: A Comprehensive Open-Source Repository of Linguistic Metrics for Spanish EACL
Linguistic features remain essential for interpretability and tasks involving style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source repository of 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. PUCP-Metrix enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive, extensible resource for Spanish, supporting diverse NLP applications.
comment: 1 figure, to be submitted to EACL Demo track
Selective Rotary Position Embedding
Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism that generalizes RoPE and enables rotation by arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
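The contrast between fixed-angle and input-dependent rotations can be sketched in a few lines of NumPy. Standard RoPE rotates each 2D feature pair by an angle proportional to the token position. The selective variant below instead accumulates per-token angle increments. How the paper computes those increments from the input is not stated in the abstract, so `selective_angles` takes them as an argument and its cumulative-sum form is an assumption.

```python
import numpy as np


def rotate_pairs(x, angles):
    """Rotate each feature pair (x[:, 2i], x[:, 2i+1]) by angles[:, i].

    x: (T, d) with even d; angles: (T, d // 2).
    """
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def rope_angles(T, d, base=10000.0):
    """Standard RoPE: angle = position * base^(-2i/d) for pair index i."""
    freqs = base ** (-2.0 * np.arange(d // 2) / d)
    return np.arange(T)[:, None] * freqs[None, :]


def selective_angles(increments):
    """Assumed selective form: cumulative sum of per-token angle increments.

    increments: (T, d // 2); in a real model these would be computed from
    each token's features (for example by a learned projection).
    """
    return np.cumsum(increments, axis=0)


rng = np.random.default_rng(0)
T, d = 6, 8
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))

fixed = rope_angles(T, d)
# Constant increments recover standard RoPE up to a global rotation.
constant = selective_angles(np.tile(fixed[1], (T, 1)))
scores_fixed = rotate_pairs(q, fixed) @ rotate_pairs(k, fixed).T
scores_const = rotate_pairs(q, constant) @ rotate_pairs(k, constant).T
```

With constant increments the attention scores match standard RoPE exactly, since only angle differences between positions enter the query-key dot product. Letting the increments vary per token is what makes the rotation selective.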
Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.
A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin
We present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of words' meanings. We used the generalized additive model to decompose a given observed pitch contour into a set of component pitch contours that are tied to different control variables and semantic predictors. Even when variables such as word duration, gender, speaker identity, tonal context, vowel height, and utterance position are controlled for, the effect of word remains a strong predictor of tonal realization. We present evidence that this effect of word is a semantic effect: word sense is shown to be a better predictor than word, and heterographic homophones are shown to have different pitch contours. The strongest evidence for the importance of semantics is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with an accuracy that substantially exceeds a permutation baseline. For phonetics, distributional semantics is a new kid on the block. Although our findings challenge standard theories of Mandarin tone, they fit well within the theoretical framework of the Discriminative Lexicon Model.
comment: arXiv admin note: text overlap with arXiv:2409.07891
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on the robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, the current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach to feed text embeddings directly into the LLM decoder to mitigate the high abstraction of the information in text by Q-former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves the confirmation and action planning by integrating VideoLLaMA3.
comment: Accepted to ASRU 2025
Humanlike Multi-user Agent (HUMA): Designing a Deceptively Human AI Facilitator for Group Chats
Conversational agents built on large language models (LLMs) are becoming increasingly prevalent, yet most systems are designed for one-on-one, turn-based exchanges rather than natural, asynchronous group chats. As AI assistants become widespread throughout digital platforms, from virtual assistants to customer service, developing natural and humanlike interaction patterns seems crucial for maintaining user trust and engagement. We present the Humanlike Multi-user Agent (HUMA), an LLM-based facilitator that participates in multi-party conversations using human-like strategies and timing. HUMA extends prior multi-user chatbot work with an event-driven architecture that handles messages, replies, and reactions, and introduces realistic response-time simulation. HUMA comprises three components (Router, Action Agent, and Reflection), which together adapt LLMs to group conversation dynamics. We evaluate HUMA in a controlled study with 97 participants in four-person role-play chats, comparing AI and human community managers (CMs). Participants classified CMs as human at near-chance rates in both conditions, indicating they could not reliably distinguish HUMA agents from humans. Subjective experience was comparable across conditions: community-manager effectiveness, social presence, and engagement/satisfaction differed only modestly with small effect sizes. Our results suggest that, in natural group chat settings, an AI facilitator can match human quality while remaining difficult to identify as nonhuman.
comment: 9 pages, 4 figures
Large Language Models for Sentiment Analysis to Detect Social Challenges: A Use Case with South African Languages
Sentiment analysis can aid in understanding people's opinions and emotions on social issues. In multilingual communities, sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large language models (LLMs) have become available to the general public, and initial analyses have shown that they exhibit impressive zero-shot sentiment analysis abilities in English. However, no prior work has investigated leveraging LLMs for sentiment analysis on social media posts in South African languages to detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 most emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are big differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable information about sentiment analysis to detect social challenges and draw conclusions about possible needs for actions on specific topics and within different language groups.
comment: Published in the Proceedings of The Southern African Conference on AI Research (SACAIR 2024), Bloemfontein, South Africa, 2-6 December 2024. ISBN: 978-0-7961-6069-0
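The abstract above reports gains from fusing the outcomes of several LLMs but does not say how the fusion is done. One common scheme is majority voting over the per-model labels; the sketch below assumes that scheme, and the function name `fuse_labels` and the tie-breaking rule are illustrative choices.

```python
from collections import Counter


def fuse_labels(votes, tie_break="neutral"):
    """Majority vote over sentiment labels predicted by several models.

    votes: list of labels such as "positive", "negative", "neutral".
    Returns tie_break when there are no votes or the top count is tied.
    """
    if not votes:
        return tie_break
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Hypothetical per-model outputs for one post.
fused = fuse_labels(["negative", "negative", "positive", "negative", "neutral"])
```

Voting helps most when the models make different mistakes, so a single model's error on a given post is outvoted by the others.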
Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation
In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open source models on the human translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.
comment: Preprint
Cross-cultural value alignment frameworks for responsible AI governance: Evidence from China-West comparative analysis
As Large Language Models (LLMs) increasingly influence high-stakes decision-making across global contexts, ensuring their alignment with diverse cultural values has become a critical governance challenge. This study presents a Multi-Layered Auditing Platform for Responsible AI that systematically evaluates cross-cultural value alignment in China-origin and Western-origin LLMs through four integrated methodologies: Ethical Dilemma Corpus for assessing temporal stability, Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. Our comparative analysis of 20+ leading models, such as Qwen, GPT-4o, Claude, LLaMA, and DeepSeek, reveals universal challenges -- fundamental instability in value systems, systematic under-representation of younger demographics, and non-linear relationships between model scale and alignment quality -- alongside divergent regional development trajectories. While China-origin models increasingly emphasize multilingual data integration for context-specific optimization, Western models demonstrate greater architectural experimentation but persistent U.S.-centric biases. Neither paradigm achieves robust cross-cultural generalization. We establish that Mistral-series architectures significantly outperform LLaMA3-series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets surpasses Reinforcement Learning from Human Feedback in preserving cultural variation...
comment: Presented at the academic conference "Technology for Good: Driving Social Impact" (2025)
Social-Media Based Personas Challenge: Hybrid Prediction of Common and Rare User Actions on Bluesky
Understanding and predicting user behavior on social media platforms is crucial for content recommendation and platform design. While existing approaches focus primarily on common actions like retweeting and liking, the prediction of rare but significant behaviors remains largely unexplored. This paper presents a hybrid methodology for social media user behavior prediction that addresses both frequent and infrequent actions across a diverse action vocabulary. We evaluate our approach on a large-scale Bluesky dataset containing 6.4 million conversation threads spanning 12 distinct user actions across 25 persona clusters. Our methodology combines four complementary approaches: (i) a lookup database system based on historical response patterns; (ii) persona-specific LightGBM models with engineered temporal and semantic features for common actions; (iii) a specialized hybrid neural architecture fusing textual and temporal representations for rare action classification; and (iv) generation of text replies. Our persona-specific models achieve an average macro F1-score of 0.64 for common action prediction, while our rare action classifier achieves 0.56 macro F1-score across 10 rare actions. These results demonstrate that effective social media behavior prediction requires tailored modeling strategies recognizing fundamental differences between action types. Our approach achieved first place in the SocialSim: Social-Media Based Personas challenge organized at the Social Simulation with LLMs workshop at COLM 2025.
comment: 1st place at SocialSim: Social-Media Based Personas challenge 2025
Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables
The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35\% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLM models for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.
comment: Accepted as Spotlight Talk at EurIPS 2025 Workshop on AI For Tabular Data
Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy of large language models (LLMs) under social pressure exerted through authority and persuasion, i.e., the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" ($\leq 11\%$, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. At the domain level, international law and global knowledge exhibit high fragility, while elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents
LLM-based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine-grained but fragmented views of the dialogue. Motivated by neo-Davidsonian event semantics, we propose an event-centric alternative that represents conversational history as short, event-like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non-compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) -- self-contained statements with normalized entities and source turn attributions -- and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval-based variants that use dense similarity search and LLM filtering, with an optional graph-based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval$_S$ benchmarks show that these event-centric memories match or surpass strong baselines, while operating with much shorter QA contexts. Our results suggest that structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents. Our code and data will be released at https://github.com/KevinSRR/EMem.
comment: Work in progress
E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models
With the increasing size of large language models, layer pruning has gained increased attention as a hardware-friendly approach for model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-Effective, training-Economical, and inference-Efficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, E$^3$-Pruner achieves 96\% accuracy, a mere 0.8\% drop from the original model (96.8\%) on MATH-500 when pruning 25\% of the layers of Qwen3-32B, outperforming existing SOTA (95\%), with a 1.33$\times$ inference speedup while consuming merely 0.5B tokens (0.5\% of the post-training data volume).
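As a rough illustration of the Gumbel-TopK sampling idea named in the abstract above, the sketch below perturbs per-layer scores with Gumbel(0, 1) noise and keeps the k largest, which yields a random binary keep-mask with exactly k ones. This is not the authors' implementation: the function name `gumbel_topk_mask`, the example scores, and the choice of 8 layers are assumptions made for illustration, and the differentiable relaxation needed for gradient-based mask optimization is not shown.

```python
import math
import random

def gumbel_topk_mask(logits, k, rng):
    """Sample a binary keep-mask with exactly k ones (hypothetical helper).

    Each score is perturbed with Gumbel(0, 1) noise, g = -log(-log(u)),
    and the k largest perturbed scores are kept.
    """
    perturbed = []
    for score in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        perturbed.append(score - math.log(-math.log(u)))
    order = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(logits))]

# Hypothetical importance scores for 8 layers; keep 6 of them (prune 25%).
rng = random.Random(0)
layer_logits = [2.0, 0.5, -1.0, 1.5, 0.0, 3.0, -0.5, 1.0]
mask = gumbel_topk_mask(layer_logits, k=6, rng=rng)
print(mask)
```

Layers with higher scores are more likely to be kept, but the noise lets lower-scored layers be explored during a mask search.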
AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale
For industrial-scale text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to trade off recall and noise, and scale poorly to large databases. We present AutoLink, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink's superior performance, achieving state-of-the-art strict schema linking recall of 97.4\% on Bird-Dev and 91.2\% on Spider-2.0-Lite, with competitive execution accuracy, i.e., 68.7\% EX on Bird-Dev (better than CHESS) and 34.9\% EX on Spider-2.0-Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits exceptional scalability, maintaining high recall, efficient token consumption, and robust execution accuracy on large schemas (e.g., over 3,000 columns) where existing methods severely degrade -- making it a highly scalable, high-recall schema-linking solution for industrial text-to-SQL systems.
Attention-Guided Feature Fusion (AGFF) Model for Integrating Statistical and Semantic Features in News Text Classification
News text classification is a crucial task in natural language processing, essential for organizing and filtering the massive volume of digital content. Traditional methods typically rely on statistical features like term frequencies or TF-IDF values, which are effective at capturing word-level importance but often fail to reflect contextual meaning. In contrast, modern deep learning approaches utilize semantic features to understand word usage within context, yet they may overlook simple, high-impact statistical indicators. This paper introduces an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features in a unified framework. The model applies an attention-based mechanism to dynamically determine the relative importance of each feature type, enabling more informed classification decisions. Through evaluation on benchmark news datasets, the AGFF model demonstrates superior performance compared to both traditional statistical models and purely semantic deep learning models. The results confirm that strategic integration of diverse feature types can significantly enhance classification accuracy. Additionally, ablation studies validate the contribution of each component in the fusion process. The findings highlight the model's ability to balance and exploit the complementary strengths of statistical and semantic representations, making it a practical and effective solution for real-world news classification tasks.
Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models AAAI 2026
Large Language Models (LLMs) often produce fluent but factually incorrect responses, a phenomenon known as hallucination. Abstention, where the model chooses not to answer and instead outputs phrases such as "I don't know", is a common safeguard. However, existing abstention methods typically rely on post-generation signals, such as generation variations or feedback, which limits their ability to prevent unreliable responses in advance. In this paper, we introduce Aspect-Based Causal Abstention (ABCA), a new framework that enables early abstention by analysing the internal diversity of LLM knowledge through causal inference. This diversity reflects the multifaceted nature of parametric knowledge acquired from various sources, representing diverse aspects such as disciplines, legal contexts, or temporal frames. ABCA estimates causal effects conditioned on these aspects to assess the reliability of knowledge relevant to a given query. Based on these estimates, we enable two types of abstention: Type-1, where aspect effects are inconsistent (knowledge conflict), and Type-2, where aspect effects consistently support abstention (knowledge insufficiency). Experiments on standard benchmarks demonstrate that ABCA improves abstention reliability, achieves state-of-the-art performance, and enhances the interpretability of abstention decisions.
comment: Accepted to AAAI 2026 (Main Technical Track)
The PLLuM Instruction Corpus
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.
LangMark: A Multilingual Dataset for Automatic Post-Editing ACL 2025
Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.
comment: 15 pages, 8 figures, ACL 2025
Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation AAAI'26
Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most of the LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.
comment: Accepted by AAAI'26
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
Geometric-Disentanglement Unlearning
Machine unlearning, the removal of a training subset's influence from a deployed model, is critical for privacy preservation and model reliability, yet gradient ascent on forget samples often harms retained knowledge. Existing approaches face a persistent tradeoff between effective forgetting and preservation on the retain set. While previous methods provide useful heuristics, they often lack a formal analysis of how exactly forgetting updates harm retained knowledge, and of whether the side effects can be removed with theoretical guarantees. To explore a theoretically sound and simple solution, we start from first principles on how performance on the retain set is actually affected: a first-order analysis of the local change of the retain loss under small parameter updates during model training. This yields a crisp equivalence: the retain loss is unchanged to first order iff the update direction is orthogonal to the subspace spanned by retain gradients ("retain-invariant"). This identifies the entangled component as the tangential part of the forget update within the retain-gradient subspace, and characterizes disentanglement as orthogonality. Guided by this, we propose Geometric-disentanglement Unlearning (GU), which decomposes any candidate forget gradient update into components tangential and normal to the retain space and executes only the normal component. Under a standard trust-region budget, the projected direction aligned with the raw forget gradient is optimal among all first-order retain-invariant moves, and we also derive the optimal projected direction for joint forget-retain updating objectives. Our method is plug-and-play and can be attached to existing gradient-based unlearning procedures to mitigate side effects. GU achieves consistent improvement on various methods across three benchmarks: TOFU, MUSE, and WMDP.
comment: 27 Pages
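The decomposition described in the abstract above (keep only the part of a forget update that is orthogonal to the span of the retain gradients) can be sketched with plain Gram-Schmidt projection. This is a minimal sketch under stated assumptions, not the authors' code: gradients are flattened into small Python lists, and the helper names `orthonormal_basis` and `retain_invariant_update` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis

def retain_invariant_update(forget_grad, retain_grads):
    """Remove the tangential component (inside the retain-gradient subspace)
    and return the normal component of the forget update."""
    normal = list(forget_grad)
    for b in orthonormal_basis(retain_grads):
        c = dot(normal, b)
        normal = [ni - c * bi for ni, bi in zip(normal, b)]
    return normal

# Toy example with made-up 3-dimensional gradients.
retain_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
forget_grad = [2.0, 3.0, 4.0]
update = retain_invariant_update(forget_grad, retain_grads)
print(update)
```

Because the returned direction has zero inner product with every retain gradient, the retain loss is unchanged to first order along it, which is the "retain-invariant" condition stated in the abstract.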
MUCH: A Multilingual Claim Hallucination Benchmark
Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.
An Efficient Computational Framework for Discrete Fuzzy Numbers Based on Total Orders
Discrete fuzzy numbers, and in particular those defined over a finite chain $L_n = \{0, \ldots, n\}$, have been effectively employed to represent linguistic information within the framework of fuzzy systems. Research on total (admissible) orderings of such types of fuzzy subsets, and specifically those belonging to the set $\mathcal{D}_1^{L_n\rightarrow Y_m}$ consisting of discrete fuzzy numbers $A$ whose support is a closed subinterval of the finite chain $L_n = \{0, 1, \ldots, n\}$ and whose membership values $A(x)$, for $x \in L_n$, belong to the set $Y_m = \{ 0 = y_1 < y_2 < \cdots < y_{m-1} < y_m = 1 \}$, has facilitated the development of new methods for constructing logical connectives, based on a bijective function, called $\textit{pos function}$, that determines the position of each $A \in \mathcal{D}_1^{L_n\rightarrow Y_m}$. For this reason, in this work we revisit the problem by introducing algorithms that exploit the combinatorial structure of total (admissible) orders to compute the $\textit{pos}$ function and its inverse with exactness. The proposed approach achieves a complexity of $\mathcal{O}(n^{2} m \log n)$, which is quadratic in the size of the underlying chain ($n$) and linear in the number of membership levels ($m$). The key point is that the dominant factor is $m$, ensuring scalability with respect to the granularity of membership values. The results demonstrate that this formulation substantially reduces computational cost and enables the efficient implementation of algebraic operations -- such as aggregation and implication -- on the set of discrete fuzzy numbers.
comment: 19 pages, 2 figures. Submitted to Computational and Applied Mathematics (Springer)
Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability -- Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
comment: 16 pages, 2 figures
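For readers unfamiliar with the QWK figure quoted in the abstract above, quadratic weighted kappa measures agreement between two sets of ordinal scores, penalizing disagreements by the squared distance between labels. The sketch below is a generic textbook computation, not code from the paper; the function name and the toy score lists are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """QWK for integer labels in [0, num_labels); assumes num_labels >= 2
    and that the two raters do not both use a single identical label."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels))
              for j in range(num_labels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independent marginals.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - weighted_observed / weighted_expected

# Made-up human and model scores on a 0-2 scale.
human_scores = [0, 1, 2, 2, 1, 0]
model_scores = [0, 1, 2, 1, 1, 0]
qwk = quadratic_weighted_kappa(human_scores, model_scores, num_labels=3)
print(round(qwk, 3))
```

A value of 1 indicates perfect agreement and 0 indicates chance-level agreement, so a gap of 0.06 QWK is a small difference on this scale.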
Do Vision-Language Models Understand Visual Persuasiveness? NeurIPS 2025
Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet, whether these models truly grasp visual persuasion -- how visual cues shape human attitudes and decisions -- remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs' core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.
comment: 8 pages (except for reference and appendix), 5 figures, 7 tables, to be published in NeurIPS 2025 Workshop: VLM4RWD
Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction: A Case Study on Hunan's Historical Celebrities
Large language models and knowledge graphs offer strong potential for advancing research on historical culture by supporting the extraction, analysis, and interpretation of cultural heritage. Using Hunan's modern historical celebrities shaped by Huxiang culture as a case study, pre-trained large models can help researchers efficiently extract key information, including biographical attributes, life events, and social relationships, from textual sources and construct structured knowledge graphs. However, systematic data resources for Hunan's historical celebrities remain limited, and general-purpose models often underperform in domain knowledge extraction and structured output generation in such low-resource settings. To address these issues, this study proposes a supervised fine-tuning approach for enhancing domain-specific information extraction. First, we design a fine-grained, schema-guided instruction template tailored to the Hunan historical celebrities domain and build an instruction-tuning dataset to mitigate the lack of domain-specific training corpora. Second, we apply parameter-efficient instruction fine-tuning to four publicly available large language models - Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct - and develop evaluation criteria for assessing their extraction performance. Experimental results show that all models exhibit substantial performance gains after fine-tuning. Among them, Qwen3-8B achieves the strongest results, reaching a score of 89.3866 with 100 samples and 50 training iterations. This study provides new insights into fine-tuning vertical large language models for regional historical and cultural domains and highlights their potential for cost-effective applications in cultural heritage knowledge extraction and knowledge graph construction.
Vision Language Models are Confused Tourists
Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
ARQUSUMM: Argument-aware Quantitative Summarization of Online Conversations AAAI2026
Online conversations have become more prevalent on public discussion platforms (e.g. Reddit). With growing controversial topics, it is desirable to summarize not only diverse arguments, but also their rationale and justification. Early studies on text summarization focus on capturing general salient information in source documents, overlooking the argumentative nature of online conversations. Recent research on conversation summarization, although it considers the argumentative relationship among sentences, fails to explicate the deeper argument structure within sentences for summarization. In this paper, we propose a novel task of argument-aware quantitative summarization to reveal the claim-reason structure of arguments in conversations, with quantities measuring argument strength. We further propose ARQUSUMM, a novel framework to address the task. To reveal the underlying argument structure within sentences, ARQUSUMM leverages LLM few-shot learning grounded in argumentation theory to identify propositions within sentences and their claim-reason relationships. For quantitative summarization, ARQUSUMM employs argument structure-aware clustering algorithms to aggregate arguments and quantify their support. Experiments show that ARQUSUMM outperforms existing conversation and quantitative summarization models and generates summaries representing argument structures that are more helpful to users and of high textual quality and quantification accuracy.
comment: Paper accepted to AAAI2026 Main Technical Track
OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and collaborative endeavor. Real-world science relies on a complex scientific infrastructure composed of collaborative mechanisms, contribution attribution, peer review, and structured scientific knowledge networks. Due to the lack of modeling for these critical dimensions, current systems struggle to establish a genuine research ecosystem or interact deeply with the human scientific community. To bridge this gap, we introduce OmniScientist, a framework that explicitly encodes the underlying mechanisms of human research into the AI scientific workflow. OmniScientist not only achieves end-to-end automation across data foundation, literature review, research ideation, experiment automation, scientific writing, and peer review, but also provides comprehensive infrastructural support by simulating the human scientific system, comprising: (1) a structured knowledge system built upon citation networks and conceptual correlations; (2) a collaborative research protocol (OSP), which enables seamless multi-agent collaboration and human researcher participation; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. This infrastructure empowers agents to not only comprehend and leverage human knowledge systems but also to collaborate and co-evolve, fostering a sustainable and scalable innovation ecosystem.
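The abstract above mentions Elo rankings computed from blind pairwise votes. As a rough illustration only, the standard Elo update can be sketched as follows; the function name `elo_update`, the K-factor of 32, and the starting ratings are assumptions for this sketch and are not taken from the paper or from ScienceArena.

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie.
    The K-factor of 32 is an assumed default, not a value from the paper.
    """
    # Expected score of A under the standard logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * (score_b - expected_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated entries; a voter prefers the first one.
    print(elo_update(1500.0, 1500.0, 1.0))
```

Because the update is zero-sum, the total rating across both entries is preserved after each vote.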
Predicting the Formation of Induction Heads NeurIPS
Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.
comment: Accepted to CogInterp @ NeurIPS
Deep Improvement Supervision
Recently, it was shown that small, looped architectures, such as Tiny Recursive Models (TRMs), can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and an implicit policy improvement algorithm. Building on these insights, we propose a novel training scheme that provides a target for each loop during training. We demonstrate that our approach significantly enhances training efficiency. Our method reduces the total number of forward passes by 18x and eliminates halting mechanisms, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.
Improving Latent Reasoning in LLMs via Soft Concept Mixing
Unlike human reasoning in abstract conceptual spaces, large language models (LLMs) typically reason by generating discrete tokens, which potentially limits their expressive power. The recent work Soft Thinking has shown that LLMs' latent reasoning via soft concepts is a promising direction, but LLMs are trained on discrete tokens. To reduce this gap between the soft concepts in reasoning and the discrete tokens in training, we propose Soft Concept Mixing (SCM), a soft-concept-aware training scheme that directly exposes the model to soft representations during training. Specifically, SCM constructs a soft concept vector by forming a probability-weighted average of embeddings. Then, this vector is mixed into the model's hidden states, which embody rich contextual information. Finally, the entire latent reasoning process is optimized with Reinforcement Learning (RL). Experiments on five reasoning benchmarks demonstrate that SCM improves the reasoning performance of LLMs while maintaining stable training dynamics.
comment: 7 pages, 3 figures
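The probability-weighted average described in the abstract above can be illustrated with a small pure-Python sketch. The abstract does not say how the vector is mixed into the hidden states, so the convex combination in `mix_into_hidden` and its `weight` parameter are assumptions made purely for illustration; all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_concept_vector(probs, embeddings):
    """Probability-weighted average of token embeddings.

    probs: next-token probabilities, one per vocabulary entry.
    embeddings: one embedding (list of floats) per vocabulary entry.
    """
    dim = len(embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, embeddings))
        for d in range(dim)
    ]


def mix_into_hidden(hidden, concept, weight=0.5):
    """Assumed mixing rule (not from the paper): a convex combination."""
    return [(1.0 - weight) * h + weight * c for h, c in zip(hidden, concept)]


if __name__ == "__main__":
    # Toy vocabulary of two tokens with 2-dimensional embeddings.
    probs = softmax([0.0, 0.0])
    concept = soft_concept_vector(probs, [[1.0, 0.0], [0.0, 1.0]])
    print(concept, mix_into_hidden([1.0, 1.0], concept))
```

With a one-hot probability vector the soft concept reduces to an ordinary token embedding, which is one way to see it as a relaxation of discrete decoding.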
Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings
Reinforcement learning (RL) has been proven to be an effective and robust method for training neural machine translation systems, especially when paired with powerful reward models that accurately assess translation quality. However, most research has focused on RL methods that use sentence-level feedback, leading to inefficient learning signals due to the reward sparsity problem -- the model receives a single score for the entire sentence. To address this, we propose a novel approach that leverages fine-grained, token-level quality assessments along with error severity levels using RL methods. Specifically, we use xCOMET, a state-of-the-art quality estimation system, as our token-level reward model. We conduct experiments on small and large translation datasets with standard encoder-decoder and large language models-based machine translation systems, comparing the impact of sentence-level versus fine-grained reward signals on translation quality. Our results show that training with token-level rewards improves translation quality across language pairs over baselines according to both automatic and human evaluation. Furthermore, token-level reward optimization improves training stability, evidenced by a steady increase in mean rewards over training epochs.
Do LLMs produce texts with "human-like" lexical diversity?
The degree to which large language models (LLMs) produce writing that is truly human-like remains unclear despite the extensive empirical attention that this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (ChatGPT-3.5, ChatGPT-4, ChatGPT-o4 mini, and ChatGPT-4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAs, and Support Vector Machines revealed that the ChatGPT-generated texts differed significantly from human-written texts for each variable, with ChatGPT-o4 mini and ChatGPT-4.5 differing the most. Within these two groups, ChatGPT-4.5 demonstrated higher levels of lexical diversity than older models despite producing fewer tokens. The human writers' lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that ChatGPT models do not produce human-like texts in relation to lexical diversity, and the newer models produce less human-like text than older models. We discuss the implications of these results for language pedagogy and related applications.
AI use in American newspapers is widespread, uneven, and rarely disclosed
AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from the Washington Post, the New York Times, and the Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
Concise Reasoning via Reinforcement Learning
A major drawback of reasoning models is their excessive token usage, inflating computational cost, resource demand, and latency. We show this verbosity stems not from deeper reasoning but from reinforcement learning loss minimization when models produce incorrect answers. With unsolvable problems dominating training, this effect compounds into a systematic tendency toward longer outputs. Through theoretical analysis of PPO and GRPO, we prove that incorrect answers inherently drive policies toward verbosity even when $γ=1$, reframing response lengthening as an optimization artifact. We further uncover a consistent correlation between conciseness and correctness across reasoning and non-reasoning models. Building on these insights, we propose a two-phase RL procedure where a brief secondary stage, trained on a small set of solvable problems, significantly reduces response length while preserving or improving accuracy. Finally, we show that while GRPO shares properties with PPO, it exhibits collapse modes, limiting its reliability for concise reasoning. Our claims are supported by extensive experiments.
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are themselves software that can be further refined or modified, researchers have recently proposed a number of self-improving software agents, including the Darwin-Gödel Machine (DGM). However, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on the fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Messages from Parler
We examined how online hate is motivated by receiving social approval via Walther's (2024) social approval theory of online hate, which argues that (H1a) more signals of social approval on hate messages predict more subsequent hate messages, and (H1b) as social approval increases, hate speech becomes more extreme. Using 110 million messages from Parler (2018-2021), we observed that the number of upvotes received on a hate speech post was unassociated with hate speech in one's next post and during the next month, three months, and six months. The number of upvotes received on (extreme) hate speech comments, however, was positively associated with (extreme) hate speech during the next week, month, three months, and six months. Between-person effects revealed an average positive relationship between social approval and hate speech production at all time intervals. For comments, social approval linked more strongly to online hate than social disapproval. Social approval is a critical mechanism facilitating online hate propagation.
Fairness Evaluation of Large Language Models in Academic Library Reference Services
As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries' commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We find no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrate nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA through DSPy to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
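The abstract above reports agreement with clinicians as Cohen's $κ$. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard unweighted definition, $κ = (p_o - p_e) / (1 - p_e)$; the function name and the toy labels are illustrative assumptions and do not reproduce the paper's data or evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters' categorical labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels using the three impact categories named in the abstract.
    clinician = ["No", "No", "Minimal", "Significant"]
    judge = ["No", "Minimal", "Minimal", "Significant"]
    print(cohens_kappa(clinician, judge))
```

Unlike raw accuracy, the statistic discounts agreement that would be expected by chance given each rater's label frequencies.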
LLM one-shot style transfer for Authorship Attribution and Verification
Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation, enabling flexible trade-offs between computational cost and accuracy.
Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
comment: The original submission contained metadata errors and requires correction. A revised and complete version will be submitted as a replacement
Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition
With the proliferation of social media posts in recent years, the need to detect sentiments in multimodal (image-text) content has grown rapidly. Since posts are user-generated, the image and text from the same post can express different or even contradictory sentiments, leading to potential sentiment discrepancy. However, existing works mainly adopt a single-branch fusion structure that primarily captures the consistent sentiment between image and text. Ignoring discrepant sentiment, or modeling it only implicitly, results in compromised unimodal encoding and limited performance. In this paper, we propose a semantics Completion and Decomposition (CoDe) network to resolve the above issue. In the semantics completion module, we complement image and text representations with the semantics of the in-image text, helping bridge the sentiment gap. In the semantics decomposition module, we decompose image and text representations with exclusive projection and contrastive learning, thereby explicitly capturing the discrepant sentiment between modalities. Finally, we fuse image and text representations by cross-attention and combine them with the learned discrepant sentiment for final classification. Extensive experiments on four datasets demonstrate the superiority of CoDe and the effectiveness of each proposed module.
comment: Accepted by Pattern Recognition
When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations -- superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.
DiffTester: Accelerating Unit Test Generation for Diffusion LLMs via Repetitive Pattern
Software development relies heavily on extensive unit testing, which makes the efficiency of automated Unit Test Generation (UTG) particularly important. However, most existing LLMs generate test cases one token at a time in each forward pass, which leads to inefficient UTG. Recently, diffusion LLMs (dLLMs) have emerged, offering promising parallel generation capabilities and showing strong potential for efficient UTG. Despite this advantage, their application to UTG is still constrained by a clear trade-off between efficiency and test quality, since increasing the number of tokens generated in each step often causes a sharp decline in the quality of test cases. To overcome this limitation, we present DiffTester, an acceleration framework specifically tailored for dLLMs in UTG. The key idea of DiffTester is that unit tests targeting the same focal method often share repetitive structural patterns. By dynamically identifying these common patterns through abstract syntax tree analysis during generation, DiffTester adaptively increases the number of tokens produced at each step without compromising the quality of the output. To enable comprehensive evaluation, we extend the original TestEval benchmark, which was limited to Python, by introducing additional programming languages including Java and C++. Extensive experiments on three benchmarks with two representative models show that DiffTester delivers significant acceleration while preserving test coverage. Moreover, DiffTester generalizes well across different dLLMs and programming languages, providing a practical and scalable solution for efficient UTG in software development. Code and data are publicly available at https://github.com/wellbeingyang/DLM4UTG-open .
comment: Update reference
MiniLLM: Knowledge Distillation of Large Language Models ICLR 2024
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or to training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, and it becomes more important with the rise of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective on-policy optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/minillm.
comment: Published as a conference paper in ICLR 2024
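The distinction between forward and reverse KLD in the abstract above can be made concrete on small discrete distributions. The sketch below is illustrative only and is not the MiniLLM training objective or code: the function names, the `eps` floor, and the toy distributions are assumptions. Forward KLD is taken as KL(teacher || student) and reverse KLD as KL(student || teacher).

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an assumed safeguard) to avoid division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def forward_kl(teacher, student):
    """KL(teacher || student): the usual objective in standard KD."""
    return kl_divergence(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): weighted by where the student puts mass."""
    return kl_divergence(student, teacher)


if __name__ == "__main__":
    teacher = [0.5, 0.5]
    student = [0.9, 0.1]
    # The two directions generally give different values.
    print(forward_kl(teacher, student), reverse_kl(teacher, student))
```

Because reverse KLD is an expectation under the student, any mass the student places where the teacher's probability is near zero is penalized heavily, which matches the motivation given in the abstract.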
Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties
Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To solve this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly released our datasets, experimental results and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
comment: Accepted at KCAP 2025
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems EMNLP 2025
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
comment: Accepted to EMNLP 2025
A systematic review of relation extraction task since the emergence of Transformers
This article presents a systematic review of relation extraction (RE) research since the advent of Transformer-based models. Using an automated framework to collect and annotate publications, we analyze 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. The review highlights methodological advances, benchmark resources, and the integration of semantic web technologies. By consolidating results across multiple dimensions, the study identifies current trends, limitations, and open challenges, offering researchers and practitioners a comprehensive reference for understanding the evolution and future directions of RE.
comment: Submitted to ACM Computing Surveys + The resulting annotated Zotero bibliography: https://www.zotero.org/groups/6070963/scilex_re_systlitreview/library + SciLEx software: https://github.com/Wimmics/SciLEx
Emergence of psychopathological computations in large language models
Can large language models (LLMs) instantiate computations of psychopathology? An effective approach to the question hinges on addressing two factors. First, for conceptual validity, we require a general and computational account of psychopathology that is applicable to computational entities without biological embodiment or subjective experience. Second, psychopathological computations, derived from the adapted theory, need to be empirically identified within the LLM's internal processing. Thus, we establish a computational-theoretical framework to provide an account of psychopathology applicable to LLMs. Based on the framework, we conduct experiments demonstrating two key claims: first, that the computational structure of psychopathology exists in LLMs; and second, that executing this computational structure results in psychopathological functions. We further observe that as LLM size increases, the computational structure of psychopathology becomes denser and the functions become more effective. Taken together, the empirical results corroborate our hypothesis that network-theoretic computations of psychopathology have already emerged in LLMs. This suggests that certain LLM behaviors mirroring psychopathology may not be superficial mimicry but a feature of their internal processing. Our work shows the promise of developing a powerful new in silico model of psychopathology and also points to the possibility of safety threats from AI systems with psychopathological behaviors in the near future.
comment: pre-print
Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
comment: In submission to JAMIA
ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
comment: Our code and data are available at https://github.com/bwookwak/ToolHaystack ; edited to add an acknowledgement section
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
comment: Survey; 7 figures, 3 tables, 44 pages
The Rise of Parameter Specialization for Knowledge Storage in Large Language Models NeurIPS 2025
Over time, a growing wave of large language models from various series has been introduced to the community. Researchers are striving to maximize the performance of language models with constrained parameter sizes. However, from a microscopic perspective, there has been limited research on how to better store knowledge in model parameters, particularly within MLPs, to enable more effective utilization of this knowledge by the model. In this work, we analyze twenty publicly available open-source large language models to investigate the relationship between their strong performance and the way knowledge is stored in their corresponding MLP parameters. Our findings reveal that as language models become more advanced and demonstrate stronger knowledge capabilities, their parameters exhibit increased specialization. Specifically, parameters in the MLPs tend to be more focused on encoding similar types of knowledge. We experimentally validate that this specialized distribution of knowledge contributes to improving the efficiency of knowledge utilization in these models. Furthermore, by conducting causal training experiments, we confirm that this specialized knowledge distribution plays a critical role in improving the model's efficiency in leveraging stored knowledge.
comment: Accepted in NeurIPS 2025
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Existing jailbreak attacks largely rely on single-turn or multi-turn prompt manipulations, or inject static in-context examples; these methods suffer from limited effectiveness, inefficiency, or semantic drift. We introduce Response Attack (RA), a novel framework that strategically leverages intermediate, mildly harmful responses as contextual primers within a dialogue. By reformulating harmful queries and injecting these intermediate responses before issuing a targeted trigger prompt, RA exploits a previously overlooked vulnerability in LLMs. Extensive experiments across eight state-of-the-art LLMs show that RA consistently achieves significantly higher attack success rates than nine leading jailbreak baselines. Our results demonstrate that the success of RA is directly attributable to the strategic use of intermediate responses, which induce models to generate more explicit and relevant harmful content while maintaining stealth, efficiency, and fidelity to the original query. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
comment: 20 pages, 10 figures. Code and data available at https://github.com/Dtc7w3PQ/Response-Attack
SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
As Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data, they face a critical privacy challenge: while prior work has addressed output-level privacy, recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. These leaky thoughts occur when models inadvertently expose sensitive details in their reasoning traces, even when final outputs appear safe. The challenge lies in preventing such leakage without compromising the model's reasoning capabilities, requiring a delicate balance between privacy and utility. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in a model's Chain of Thought (CoT) by injecting targeted steering vectors into its hidden states. We identify the high-leakage layers responsible for this behavior. Through experiments across multiple LLMs, we demonstrate that SALT reduces CPL by $18.2\%$ on QwQ-32B, $17.9\%$ on Llama-3.1-8B, and $31.2\%$ on Deepseek on the contextual privacy leakage dataset AirGapAgent-R while maintaining comparable task performance and utility. Our work establishes SALT as a practical approach for test-time privacy protection in reasoning-capable language models, offering a path toward safer deployment of LLM-based personal agents.
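The idea of injecting a steering vector into hidden states, as described in the SALT abstract above, can be sketched generically with NumPy. This is an illustration of activation steering in general, not SALT's implementation: the difference-of-means construction of the vector, the scaling factor `alpha`, and both function names are assumptions.

```python
import numpy as np


def difference_of_means_vector(leaky_acts, safe_acts):
    """Hypothetical steering direction: mean safe activation minus mean leaky one.

    Both inputs have shape (num_examples, hidden_dim). This construction is a
    common choice in activation steering, assumed here for illustration.
    """
    leaky = np.asarray(leaky_acts, dtype=float)
    safe = np.asarray(safe_acts, dtype=float)
    return safe.mean(axis=0) - leaky.mean(axis=0)


def steer_hidden_state(hidden, steering_vector, alpha=1.0):
    """Add alpha * steering_vector to every position of a hidden-state array.

    hidden has shape (..., hidden_dim); the vector is broadcast over the
    leading dimensions (for example, sequence positions).
    """
    hidden = np.asarray(hidden, dtype=float)
    vector = np.asarray(steering_vector, dtype=float)
    if hidden.shape[-1] != vector.shape[-1]:
        raise ValueError("hidden_dim of hidden and steering_vector must match")
    return hidden + alpha * vector


v = difference_of_means_vector([[1.0, 1.0], [3.0, 3.0]], [[0.0, 0.0], [2.0, 2.0]])
print(steer_hidden_state(np.zeros((3, 2)), v, alpha=0.5))
```

In a real model the addition would be applied inside the forward pass at the chosen layers; which layers to target and how strongly to steer are exactly the design questions the abstract refers to.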
AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs
This paper examines how domain specificity affects abstractive summarisation of Arabic financial texts using large language models (LLMs). We present AraFinNews, the largest publicly available Arabic financial news dataset to date, comprising 212,500 article-headline pairs spanning almost a decade of reporting from October 2015 to July 2025. Developed as an Arabic counterpart to major English summarisation corpora such as CNN/DailyMail, AraFinNews offers a strong benchmark for assessing domain-focused language understanding and generation in financial contexts. Using this resource, we evaluate transformer-based models, including mT5, AraT5 and the domain-adapted FinAraT5, to investigate how financial-domain pretraining influences accuracy, numerical reliability and stylistic alignment with professional reporting. The results show that domain-adapted models produce more coherent summaries, particularly when handling quantitative and entity-centred information. These findings underscore the value of domain-specific adaptation for improving narrative fluency in Arabic financial summarisation. The dataset is freely available for non-commercial research at https://github.com/ArabicNLP-UK/AraFinNews.
comment: 9 pages
Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL with GRPO
Current Text-to-SQL methods focus on, and are evaluated only on, executable queries, overlooking the semantic alignment challenge -- both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments show that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) -- all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
comment: 20th International Workshop on Semantic and Social Media Adaptation & Personalization
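The combination of GRPO with a similarity-based reward described in the Text-to-SQL abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's reward: the additive form `execution + weight * similarity`, the `weight` value, and all function names are hypothetical. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual description of GRPO.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_reward(executes_correctly, similarity, weight=0.5):
    """Hypothetical reward: 1 for a correct execution plus a weighted similarity term."""
    return (1.0 if executes_correctly else 0.0) + weight * similarity


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions.

    Uses the population standard deviation; implementations differ on
    population versus sample deviation, so treat this as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled SQL candidates for one question: two execute correctly.
rewards = [combined_reward(ok, 0.0) for ok in (True, False, True, False)]
print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the other samples in the same group, no separate value network is needed; a group whose samples all receive the same reward contributes zero advantage.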
EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems
Large language models have improved dialogue systems, but often process conversational turns in isolation, overlooking the event structures that guide natural interactions. Hence we introduce EventWeave, a framework that explicitly models relationships between conversational events to generate more contextually appropriate dialogue responses. EventWeave constructs a dynamic event graph that distinguishes between core events (main goals) and supporting events (interconnected details), employing a multi-head attention mechanism to selectively determine which events are most relevant to the current turn. Unlike summarization or standard graph-based approaches, our method captures three distinct relationship types between events, allowing for more nuanced context modeling. Experiments on three dialogue datasets demonstrate that EventWeave produces more natural and contextually appropriate responses while requiring less computational overhead than models processing the entire dialogue history. Ablation studies confirm improvements stem from better event relationship modeling rather than increased information density. Our approach effectively balances comprehensive context understanding with generating concise responses, maintaining strong performance across various dialogue lengths through targeted optimization techniques.
RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley--Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B--20B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.
Task-Aligned Tool Recommendation for Large Language Models AACL 2025
By augmenting Large Language Models (LLMs) with external tools, their capacity to solve complex problems has been significantly enhanced. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools for different tasks could be different, resulting in inefficiencies such as redundant or unsuitable tools, which impede immediate access to the most relevant tools. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.
comment: IJCNLP-AACL 2025 Main
Human-Computer Interaction
GRAPHIC--Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity
Artificial Intelligence (AI) has been increasingly applied to creative domains, leading to the development of systems that collaborate with humans in design processes. In Graphic Design, integrating computational systems into co-creative workflows presents specific challenges, as it requires balancing scientific rigour with the subjective and visual nature of design practice. Following the PRISMA methodology, we identified 872 articles, resulting in a final corpus of 71 publications describing 68 unique systems. Based on this review, we introduce GRAPHIC (Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity), a framework for analysing AI-based systems applied to Graphic Design. Its goal is to understand how current systems support human-AI collaboration in the Graphic Design discipline. The framework comprises three main dimensions, which our analysis revealed to be essential across diverse system types: (1) Collaborative Panorama, (2) Processes and Modalities, and (3) Graphic Design Principles. Its application revealed research gaps, including the need to balance initiative and control between agents, improve communication through explainable interaction models, and promote systems that support transformational creativity grounded in core design principles.
comment: 20 pages, 16 figures
Feasibility of Embodied Dynamics Based Bayesian Learning for Continuous Pursuit Motion Control of Assistive Mobile Robots in the Built Environment
Non-invasive electroencephalography (EEG)-based brain-computer interfaces (BCIs) offer an intuitive means for individuals with severe motor impairments to independently operate assistive robotic wheelchairs and navigate built environments. Despite considerable progress in BCI research, most current motion control systems are limited to discrete commands, rather than supporting continuous pursuit, where users can freely adjust speed and direction in real time. Such natural mobility control is, however, essential for wheelchair users to navigate complex public spaces, such as transit stations, airports, hospitals, and indoor corridors, to interact socially and with agility among dynamic populations, and to move flexibly and comfortably as autonomous driving is refined to allow movement at will. In this study, we address the gap of continuous pursuit motion control in BCIs by proposing and validating a brain-inspired Bayesian inference framework, where embodied dynamics in acceleration-based motor representations are decoded. This approach contrasts with conventional kinematics-level decoding and deep learning-based methods. Using a public dataset with sixteen hours of EEG from four subjects performing motor imagery-based target-following, we demonstrate that our method, utilizing Automatic Relevance Determination for feature selection and continual online learning, reduces the normalized mean squared error between predicted and true velocities by 72% compared to autoregressive and EEGNet-based methods in a session-accumulative transfer learning setting. Theoretically, these findings empirically support embodied cognition theory and reveal the embodied and predictive nature of the brain's intrinsic motor control dynamics. Practically, grounding EEG decoding in the same dynamical principles that govern biological motion offers a promising path toward more stable and intuitive BCI control.
comment: 37 pages, 9 figures, and 7 tables
AI Workers, Geopolitics, and Algorithmic Collective Action
According to the theory of International Political Economy (IPE), states are often incentivized to rely on rather than constrain powerful corporations. For this reason, IPE provides a useful lens to explain why efforts to govern Artificial Intelligence (AI) at the international and national levels have thus far been developed, applied, and enforced unevenly. Building on recent work that explores how AI companies engage in geopolitics, this position paper argues that some AI workers can be considered actors of geopolitics. It makes the timely case that governance alone cannot ensure responsible, ethical, or robust AI development and use, and greater attention should be paid to bottom-up interventions at the site of AI development. AI workers themselves should be situated as individual agents of change, especially when considering their potential to foster Algorithmic Collective Action (ACA). Drawing on methods of Participatory Design (PD), this paper proposes engaging AI workers as sources of knowledge, relative power, and intentionality to encourage more responsible and just AI development and create the conditions that can facilitate ACA.
Mixed Reality Scenic Live Streaming for Cultural Heritage: Visual Interactions in a Historic Landscape
Scenic Live Streams (SLS), capturing real-world scenic sites from fixed cameras without streamers, have gained increasing popularity recently. They afford unique real-time lenses into remote sites for viewers' synchronous and collective engagement. Foregrounding its lack of dynamism and interactivity, we aim to maximize the potential of SLS by making it interactive. Our approach, MRSLS, overlays plain SLS with interactive Mixed Reality content that matches the site's geographical structures and local cultural backgrounds. We further highlight the substantial benefit of MRSLS to cultural heritage site interactions, and we demonstrate this design proposal with an MRSLS prototype at a UNESCO-listed heritage site in China. The design process includes an interview (N=6) to pinpoint local scenery and culture, as well as two iterative design studies (N=15, 14). A mixed-methods, between-subjects study (N=43, 37) shows that MRSLS affords immersive scenery appreciation, effective cultural imprints, and vivid shared experience. With its balance between cultural, participatory, and authentic attributes, we call for more HCI attention to (MR)SLS as an under-explored design space.
comment: 14 pages, 6 figures, to be published in the Proceedings of the International Conference on Human-Engaged Computing (ICHEC '25), November 21--23, 2025, Singapore
Senti-iFusion: An Integrity-centered Hierarchical Fusion Framework for Multimodal Sentiment Analysis under Uncertain Modality Missingness
Multimodal Sentiment Analysis (MSA) is critical for human-computer interaction but faces challenges when the modalities are incomplete or missing. Existing methods often assume pre-defined missing modalities or fixed missing rates, limiting their real-world applicability. To address this challenge, we propose Senti-iFusion, an integrity-centered hierarchical fusion framework capable of handling both inter- and intra-modality missingness simultaneously. It comprises three hierarchical components: Integrity Estimation, Integrity-weighted Completion, and Integrity-guided Fusion. First, the Integrity Estimation module predicts the completeness of each modality and mitigates the noise caused by incomplete data. Second, the Integrity-weighted Cross-modal Completion module employs a novel weighting mechanism to disentangle consistent semantic structures from modality-specific representations, enabling the precise recovery of sentiment-related features across language, acoustic, and visual modalities. A dual-depth validation with semantic- and feature-level losses ensures consistent reconstruction at both fine-grained (low-level) and semantic (high-level) scales. Finally, the Integrity-guided Adaptive Fusion mechanism dynamically selects the dominant modality for attention-based fusion, ensuring that the most reliable modality, based on completeness and quality, contributes more significantly to the final prediction. Senti-iFusion employs a progressive training approach to ensure stable convergence. Experimental results on popular MSA datasets demonstrate that Senti-iFusion outperforms existing methods, particularly in fine-grained sentiment analysis tasks. The code and our proposed Senti-iFusion model will be publicly available.
The Wireless Charger as a Gesture Sensor: A Novel Approach to Ubiquitous Interaction
Advancements in information technology have increased demand for natural human-computer interaction in areas such as gaming, smart homes, and vehicles. However, conventional approaches like physical buttons or cameras are often limited by contact requirements, privacy concerns, and high costs. Motivated by the observation that the electromagnetic (EM) signals from Qi wireless chargers are not only strong and measurable but also rich in gesture-related information, we propose EMGesture, a novel contactless interaction technique that leverages these signals for gesture recognition. EMGesture analyzes the distinctive EM features and employs a robust classification model. The end-to-end framework enables it to accurately interpret user intent. Experiments involving 30 participants, 10 mobile devices, and 5 chargers showed that EMGesture achieves over 97% recognition accuracy. Corresponding user studies also confirmed higher usability and convenience, demonstrating that EMGesture is a practical, privacy-conscious, and cost-effective solution for pervasive interaction.
comment: 21 pages, 17 figures
IsharaKotha: A Comprehensive Avatar-based Bangla Sign Language Corpus
Sign language is a vital communication medium for the hearing-impaired community, enabling effective interaction and self-expression. To help bridge the communication gap between hearing and hearing-impaired individuals, a text-to-sign translation system is essential. Such systems can also support learners interested in acquiring sign language skills. This work presents IsharaKotha, the first HamNoSys-based Bangla Sign Language corpus, containing 3823 words. A deep learning based lemmatizer was integrated to extract root words, enabling sign generation for complete sentences. An evaluation interface was developed to assess the quality of sign animations for letters, digits, and sentences. Two professional interpreters and one real sign language user rated the animations using categorical numeric scores. The system achieved an average rating of 3.14 out of 4.00, indicating high quality performance between Good and Excellent. These results demonstrate the potential of IsharaKotha to support future advancements in dynamic sign language translation systems. The evaluation system is available at http://bdsl-isharakotha.ap-1.evennode.com
comment: 20 pages
Digital Diasporas: How Origin Characteristics and Host-Native Distance Shape Immigrants' Online Cultural Retention
Immigrants bring unique cultural backgrounds to their host countries. Subsequent interplay of cultures can lead to either a melting pot, where immigrants adopt the dominant culture of the host country, or a mosaic, where distinct cultural identities coexist. The existing literature primarily focuses on the acculturation of immigrants, specifically the melting pot hypothesis. In contrast, we attempt to identify the antecedents of the mosaic hypothesis or factors that enhance (or diminish) the propensity for cultural retention among immigrants. Based on Facebook advertising data for immigrants from 8 countries residing in the USA, our findings suggest that greater host-native distance is linked to higher online cultural retention, and while origin country context is statistically significant, its impact is generally smaller.
comment: This paper will appear at ICWSM 2026. Please cite the peer-reviewed version
Chatbots to strengthen democracy: An interdisciplinary seminar to train identifying argumentation techniques of science denial
In recent times, discussions on social media platforms have increasingly come under scrutiny due to the proliferation of science denial and fake news. Traditional solutions, such as regulatory actions, have been implemented to mitigate the spread of misinformation; however, these measures alone are not sufficient. To complement these efforts, educational approaches are becoming essential in empowering users to critically engage with misinformation. Conversation training, through serious games or personalized methods, has emerged as a promising strategy to help users handle science denial and toxic conversation tactics. This paper suggests an interdisciplinary seminar to explore the suitability of Large Language Models (LLMs) acting as a persona of a science denier to support people in identifying misinformation and improving resilience against toxic interactions. In the seminar, groups of four to five students will develop an AI-based chatbot that enables realistic interactions with science-denial argumentation structures. The task involves planning the setting, integrating a Large Language Model to facilitate natural dialogues, implementing the chatbot using the RASA framework, and evaluating the outcomes in a user study. It is crucial that users understand what they need to do during the interaction, how to conclude it, and how the relevant information is conveyed. The seminar does not aim to develop chatbots for practicing debunking but serves to teach AI technologies and test the feasibility of this idea for future applications. The chatbot seminar is conducted as a hybrid, parallel master's module at the participating educational institutions.
comment: 6 pages, 4 figures
Biologically Inspired Predictive Coding TCN-Transformer for Anticipatory Human-Robot Interaction in Shared Physical Spaces
As mobile robots increasingly operate in environments shared with humans, proactively anticipating human motion rather than responding reactively is critical for preempting collisions during close-proximity navigation, while maintaining mobility efficiency and avoiding unnecessary yields. A timely and motivating engineering application is how autonomous vehicles interpret ambiguous right-of-way such as unsignalized pedestrian crossings. To address this challenge, this study explores the feasibility of decoding preparatory neural activity from wearable electroencephalography (EEG) to predict human motion intention before it is behaviorally expressed. Drawing inspiration from biological predictive coding mechanisms between the sensorimotor cortex and insula-frontoparietal network, we implement this principle in a Temporal Convolutional Network-Transformer architecture to decode fast-evolving EEG signals underlying perception-action transitions. In experiments involving twelve participants simulating road-crossing decisions under varying traffic volume, marked crosswalks, and traffic signals, neurophysiological analyses reveal hemispheric asymmetries in functional specialization and identify high-beta oscillations (16-25 Hz) in the right fronto-central region (F4) as robust neural markers of motor readiness and decision commitment. Using sliding-window feature extraction, we benchmarked sixteen classification models across traditional, recurrent, and convolutional deep learning architectures, and found that our approach achieved the highest Area Under the Curve (AUC) of 0.727 with an approximate 1-second look-ahead. This work demonstrates how biologically grounded temporal architectures can enhance anticipatory intelligence in autonomous systems and represents the first step toward proactive and adaptive human-robot interaction in the built environment.
comment: 45 pages, 12 figures, 6 tables
SweeperBot: Making 3D Browsing Accessible through View Analysis and Visual Question Answering
Accessing 3D models remains challenging for Screen Reader (SR) users. While some existing 3D viewers allow creators to provide alternative text, they often lack sufficient detail about the 3D models. Grounded in a formative study, this paper introduces SweeperBot, a system that enables SR users to leverage visual question answering to explore and compare 3D models. SweeperBot answers SR users' visual questions by combining an optimal view selection technique with the strength of generative- and recognition-based foundation models. An expert review with 10 Blind and Low-Vision (BLV) users with SR experience demonstrated the feasibility of using SweeperBot to assist BLV users in exploring and comparing 3D models. The quality of the descriptions generated by SweeperBot was validated by a second survey study with 30 sighted participants.
comment: 28 pages, 16 figures, this is an original manuscript of an article published by Taylor & Francis in the International Journal of Human-Computer Interaction (IJHCI), available online: https://doi.org/10.1080/10447318.2025.2594750
Conversational AI increases political knowledge as effectively as self-directed internet search
Conversational AI systems are increasingly being used in place of traditional search engines to help users complete information-seeking tasks. This has raised concerns in the political domain, where biased or hallucinated outputs could misinform voters or distort public opinion. However, in spite of these concerns, the extent to which conversational AI is used for political information-seeking, as well as the potential impact of this use on users' political knowledge, remains uncertain. Here, we address these questions: First, in a representative national survey of the UK public (N = 2,499), we find that in the week before the 2024 election as many as 32% of chatbot users - and 13% of eligible UK voters - used conversational AI to seek political information relevant to their electoral choice. Second, in a series of randomised controlled trials (N = 2,858 total) we find that across issues, models, and prompting strategies, conversations with AI increase political knowledge (increase belief in true information and decrease belief in misinformation) to the same extent as self-directed internet search. Taken together, our results suggest that although people in the UK are increasingly turning to conversational AI for information about politics, this shift may not lead to increased public belief in political misinformation.
Final Happiness: What Intelligent User Interfaces Can Do for the lonely Dying
This study explores the design of Intelligent User Interfaces (IUIs) to address the profound existential loneliness of terminally ill individuals. While Human-Computer Interaction (HCI) has made inroads in "Thanatechnology," current research often focuses on practical aspects like digital legacy management, overlooking the subjective, existential needs of those facing death in isolation. To address this gap, we conducted in-depth qualitative interviews with 14 lonely, terminally ill individuals. Our core contributions are: (1) An empirically-grounded model articulating the complex psychological, practical, social, and spiritual needs of this group; (2) The "Three Pillars, Twelve Principles" framework for designing IUIs as "Existential Companions"; and (3) A critical design directive derived from user evaluations: technology in this context should aim for transcendence over simulation. The findings suggest that IUIs should create experiences that augment or surpass human capabilities, rather than attempting to simulate basic human connections, which can paradoxically deepen loneliness. This research provides a clear, user-centered path for designing technology that serves not as a "tool for dying," but as a "partner for living fully until the end".
A Longitudinal Study on the Attitudes of Gay Men in Beijing Towards Gay Social Media Platforms: Lonely Souls in the Digital Concrete Jungle
Over the past decade, specialized social networking applications have become a cornerstone of life for many gay men in China. This paper employs a longitudinal mixed-methods approach to investigate how Chinese men who have sex with men (MSM) have shifted their attitudes toward these platforms between approximately 2013 and 2023. Drawing on archival analysis of online discourses, a quantitative survey of 412 participants, and in-depth semi-structured interviews with 32 participants, we trace the complex trajectory of this evolution. Our findings reveal a clear pattern: from the initial embrace of these applications as revolutionary tools for community building and identity affirmation (2014--2017), to a period of growing ambivalence and critique centered on commercialization, ``hookup culture,'' and multiple forms of discrimination (2017--2020), and finally to the present era (2020--2023), characterized by pragmatic, fragmented, yet simultaneously critical and reconstructive uses. Today, users strategically employ a repertoire of applications -- including global platforms (e.g., Grindr and Tinder), domestic mainstream platforms (e.g., Blued), and niche alternatives (e.g., Aloha) -- to fulfill differentiated needs. We develop a detailed temporal framework to capture this attitudinal evolution and discuss its design implications for creating more supportive, secure, and community-oriented digital environments for marginalized groups.
Estimating LLM Consistency: A User Baseline vs Surrogate Metrics EMNLP 2025
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses -- the model's confidence in the response or likelihood of generating a similar response when resampled. In previous work, measuring LLM response consistency often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximated users' perceptions of consistency of LLM responses. To find out, we performed a user study ($n=2,976$) demonstrating that current methods for measuring LLM response consistency typically do not align well with humans' perceptions of LLM consistency. We propose a logit-based ensemble method for estimating LLM consistency and show that our method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods for estimating LLM consistency without human evaluation are sufficiently imperfect to warrant broader use of evaluation with human input; this would avoid misjudging the adequacy of models because of the imperfections of automated consistency metrics.
comment: Published as a main conference paper at EMNLP 2025
Reversing the Lens: Using Explainable AI to Understand Human Expertise
Both humans and machine learning models learn from experience, particularly in safety- and reliability-critical domains. While psychology seeks to understand human cognition, the field of Explainable AI (XAI) develops methods to interpret machine learning models. This study bridges these domains by applying computational tools from XAI to analyze human learning. We modeled human behavior during a complex real-world task -- tuning a particle accelerator -- by constructing graphs of operator subtasks. Applying techniques such as community detection and hierarchical clustering to archival operator data, we reveal how operators decompose the problem into simpler components and how these problem-solving structures evolve with expertise. Our findings illuminate how humans develop efficient strategies in the absence of globally optimal solutions, and demonstrate the utility of XAI-based methods for quantitatively studying human cognition.
Computer Vision and Pattern Recognition
Dataset Distillation for Pre-Trained Self-Supervised Vision Models NeurIPS 2025
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
comment: Accepted at NeurIPS 2025. Project page: https://linear-gradient-matching.github.io/ Code: https://github.com/GeorgeCazenavette/linear-gradient-matching
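The gradient-matching idea described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy features, and the choice of cosine distance as the similarity measure are all assumptions made for illustration, and the features stand in for the outputs of a frozen pre-trained extractor.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def linear_probe_grad(features, labels, W):
    """Gradient of mean softmax cross-entropy w.r.t. W (classes x dims)."""
    n_classes, n_dims = len(W), len(W[0])
    grad = [[0.0] * n_dims for _ in range(n_classes)]
    for f, y in zip(features, labels):
        logits = [sum(w * x for w, x in zip(row, f)) for row in W]
        p = softmax(logits)
        for c in range(n_classes):
            err = p[c] - (1.0 if c == y else 0.0)
            for d in range(n_dims):
                grad[c][d] += err * f[d] / len(features)
    return grad

def gradient_matching_loss(g_real, g_syn):
    """1 - cosine similarity of the flattened gradients (assumed metric)."""
    a = [v for row in g_real for v in row]
    b = [v for row in g_syn for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Hypothetical toy data: 2 classes, 2-dimensional frozen features.
W = [[0.1, -0.2], [0.0, 0.3]]
real_feats, real_labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
g_real = linear_probe_grad(real_feats, real_labels, W)
g_same = linear_probe_grad(real_feats, real_labels, W)
g_swapped = linear_probe_grad(real_feats, [1, 0], W)
```

In a full pipeline the loss would be backpropagated through the feature extractor into the synthetic pixels; this sketch only evaluates the loss for fixed features.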
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
comment: 9 Pages, 6 Figures, 4 Tables
NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses NeurIPS'25
We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).
comment: NeurIPS'25; project page: https://wenj.github.io/NoPo-Avatar/
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
comment: Project Page: https://think-while-gen.github.io Code: https://github.com/ZiyuGuo99/Thinking-while-Generating
Learning to Think Fast and Slow for Visual Language Models
When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
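The first stage described above, labeling data as fast or slow thinking from the model output length, can be sketched as follows. The abstract does not state how the cut-off is chosen, so the function name, the `threshold` parameter, and the median fallback are assumptions made purely for illustration.

```python
import statistics

def label_thinking_mode(output_lengths, threshold=None):
    """Label each sample 'fast' or 'slow' from its output length in tokens.

    `threshold` is a hypothetical cut-off; when omitted, the median
    length is used, which is an assumption and not the paper's rule.
    """
    if threshold is None:
        threshold = statistics.median(output_lengths)
    return ["slow" if n > threshold else "fast" for n in output_lengths]

# Hypothetical output lengths produced by a pre-trained VLM.
lengths = [20, 400, 35, 900]
labels = label_thinking_mode(lengths, threshold=100)
```

These labels would then accompany the samples in the second stage, where the model is trained with GRPO to develop dual-mode thinking.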
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
comment: Project page: https://video-as-answer.github.io/
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
comment: Project Page: https://oahzxl.github.io/VReasonBench
SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation NeurIPS 2025
Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.
comment: NeurIPS 2025 (Spotlight), Project Page: https://henghuiding.com/SceneDesigner/
TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing
With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.
comment: 8 pages, 10 figures, Under review at a conference
PartUV: Part-Based UV Unwrapping of 3D Meshes
UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.
comment: project page: https://www.zhaoningwang.com/PartUV
Solving Spatial Supersensing Without Spatial Supersensing
Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows that benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark, namely that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than through robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
comment: Tech Report
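The VSC-Repeat sanity check described in this abstract rests on a simple invariant: repeating a video does not change its set of unique objects. The sketch below illustrates that invariant on toy data. The frame representation, the function names, and the `naive_segment_count` counter are hypothetical; the counter is a deliberately simplistic stand-in for any method that assumes scenes are never revisited, not a reproduction of the Cambrian-S inference algorithm.

```python
def repeat_video(frames, times):
    """Concatenate the clip with itself `times` extra times (assumed reading)."""
    return frames * (times + 1)

def unique_object_count(frames):
    """Ground-truth count: distinct object ids seen anywhere in the clip."""
    return len({obj for frame in frames for obj in frame})

def naive_segment_count(frames, segment_len):
    """Hypothetical shortcut counter: sums per-segment counts, assuming
    that no segment ever revisits objects from an earlier one."""
    total = 0
    for i in range(0, len(frames), segment_len):
        total += unique_object_count(frames[i:i + segment_len])
    return total

# Toy clip: each frame is the set of object ids visible in it.
clip = [{"chair"}, {"chair", "lamp"}, {"sofa"}]
repeated = repeat_video(clip, times=2)

true_before = unique_object_count(clip)
true_after = unique_object_count(repeated)
naive_after = naive_segment_count(repeated, segment_len=3)
```

On this toy clip the true count stays at 3 after repetition, while the shortcut counter triples, which is the kind of failure the perturbation is designed to expose.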
Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation
Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
comment: Accepted at 2025 IEEE International Conference on Big Data (IEEE BigData 2025)
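One-shot global pruning followed by sparsity-aware retraining, as outlined in this abstract, can be sketched in a few lines. The abstract does not give the importance score, so the `|weight * gradient|` score used here is an assumption, as are the function names; `grads` stands in for gradients of a loss that includes the teacher's distillation signal.

```python
def global_prune_mask(weights, grads, sparsity):
    """Return a 0/1 mask that removes the lowest-scoring fraction of weights.

    Scores are |w * g| (assumed); ranking is global across all weights
    rather than per layer, and pruning happens in a single shot.
    """
    scores = [abs(w * g) for w, g in zip(weights, grads)]
    n_prune = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    mask = [1] * len(scores)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

def masked_sgd_step(weights, grads, mask, lr):
    """One retraining step that keeps pruned connections at zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

# Hypothetical flattened weights and teacher-informed gradients.
weights = [0.5, -0.1, 2.0, 0.05]
grads = [0.2, 0.3, -0.1, 1.0]
mask = global_prune_mask(weights, grads, sparsity=0.5)
updated = masked_sgd_step(weights, grads, mask, lr=0.1)
```

Multiplying by the mask after every update is what prevents pruned connections from being reactivated during retraining.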
Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision
3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite this progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which causes model performance to be dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance class-imbalanced segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.
TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming NeurIPS 2025
Recent 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose $\textbf{TRIM}$ ($\textbf{T}$rajectory $\textbf{R}$eduction and $\textbf{I}$nstance $\textbf{M}$ask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies to accelerate inference without compromising output quality while supporting inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at https://github.com/zeyuanyin/TRIM.
comment: NeurIPS 2025
SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
comment: 20 pages
SAM 3D: 3Dfy Anything in Images
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
comment: Website: https://ai.meta.com/sam3d/
SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$\&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
comment: 11 pages, 4 figures
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
comment: Project page: https://xuboshen.github.io/TimeViper
Green Resilience of Cyber-Physical Systems: Doctoral Dissertation
Cyber-physical systems (CPS) combine computational and physical components. An Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks
The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent. These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
POMA-3D: The Point Map Way to 3D Scene Understanding
In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
comment: 11 pages, 6 tables, 5 figures
NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening AAAI 2026
Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.
comment: Accepted in AAAI 2026 Special Track on AI for Social Impact
Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.
Progressive Supernet Training for Efficient Visual Autoregressive Modeling CVPR 2025
Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales use a subnet. The subnets and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between a subnet and the full network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves a 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
comment: Submitted to CVPR 2025. 10 pages, 7 figures
EOGS++: Earth Observation Gaussian Splatting with Internal Camera Refinement and Direct Panchromatic Rendering SP
Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques, we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 in mean MAE on buildings compared to the original EOGS models.
comment: 8 pages, ISPRS
Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution
The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously test cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively in an open-set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
comment: 17 pages, 6 figures, 6 tables
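The second stage described in the abstract above (a k-NN classifier over learned embeddings, fitted on a few labelled samples) can be illustrated with a minimal sketch. The cosine similarity metric, the majority vote, the function names, and the toy 2-D embeddings are all assumptions made for illustration; the abstract does not state which distance or voting rule the authors use.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, support, k=3):
    """Label a query embedding by majority vote among its k most similar support items.

    `support` is a list of (embedding, label) pairs: the few-shot samples.
    The metric (cosine) and the voting rule are illustrative assumptions.
    """
    ranked = sorted(
        support,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy support set standing in for embeddings produced by the first stage.
support_set = [
    ([1.0, 0.0], "real"),
    ([0.9, 0.1], "real"),
    ([0.0, 1.0], "generated"),
    ([0.1, 0.9], "generated"),
    ([0.2, 0.8], "generated"),
]
```

Because the classifier stores only the support embeddings, adding a newly released generator in this sketch amounts to appending a few labelled pairs to `support_set`, with no retraining of the embedding model.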
Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation
This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
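For orientation, the classical Horn-Schunck update at a single resolution can be sketched as follows. This is not the paper's multiresolution implementation: the pyramid, bilinear interpolation, and prolongation steps are omitted, and the derivative stencils, border handling, function name, and default parameters are assumptions chosen for brevity. Each iteration replaces the flow with its neighbourhood mean, corrected by the brightness-constancy residual scaled by 1 / (alpha^2 + Ix^2 + Iy^2).

```python
def horn_schunck(frame1, frame2, alpha=1.0, iterations=100):
    """Single-resolution Horn-Schunck optical flow (illustrative sketch).

    Frames are equally sized lists of rows of floats, at least 2x2.
    Returns (u, v): horizontal and vertical flow fields as lists of rows.
    """
    h, w = len(frame1), len(frame1[0])

    def clamp(i, n):
        return max(0, min(i, n - 1))

    def ddx(img, y, x):
        # Central difference in the interior, one-sided at the borders.
        x0, x1 = clamp(x - 1, w), clamp(x + 1, w)
        return (img[y][x1] - img[y][x0]) / (x1 - x0)

    def ddy(img, y, x):
        y0, y1 = clamp(y - 1, h), clamp(y + 1, h)
        return (img[y1][x] - img[y0][x]) / (y1 - y0)

    # Spatial derivatives averaged over both frames; temporal derivative as a difference.
    ix = [[0.5 * (ddx(frame1, y, x) + ddx(frame2, y, x)) for x in range(w)] for y in range(h)]
    iy = [[0.5 * (ddy(frame1, y, x) + ddy(frame2, y, x)) for x in range(w)] for y in range(h)]
    it = [[frame2[y][x] - frame1[y][x] for x in range(w)] for y in range(h)]

    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]

    def neighbour_mean(f, y, x):
        # 4-neighbour average with replicated borders.
        return 0.25 * (
            f[clamp(y - 1, h)][x] + f[clamp(y + 1, h)][x]
            + f[y][clamp(x - 1, w)] + f[y][clamp(x + 1, w)]
        )

    for _ in range(iterations):
        new_u = [[0.0] * w for _ in range(h)]
        new_v = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                u_bar = neighbour_mean(u, y, x)
                v_bar = neighbour_mean(v, y, x)
                residual = ix[y][x] * u_bar + iy[y][x] * v_bar + it[y][x]
                denom = alpha ** 2 + ix[y][x] ** 2 + iy[y][x] ** 2
                new_u[y][x] = u_bar - ix[y][x] * residual / denom
                new_v[y][x] = v_bar - iy[y][x] * residual / denom
        u, v = new_u, new_v
    return u, v


# Toy input: a horizontal intensity ramp shifted one pixel to the right.
ramp = [[float(x) for x in range(6)] for _ in range(5)]
shifted = [[float(x - 1) for x in range(6)] for _ in range(5)]
flow_u, flow_v = horn_schunck(ramp, shifted, alpha=1.0, iterations=100)
```

On this toy ramp the recovered horizontal flow approaches one pixel and the vertical flow stays at zero. Larger `alpha` weights the smoothness term more heavily and slows convergence of the data term.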
Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration
We present a robust multi-camera gymnast tracking method, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast's 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast's 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resorts to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
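The ray-plane intersection mentioned in the abstract above is standard geometry and can be sketched in a few lines. The function name, the argument layout, and the example camera placement are hypothetical; how the authors back-project detections into rays and define the vertical plane is not specified in the abstract.

```python
def ray_plane_intersection(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Intersect the ray origin + t * direction (t >= 0) with a plane.

    The plane is given by a point on it and its normal. Returns the 3D
    intersection as a tuple, or None if the ray is parallel to the plane
    or the plane lies behind the ray origin.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    denom = dot(plane_normal, direction)
    if abs(denom) < eps:
        return None  # ray runs parallel to the plane
    offset = [p - o for p, o in zip(plane_point, origin)]
    t = dot(plane_normal, offset) / denom
    if t < 0:
        return None  # intersection would be behind the camera
    return tuple(o + t * d for o, d in zip(origin, direction))


# Hypothetical setup: camera 10 m in front of the vertical plane y = 0,
# viewing ray pointing straight at it at a height of 2 m.
candidate = ray_plane_intersection(
    origin=(0.0, -10.0, 2.0),
    direction=(0.0, 1.0, 0.0),
    plane_point=(0.0, 0.0, 0.0),
    plane_normal=(0.0, 1.0, 0.0),
)
```

A single view yields one such candidate per detection, which is why a plane constraint can stand in for triangulation when too few views see the gymnast.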
Contrastive vision-language learning with paraphrasing and negation
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while considerably increasing the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization
Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.
YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras
Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.
MiMo-Embodied: X-Embodied Foundation Model Technical Report
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
comment: Code: https://github.com/XiaomiMiMo/MiMo-Embodied Model: https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B
Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variable number of images acquired per imaging study. We trained baseline models and different configurations of the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
comment: 5 pages, 3 figures
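Feature-wise linear modulation, as used in the entry above, scales and shifts each feature channel with parameters predicted from a conditioning input, here the acquisition time. The sketch below shows only that core operation. The linear mapping from time to (gamma, beta), the function names, and all numeric values are hypothetical; the abstract does not describe how the authors generate the modulation parameters.

```python
def film(features, gamma, beta):
    """Apply FiLM: out[c][i] = gamma[c] * features[c][i] + beta[c].

    `features` is a list of channels, each a flat list of floats;
    `gamma` and `beta` hold one scalar per channel.
    """
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


def time_to_film_params(t, w_gamma, b_gamma, w_beta, b_beta):
    """Hypothetical conditioning network: one linear map per channel from time t."""
    gamma = [wg * t + bg for wg, bg in zip(w_gamma, b_gamma)]
    beta = [wb * t + bb for wb, bb in zip(w_beta, b_beta)]
    return gamma, beta


# Two feature channels modulated by an acquisition time of 2.0 (arbitrary units).
gamma, beta = time_to_film_params(
    2.0,
    w_gamma=[0.5, 0.0], b_gamma=[1.0, 1.0],
    w_beta=[0.0, 1.0], b_beta=[0.0, -1.0],
)
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma, beta)
```

Since the modulation is applied per image, the same layer can in principle be reused for however many post-contrast images a study contains, each with its own acquisition time.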
Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation
Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.
Flow and Depth Assisted Video Prediction with Latent Transformer
Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify it to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry
The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.
Arctic-Extract Technical Report
Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighing only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs AAAI'26
Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.
comment: Accepted at AAAI'26
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
StreetView-Waste: A Multi-Task Dataset for Urban Waste Management WACV 2026
Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.
comment: Accepted at WACV 2026
Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation
Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples through a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture that utilizes inherent target property language descriptions to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because the text-vision modal shift makes it difficult for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.
Graph Neural Networks for Surgical Scene Segmentation
Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. They produce anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
comment: 12 pages, 4 figures, 3 tables
CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
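A minimal sketch of the cylinder projection and distance-based attention described above. It assumes a vertical cylinder axis along y, a Gaussian weighting of distances, and hypothetical function names; the abstract does not give the exact parameterization, so none of this should be read as the authors' formulation.

```python
import math


def project_to_cylinder(x, y, z):
    """Project a 3D point along its ray from the origin onto a unit cylinder around the y axis.

    Returns (azimuth in radians, height on the cylinder).
    """
    radius = math.hypot(x, z)
    if radius == 0.0:
        raise ValueError("points on the cylinder axis have no projection")
    return math.atan2(x, z), y / radius


def cylinder_distance(p, q):
    """Distance on the unit cylinder surface, with the azimuth wrapped around 2*pi."""
    dtheta = abs(p[0] - q[0]) % (2.0 * math.pi)
    dtheta = min(dtheta, 2.0 * math.pi - dtheta)
    return math.hypot(dtheta, p[1] - q[1])


def attention_weights(query, keys, sigma=0.1):
    """Non-learned weights that decay with cylinder distance (Gaussian kernel, assumed)."""
    raw = [math.exp(-cylinder_distance(query, k) ** 2 / (2.0 * sigma ** 2)) for k in keys]
    total = sum(raw)
    if total == 0.0:  # every key is too far away; fall back to uniform weights
        return [1.0 / len(keys)] * len(keys)
    return [w / total for w in raw]


# A pixel's position on the cylinder attends mostly to its nearest neighbour,
# regardless of which camera that neighbour came from.
weights = attention_weights((0.0, 0.0), [(0.05, 0.0), (1.0, 0.0)])
```

Because the weights depend only on positions on the shared cylinder, pixels from different cameras that observe the same surface region end up aggregating each other's features.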
End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss
Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
comment: The source code is available at https://github.com/wer010/GLRBM-Mocap
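For readers unfamiliar with geodesic losses on rotations, the following sketch computes the standard geodesic distance on SO(3), i.e. the angle of the relative rotation between a predicted and a target rotation matrix. The abstract does not state the exact loss formulation, so the averaging over joints and the helper names are assumptions.

```python
import math


def geodesic_distance(r_pred, r_true):
    """Angle (radians) of the relative rotation between two 3x3 rotation matrices."""
    # trace(r_pred^T r_true) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding error
    return math.acos(cos_angle)


def geodesic_loss(pred_rotations, true_rotations):
    """Mean geodesic distance over a list of per-joint rotations (assumed reduction)."""
    distances = [geodesic_distance(p, t) for p, t in zip(pred_rotations, true_rotations)]
    return sum(distances) / len(distances)


def rotation_z(theta):
    """Rotation by `theta` radians about the z axis, used here only as test data."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


loss = geodesic_loss([rotation_z(0.3), rotation_z(0.0)], [rotation_z(1.0), rotation_z(0.0)])
```

Unlike an element-wise error on matrix entries or Euler angles, this distance respects the manifold structure of rotations: it depends only on the relative rotation between prediction and target.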
CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement
Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.
Multi-Order Matching Network for Alignment-Free Depth Super-Resolution
Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.
CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering
Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift and scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.
Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach
The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method, utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection
Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks
Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.
ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery
Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.
WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement
Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.
NaTex: Seamless Texture Generation as Latent Color Diffusion
We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.
comment: Technical Report
BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks NeurIPS 2025
ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
comment: Accepted at the 3rd Imageomics Workshop at NeurIPS 2025
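As a reminder of what a class-balanced macro-F1 measures, here is a small standalone sketch. It is not BioBench's API (which is not described beyond the abstract); the function name `macro_f1` and the choice to average over every class seen in either the labels or the predictions are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so rare classes count as much as common ones."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Class 0 has F1 = 2/3 and class 1 has F1 = 4/5, so the macro average is 11/15.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

Because every class contributes equally, a model that ignores rare species is penalized even when its overall accuracy is high, which matters for long-tailed ecological data.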
Sparse Autoencoders are Topic Models
Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.
Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in approximately 0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
comment: 15 pages, 12 figures
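The abstract positions the method between Gaussian Splatting and Joint Bilateral Upsampling. The sketch below shows only the classic joint bilateral idea on a 1D signal with a fixed isotropic kernel; it is not the learned anisotropic kernel of the paper, and the function name and default bandwidths are assumptions chosen for illustration.

```python
import math


def joint_bilateral_upsample(low_res, guide, scale, sigma_spatial=1.0, sigma_range=0.1):
    """Upsample `low_res` to the length of `guide` using spatial and range (guide) weights."""
    output = []
    for p, guide_p in enumerate(guide):
        weighted_sum = 0.0
        weight_total = 0.0
        for q, value in enumerate(low_res):
            spatial = p / scale - q  # offset measured in low-resolution units
            guide_q = guide[min(q * scale, len(guide) - 1)]  # guide sampled at the low-res site
            difference = guide_p - guide_q
            weight = math.exp(
                -spatial ** 2 / (2.0 * sigma_spatial ** 2)
                - difference ** 2 / (2.0 * sigma_range ** 2)
            )
            weighted_sum += weight * value
            weight_total += weight
        output.append(weighted_sum / weight_total if weight_total > 0.0 else 0.0)
    return output


# The guide has a sharp edge between indices 3 and 4; the upsampled signal keeps it sharp.
guide = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
upsampled = joint_bilateral_upsample([0.0, 0.0, 1.0, 1.0], guide, scale=2)
```

The range term suppresses contributions from low-resolution samples whose guide value differs from the target pixel's, which is what makes the operator edge-aware rather than a plain Gaussian blur.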
Optimizing 3D Gaussian Splattering for Mobile GPUs
Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1$\times$ and 1.7$\times$ speedup for the sorting and overall 3D scene reconstruction, respectively -- while also reducing memory usage by up to 1.6$\times$ -- demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.
Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability
The task of weed detection is an essential element of precision agriculture, since accurate species identification allows a farmer to apply herbicides selectively and supports sustainable crop management. This paper proposes a hybrid deep learning framework for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method is applied to balance class distributions and improve the generalization of the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment on edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
We present a fast, spatio-temporal scene understanding framework based on the Visual Geometry Grounded Transformer (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning, the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.
TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid
Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
Weakly Supervised Segmentation and Classification of Alpha-Synuclein Aggregates in Brightfield Midbrain Images
Parkinson's disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, forming Lewy bodies and neuritic shapes used in pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with Deep Learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases based on weakly supervised segmentation, robust to immunohistochemical labelling variability, with a ResNet50 classifier. Our approach differentiates between major aggregate morphologies, including Lewy bodies and neurites, with a balanced accuracy of 80%. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.
Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction incomplete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs, ultimately improving the accuracy-running time tradeoff.
How Robot Dogs See the Unseeable
Peering, a side-to-side motion used by animals to estimate distance through motion parallax, offers a powerful bio-inspired strategy to overcome a fundamental limitation in robotic vision: partial occlusion. Conventional robot cameras, with their small apertures and large depth of field, render both foreground obstacles and background objects in sharp focus, causing occluders to obscure critical scene information. This work establishes a formal connection between animal peering and synthetic aperture (SA) sensing from optical imaging. By having a robot execute a peering motion, its camera describes a wide synthetic aperture. Computational integration of the captured images synthesizes an image with an extremely shallow depth of field, effectively blurring out occluding elements while bringing the background into sharp focus. This efficient, wavelength-independent technique enables real-time, high-resolution perception across various spectral bands. We demonstrate that this approach not only restores basic scene understanding but also empowers advanced visual reasoning in large multimodal models, which fail with conventionally occluded imagery. Unlike feature-dependent multi-view 3D vision methods or active sensors like LiDAR, SA sensing via peering is robust to occlusion, computationally efficient, and immediately deployable on any mobile robot. This research bridges animal behavior and robotics, suggesting that peering motions for synthetic aperture sensing are a key to advanced scene understanding in complex, cluttered environments.
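The computational integration step can be illustrated with a shift-and-add sketch on 1D images. It assumes integer pixel shifts, a background at zero disparity, and a hypothetical `integrate` helper; the actual pipeline in the paper is not described at this level of detail.

```python
def integrate(images, shifts):
    """Average the images after shifting each one, which focuses on one depth plane."""
    width = len(images[0])
    result = []
    for i in range(width):
        samples = [img[i + s] for img, s in zip(images, shifts) if 0 <= i + s < width]
        result.append(sum(samples) / len(samples) if samples else 0.0)
    return result


# Three views from a peering motion: a bright background (10) behind a dark occluder (0)
# whose image position moves with the camera because it is closer than the background.
views = [
    [10, 0, 10, 10, 10],
    [10, 10, 0, 10, 10],
    [10, 10, 10, 0, 10],
]

# Focusing on the background (no shift) spreads the occluder over several pixels,
# so the background behind it becomes visible at reduced contrast.
background_focus = integrate(views, [0, 0, 0])

# Focusing on the occluder instead aligns it across views and keeps it sharp.
occluder_focus = integrate(views, [-1, 0, 1])
```

A wider peering motion and more views spread the occluder over more pixels, which is the sense in which a wide synthetic aperture yields a shallow depth of field.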
SwiTrack: Tri-State Switch for Cross-Modal Object Tracking
Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, improving precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles
Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models has recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, reaching 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
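The abstract above does not say which ensembling technique ICCE uses. As one generic way to consolidate conflicting clusterings, the sketch below uses a co-association (evidence accumulation) matrix followed by connected components. The function names and the 0.5 agreement threshold are illustrative assumptions, not details from the paper.

```python
import numpy as np

def co_association(labelings):
    """Fraction of clusterings in which each pair of samples shares a cluster."""
    labelings = np.asarray(labelings)              # shape: (n_heads, n_samples)
    same = labelings[:, :, None] == labelings[:, None, :]
    return same.mean(axis=0)                       # shape: (n_samples, n_samples)

def consensus_labels(labelings, threshold=0.5):
    """Consensus clusters = connected components of the graph that links
    samples co-clustered in more than `threshold` of the clusterings."""
    agree = co_association(labelings) > threshold
    n = agree.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:                               # depth-first traversal
            i = stack.pop()
            for j in np.flatnonzero(agree[i]):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three heads that use different label ids and disagree on a few samples.
heads = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 0, 2],
    [2, 2, 2, 0, 1, 1],
]
pseudo_labels = consensus_labels(heads)
```

Because the co-association matrix only records whether two samples share a cluster, the arbitrary label ids chosen by each head do not need to be matched before ensembling.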
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction
Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.
comment: 12 pages, 9 figures
Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification
Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy or requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies by exploiting shared information between the source and target domains. In the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through cluster-to-holistic alignment. Finally, a new UDA-VI-ReID testing method, i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. Extensive experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.
FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective AAAI 2026
Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
comment: Accepted by AAAI 2026
EvoVLA: Self-Evolving Vision-Language-Action Model
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token length of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation
Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean-robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR). During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds little inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation. These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.
Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion AAAI
Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting, tending to memorize instance-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression leads to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method's state-of-the-art (SOTA) performance.
comment: Accepted for publication at the 40th AAAI Conference on Artificial Intelligence (AAAI-26)
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.
comment: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning
Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval
Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
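The abstract above mentions contrastive training without giving the loss. A common choice for embedding models is the InfoNCE objective with in-batch negatives, sketched here in NumPy. This is an assumed, generic formulation rather than RGE's actual objective; the function name and the temperature of 0.07 are illustrative.

```python
import numpy as np

def info_nce(queries, targets, temperature=0.07):
    """InfoNCE loss with in-batch negatives (generic sketch).

    queries, targets: (batch, dim) arrays where row i of `targets` is the
    positive for row i of `queries`; all other rows act as negatives.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = q @ t.T / temperature                  # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # positives sit on the diagonal

# Perfectly aligned pairs give a near-zero loss; mismatched pairs do not.
embeddings = np.eye(4)
aligned_loss = info_nce(embeddings, embeddings)
shuffled_loss = info_nce(embeddings, np.roll(embeddings, 1, axis=0))
```

In a reasoning-guided setup, the query embedding would be extracted after the rationale tokens have been generated, but that extraction step is model-specific and is not shown here.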
LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM
Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
comment: 18 pages
A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection
Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. The scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. Besides, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water bodies. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be available at https://github.com/QingMa1/SSCP.
Real-Time 3D Object Detection with Inference-Aligned Learning AAAI 2026
Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics, and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used at inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address the limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.
comment: Accepted by AAAI 2026
Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video
Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, thus overlooking the varying computational demands inherent to different compression levels. We therefore introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
How Noise Benefits AI-generated Image Detection
The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.
VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation AAAI 2026
Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic artifacts at the flows' edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurred details in bilinearly upsampled flows, which makes the flows' edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Code is available at: https://github.com/Wucy0519/VTinker.
comment: Accepted by AAAI 2026
Decoupling Complexity from Scale in Latent Diffusion Model
Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.
comment: 15 pages, 16 figures
Clustered Error Correction with Grouped 4D Gaussian Splatting SIGGRAPH
Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition, which pinpoint dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting, which improves the consistency of the mapping between splats and the dynamic objects they represent. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on the Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving PSNR by 0.39 dB on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method's capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.
comment: 16 pages, 8 figures, SIGGRAPH Asia Conference Papers 2025
T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on a small number of demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments
We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.
SpectralTrain: A Universal Framework for Hyperspectral Image Classification
Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets -- Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 -- demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent training speedups of 2-7x, with small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.
VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler to emulate human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving
Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.
AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers
Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, severely limiting scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales significantly contributes to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
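The idea of keeping only condensed (coarsest) and local (most recent) scales in the KV cache can be illustrated with a minimal sketch. This is not the AMS-KV implementation: the function names, the `num_condensed` and `local_window` parameters, and the per-scale side lengths are all hypothetical, and the layer-wise adaptation described in the abstract is omitted.

```python
def scales_to_keep(current_scale, num_condensed, local_window):
    """Indices of earlier scales whose KVs stay cached while generating
    `current_scale` (hypothetical policy: coarsest + most recent scales)."""
    condensed = set(range(min(num_condensed, current_scale)))
    local = set(range(max(0, current_scale - local_window), current_scale))
    return sorted(condensed | local)


def cached_tokens(kept_scales, side_lengths):
    """Number of cached tokens, assuming scale s is a side x side token map."""
    return sum(side_lengths[s] ** 2 for s in kept_scales)


# Illustrative side lengths per scale (an assumption, not taken from the paper).
SIDES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

kept = scales_to_keep(current_scale=9, num_condensed=2, local_window=2)
full = list(range(9))  # vanilla caching keeps every earlier scale
print(kept, cached_tokens(kept, SIDES), cached_tokens(full, SIDES))
```

In this toy setting the saving is modest because the two most recent scales hold most of the tokens; the figures reported in the abstract come from the paper's own policy and models, not from this sketch.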
LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets
Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in free-living environments. In addition to this domain-shift problem, real-world food datasets tend to follow long-tailed distributions, and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.
Crossmodal learning for Crop Canopy Trait Estimation
Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.
comment: 18 pages, 7 figures
CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis
3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation levels, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/
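The curriculum described above (unlock perturbation levels over time, sample students from the active level, promote those above a quality threshold) can be sketched roughly as follows. This is an illustration under stated assumptions, not the CuriGS code: the fixed `unlock_every` interval, the function names, and the scalar quality score are hypothetical stand-ins for details the abstract does not specify.

```python
import random


def active_level(iteration, unlock_every, num_levels):
    """Highest perturbation level unlocked so far (assumes a fixed interval)."""
    return min(iteration // unlock_every, num_levels - 1)


def sample_students(students_by_level, iteration, unlock_every, k, rng):
    """Randomly pick up to k student views from the currently active level."""
    level = active_level(iteration, unlock_every, len(students_by_level))
    pool = students_by_level[level]
    return level, rng.sample(pool, min(k, len(pool)))


def promote(scored_students, threshold):
    """Return students whose score meets the threshold, best first.

    `scored_students` is a list of (student, score) pairs; the score stands in
    for the multi-signal metric mentioned in the abstract.
    """
    kept = [(score, s) for s, score in scored_students if score >= threshold]
    return [s for score, s in sorted(kept, key=lambda t: t[0], reverse=True)]


# Three perturbation levels with made-up student view identifiers.
students = [["s0a", "s0b"], ["s1a", "s1b"], ["s2a", "s2b"]]
level, picked = sample_students(students, 2500, 1000, 1, random.Random(0))
print(level, picked)
print(promote([("a", 0.9), ("b", 0.4), ("c", 0.95)], threshold=0.8))
```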
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
comment: Preprint. Work in progress
Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI
As AI systems grow more capable, it becomes increasingly important that their decisions remain understandable and aligned with human expectations. A key challenge is the limited interpretability of deep models. Post-hoc methods like GradCAM offer heatmaps but provide limited conceptual insight, while prototype-based approaches offer example-based explanations but often rely on rigid region selection and lack semantic consistency. To address these limitations, we propose PCMNet, a part-prototypical concept mining network that learns human-comprehensible prototypes from meaningful image regions without additional supervision. By clustering these prototypes into concept groups and extracting concept activation vectors, PCMNet provides structured, concept-level explanations and enhances robustness to occlusion and challenging conditions, which are both critical for building reliable and aligned AI systems. Experiments across multiple image classification benchmarks show that PCMNet outperforms state-of-the-art methods in interpretability, stability, and robustness. This work contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems. Our code is available at: https://github.com/alehdaghi/PCMNet.
On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation
The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, would underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.
comment: NeurIPS 2025; available at https://github.com/LiyaoTang/GEM
Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering
Multi-view clustering is an important research topic due to its capability to utilize complementary information from multiple views. However, few methods consider the negative impact of views with unclear clustering structures, which results in poor multi-view clustering performance. To address this drawback, we propose self-supervised discriminative feature learning for deep multi-view clustering (SDMVC). Concretely, deep autoencoders are applied to learn embedded features for each view independently. To leverage the multi-view complementary information, we concatenate all views' embedded features to form the global features, which can overcome the negative impact of some views' unclear clustering structures. In a self-supervised manner, pseudo-labels are obtained to build a unified target distribution to perform multi-view discriminative feature learning. During this process, global discriminative information can be mined to supervise all views to learn more discriminative features, which in turn are used to update the target distribution. Besides, this unified target distribution can make SDMVC learn consistent cluster assignments, which accomplishes the clustering consistency of multiple views while preserving their features' diversity. Experiments on various types of multi-view datasets show that SDMVC outperforms 14 competitors including classic and state-of-the-art methods. The code is available at https://github.com/SubmissionsIn/SDMVC.
Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental Modelling
Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.
comment: 10 pages (including 2 pages of references), 7 figures
vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs AAAI 2026
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
comment: Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)
DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks
This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias, and performed significantly worse than humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method by constructing two comparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.
A Decade of You Only Look Once (YOLO) for Object Detection: A Review
This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions (from YOLOv1 to YOLOv13), highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework's continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO's trajectory and ongoing transformation.
Active Measurement: Efficient Estimation at Scale NeurIPS 2025
AI has the potential to transform scientific discovery by analyzing vast datasets with little human effort. However, current workflows often do not provide the accuracy or statistical guarantees that are needed. We introduce active measurement, a human-in-the-loop AI framework for scientific measurement. An AI model is used to predict measurements for individual units, which are then sampled for human labeling using importance sampling. With each new set of human labels, the AI model is improved and an unbiased Monte Carlo estimate of the total measurement is refined. Active measurement can provide precise estimates even with an imperfect AI model, and requires little human effort when the AI model is very accurate. We derive novel estimators, weighting schemes, and confidence intervals, and show that active measurement reduces estimation error compared to alternatives in several measurement tasks.
comment: NeurIPS 2025
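As a rough illustration of the sampling step in the abstract above, the sketch below uses the textbook importance-sampling estimator of a total: units are drawn with probability proportional to the AI prediction, and each human label is reweighted by the inverse of its sampling probability. It is not the paper's estimator (the authors derive their own estimators and weighting schemes); the function name and the `label_fn` callback standing in for human labeling are hypothetical.

```python
import random


def importance_sampling_total(predictions, label_fn, n_samples, rng):
    """Unbiased estimate of sum_i f(i), where f(i) = label_fn(i).

    Units are sampled with replacement with probability q_i proportional to
    predictions[i] (all predictions must be positive). Each draw contributes
    f(i) / q_i, and the average of those terms has expectation sum_i f(i).
    """
    total_pred = sum(predictions)
    probs = [p / total_pred for p in predictions]
    draws = rng.choices(range(len(predictions)), weights=probs, k=n_samples)
    return sum(label_fn(i) / probs[i] for i in draws) / n_samples


# Toy data: true per-unit measurements and an imperfect AI prediction of them.
truth = [3.0, 1.0, 4.0, 1.0, 5.0]
noisy_preds = [2.5, 1.5, 4.5, 0.5, 5.5]
estimate = importance_sampling_total(
    noisy_preds, lambda i: truth[i], n_samples=200, rng=random.Random(0)
)
print(estimate, sum(truth))
```

When the predictions are exactly proportional to the true values, every term `f(i) / q_i` equals the true total, so the estimator has zero variance. This matches the abstract's remark that little human effort is needed when the AI model is very accurate.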
Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities WACV 2026
Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities -- characterized by simple structures and high-contrast patterns -- have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.
comment: accepted to WACV 2026
Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
comment: 16 pages; Previously this version appeared as arXiv:2510.15430 which was submitted as a new work by accident
Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition
Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during spatial pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies -- improving from 86.4% to 93.0% in hand-intensive actions -- while maintaining fewer GFLOPs and parameters than the relevant unified methods.
comment: 7 figures, 8 pages
MHR: Momentum Human Rig
We present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. Our model enables expressive, anatomically plausible human animation, supporting non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines.
Unsupervised learning of spatially varying regularization for diffeomorphic image registration
Spatially varying regularization accommodates the deformation variations that may be necessary for different anatomical regions during deformable image registration. Historically, optimization-based registration models have harnessed spatially varying regularization to address anatomical subtleties. However, most modern deep learning-based models tend to gravitate towards spatially invariant regularization, wherein a homogeneous regularization strength is applied across the entire image, potentially disregarding localized variations. In this paper, we propose a hierarchical probabilistic model that integrates a prior distribution on the deformation regularization strength, enabling the end-to-end learning of a spatially varying deformation regularizer directly from the data. The proposed method is straightforward to implement and easily integrates with various registration network architectures. Additionally, automatic tuning of hyperparameters is achieved through Bayesian optimization, allowing efficient identification of optimal hyperparameters for any given registration task. Comprehensive evaluations on publicly available datasets demonstrate that the proposed method significantly improves registration performance and enhances the interpretability of deep learning-based registration, all while maintaining smooth deformations.
comment: Accepted to Medical Image Analysis ((c) MedIA). Code available at http://bit.ly/3BrXGxz
CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation BMVC 2025
Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they rely on highly complex distillation losses or pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices. Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multilayer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD), with gains of up to 10% mIoU, especially when fine-tuning on very small amounts of data, showing the effectiveness of our simple yet powerful KD strategy.
comment: Accepted to BMVC 2025
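As a rough illustration of the "direct feature similarity loss in combination with an MLP projection head" mentioned in the CleverDistiller abstract, the sketch below projects per-point 3D features into the 2D feature dimension and penalizes one minus the cosine similarity to paired image features. This is a minimal sketch under stated assumptions, not the authors' implementation: the choice of cosine similarity, the two-layer ReLU head, all dimensions, and the random data are hypothetical.

```python
import numpy as np

def mlp_projection(x, w1, b1, w2, b2):
    """Two-layer MLP head (assumed shape) mapping 3D point features
    to the dimensionality of the 2D image features."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU non-linearity
    return hidden @ w2 + b2

def cosine_similarity_loss(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) over paired feature rows."""
    pred_n = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=1)))

# Hypothetical sizes and random stand-in features, for illustration only.
rng = np.random.default_rng(0)
num_points, dim_3d, dim_hidden, dim_2d = 5, 16, 32, 8
point_feats = rng.normal(size=(num_points, dim_3d))  # from the 3D student
image_feats = rng.normal(size=(num_points, dim_2d))  # from the 2D teacher (VFM)

w1 = rng.normal(size=(dim_3d, dim_hidden)) * 0.1
b1 = np.zeros(dim_hidden)
w2 = rng.normal(size=(dim_hidden, dim_2d)) * 0.1
b2 = np.zeros(dim_2d)

projected = mlp_projection(point_feats, w1, b1, w2, b2)
loss = cosine_similarity_loss(projected, image_feats)
```

In a training loop the loss would be minimized with respect to the student and head parameters while the teacher features stay fixed; how the point-to-pixel pairs are formed is outside this sketch.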
TC-Light: Temporally Coherent Generative Rendering for Realistic World Transfer
Illumination and texture editing are critical dimensions for world-to-world transfer, which is valuable for applications including sim2real and real2real visual data scaling up for embodied AI. Existing techniques generatively re-render the input video to realize the transfer, such as video relighting models and conditioned world generation models. Nevertheless, these models are predominantly limited to the domain of training data (e.g., portrait) or fall into the bottleneck of temporal consistency and computation efficiency, especially when the input video involves complex dynamics and long durations. In this paper, we propose TC-Light, a novel generative renderer to overcome these problems. Starting from the video preliminarily relighted by an inflated video relighting model, it optimizes appearance embedding in the first stage to align global illumination. Then it optimizes the proposed canonical video representation, i.e., Unique Video Tensor (UVT), to align fine-grained texture and lighting in the second stage. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method enables physically plausible re-rendering results with superior temporal coherence and low computation cost. The code and video demos are available at https://dekuliutesla.github.io/tclight/.
comment: Project Page: https://dekuliutesla.github.io/tclight/ Code: https://github.com/Linketic/TC-Light
CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution AAAI
Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrast disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrate that CD-DPE exhibits strong generalization capabilities.
comment: This paper has been accepted by AAAI, but due to the final camera-ready version not being finalized, there are still some expression errors. It will be re-published after correction
End-to-End 4D Heart Mesh Recovery Across Full-Stack and Sparse Cardiac MRI
Reconstructing cardiac motion from CMR sequences is critical for diagnosis, prognosis, and intervention. Existing methods rely on complete CMR stacks to infer full heart motion, limiting their applicability during intervention when only sparse observations are available. We present TetHeart, the first end-to-end framework for unified 4D heart mesh recovery from both offline full-stack and intra-procedural sparse-slice observations. Our method leverages deformable tetrahedra to capture shape and motion in a coherent space shared across cardiac structures. Before a procedure, it initializes detailed, patient-specific heart meshes from high-quality full stacks, which can then be updated using whatever slices can be obtained in real-time, down to a single one during the procedure. TetHeart incorporates several key innovations: (i) an attentive slice-adaptive 2D-3D feature assembly mechanism that integrates information from arbitrary numbers of slices at any position; (ii) a distillation strategy to ensure accurate reconstruction under extreme sparsity; and (iii) a weakly supervised motion learning scheme requiring annotations only at keyframes, such as the end-diastolic and end-systolic phases. Trained and validated on three large public datasets and evaluated zero-shot on additional private interventional and public datasets without retraining, TetHeart achieves state-of-the-art accuracy and strong generalization in both pre- and intra-procedural settings.
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, while remaining robust to black-box attacks in the universal setting.
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation NeurIPS 2025
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
comment: NeurIPS 2025
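The CoRL abstract builds on group relative policy optimization (GRPO). As a minimal sketch of the group-relative step only, the code below standardizes the rewards of several responses sampled for the same prompt, so that each response is scored relative to its own group rather than by a learned value function. The reward values are hypothetical, and the sketch says nothing about how CoRL defines its rewards or its unified and refined RL stages.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four responses to the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones; a group with identical rewards yields all-zero advantages and therefore no learning signal.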
FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI
Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
Seeing Beyond Haze: Generative Nighttime Image Dehazing
Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or entirely obscure background information. Existing methods often struggle due to insufficient background priors and limited generative capability, both of which are highly important under such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only reduces haze and glow effects but also reconstructs plausible background structures in regions where visual cues are heavily degraded. Our approach is built on two main ideas: obtaining strong background priors by adapting image diffusion models to nighttime dehazing, and enhancing generative ability in haze- and glow-obscured areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model while preserving its capacity to generate clean images. The diffusion model is further trained on tailored image pairs to improve its ability to recover background details that are suppressed by haze effects. Since generative models may introduce hallucinated content, we design our framework to allow user control over the generative level, enabling a balance between visual realism and fidelity. Experiments on real-world nighttime images demonstrate that BeyondHaze substantially improves visibility and scene detail under dense haze.
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-ups of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
comment: Website: https://kandinskylab.ai/
Multimodal Evaluation of Russian-language Architectures
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
System Filter-Based Common Components Modeling for Cross-Subject EEG Decoding
Brain-computer interface (BCI) technology enables direct communication between the brain and external devices through electroencephalography (EEG) signals. However, existing decoding models often mix common and personalized components, leading to interference from individual variability that limits cross-subject decoding performance. To address this issue, this paper proposes a system filter that extends the concept of signal filtering to the system level. The method expands a system into its spectral representation, selectively removes unnecessary components, and reconstructs the system from the retained target components, thereby achieving explicit system-level decomposition and filtering. We further integrate the system filter into a Cross-Subject Decoding framework based on the System Filter (CSD-SF) and evaluate it on the four-class motor imagery (MI) task of the BCIC IV 2a dataset. Personalized models are transformed into relation spectrums, and statistical testing across subjects is used to remove personalized components. The remaining stable relations, representing common components across subjects, are then used to construct a common model for cross-subject decoding. Experimental results show an average improvement of 3.28% in decoding accuracy over baseline methods, demonstrating that the proposed system filter effectively isolates stable common components and enhances model robustness and generalizability in cross-subject EEG decoding.
comment: 12 pages, 11 figures
DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation 3DV 2026
Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition. However, their potential in 3D scene segmentation remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D-3D fusion, recent state-of-the-art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a generally applicable approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we additionally propose to pretrain 3D models by distilling 2D foundation models. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.
comment: Accepted to 3DV 2026. Project page at https://vision.rwth-aachen.de/ditr
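The DITR abstract describes extracting 2D foundation model features and projecting them to 3D. One common way to associate image features with points, sketched below, is to project each point into the image with a pinhole camera model and read the feature at the nearest pixel. This is a sketch under stated assumptions rather than DITR's actual pipeline: the function name `lift_image_features`, the skew-free intrinsics, nearest-pixel sampling, and the toy feature map are all hypothetical, and the step that injects the features into the 3D segmentation model is not covered.

```python
import numpy as np

def lift_image_features(points_cam, feat_map, K):
    """Assign each 3D point the image feature at its projected pixel.

    points_cam: (N, 3) points in camera coordinates (z pointing forward).
    feat_map:   (H, W, C) per-pixel features from a 2D model.
    K:          (3, 3) pinhole intrinsics, assumed to have no skew.
    Returns (N, C) features and an (N,) mask of points that land in the image.
    """
    height, width, channels = feat_map.shape
    z = points_cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth
    u = K[0, 0] * points_cam[:, 0] / safe_z + K[0, 2]
    v = K[1, 1] * points_cam[:, 1] / safe_z + K[1, 2]
    col = np.round(u).astype(int)
    row = np.round(v).astype(int)
    valid = in_front & (col >= 0) & (col < width) & (row >= 0) & (row < height)
    feats = np.zeros((points_cam.shape[0], channels), dtype=feat_map.dtype)
    feats[valid] = feat_map[row[valid], col[valid]]  # nearest-pixel lookup
    return feats, valid

# Toy feature map whose feature at pixel (row, col) is simply [row, col].
rows, cols = np.meshgrid(np.arange(4), np.arange(6), indexing="ij")
feat_map = np.stack([rows, cols], axis=-1).astype(np.float64)  # shape (4, 6, 2)
K = np.array([[2.0, 0.0, 3.0],
              [0.0, 2.0, 2.0],
              [0.0, 0.0, 1.0]])
points_cam = np.array([[0.0, 0.0, 1.0],    # projects to row 2, col 3
                       [1.0, -1.0, 2.0],   # projects to row 1, col 4
                       [0.0, 0.0, -1.0],   # behind the camera
                       [10.0, 0.0, 1.0]])  # outside the image
point_feats, valid = lift_image_features(points_cam, feat_map, K)
```

Points that fall behind the camera or outside the image keep a zero feature and are flagged in the mask, so a downstream model can treat them as missing rather than as real observations.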
MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
We address the problem of facial expression editing by controlling the relative variation of facial action units (AUs) from the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.
VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
RoMa v2: Harder Better Faster Denser Feature Matching
Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2
comment: Added acknowledgements, and some minor fixes
Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI MICCAI 2025
Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results demonstrate the effectiveness of our proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.
comment: Accepted at MICCAI 2025 Workshop
Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering
Hyperspectral image (HSI) clustering groups pixels into clusters without labeled data, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from the FineGym and FineDiving datasets. Previous CoVR benchmarks that focus on temporal aspects link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 27.22.
Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillation
Event cameras sense brightness changes and output binary asynchronous event streams, attracting increasing attention. Their bio-inspired dynamics align well with spiking neural networks (SNNs), offering a promising energy-efficient alternative to conventional vision systems. However, SNNs remain costly to train due to temporal coding, which limits their practical deployment. To alleviate the high training cost of SNNs, we introduce PACE (Phase-Aligned Condensation for Events), the first dataset distillation framework for SNNs and event-based vision. PACE distills a large training dataset into a compact synthetic one that enables fast SNN training, which is achieved by two core modules: ST-DSM and PEQ-N. ST-DSM uses residual membrane potentials to densify spike-based features (SDR) and to perform fine-grained spatiotemporal matching of amplitude and phase (ST-SM), while PEQ-N provides a plug-and-play straight-through probabilistic integer quantizer compatible with standard event-frame pipelines. Across the DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, PACE outperforms existing coreset selection and dataset distillation baselines, with particularly strong gains on dynamic event streams and at low or moderate IPC. Specifically, on N-MNIST, it achieves 84.4% accuracy, about 85% of the full training set performance, while reducing training time by more than $50\times$ and storage cost by $6000\times$, yielding compact surrogates that enable minute-scale SNN training and efficient edge deployment.
Human Motion Unlearning
We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging because they can be generated both from explicit text prompts and from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets HumanML3D and Motion-X. We propose baselines by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suited to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.
Introducing DEFORMISE: A deep learning framework for dementia diagnosis in the elderly using optimized MRI slice selection
Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce DEFORMISE, a novel DEep learning Framework for dementia diagnOsis of eldeRly patients using 3D brain Magnetic resonance Imaging (MRI) scans with Optimized Slice sElection. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three novel deep learning models. Tested on the Open OASIS datasets, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the ADNI dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
Enhancing efficiency in paediatric brain tumour segmentation using a pathologically diverse single-center clinical dataset
Background: Brain tumours are the most common solid malignancies in children, encompassing diverse histological and molecular subtypes, imaging features, and outcomes. Paediatric brain tumours (PBTs), including high- and low-grade gliomas (HGG, LGG), medulloblastomas (MB), ependymomas, and rarer forms, pose diagnostic and therapeutic challenges. Deep learning (DL)-based segmentation offers promising tools for tumour delineation, yet its performance across heterogeneous PBT subtypes and MRI protocols remains uncertain. Methods: A retrospective single-centre cohort of 174 paediatric patients with HGG, LGG, MB, ependymomas, and other rarer subtypes was used. MRI sequences included T1, T1 post-contrast (T1-C), T2, and FLAIR. Manual annotations were provided for four tumour subregions: whole tumour (WT), T2-hyperintensity (T2H), enhancing tumour (ET), and cystic component (CC). A 3D nnU-Net model was trained and tested (121/53 split), with segmentation performance assessed using the Dice similarity coefficient (DSC) and compared against intra- and inter-rater variability. Results: The model achieved robust performance for WT and T2H (mean DSC: 0.85), comparable to human annotator variability (mean DSC: 0.86). ET segmentation was moderately accurate (mean DSC: 0.75), while CC performance was poor. Segmentation accuracy varied by tumour type, MRI sequence combination, and location. Notably, T1, T1-C, and T2 alone produced results nearly equivalent to the full protocol. Conclusions: DL is feasible for PBTs, particularly for T2H and WT. Challenges remain for ET and CC segmentation, highlighting the need for further refinement. These findings support the potential for protocol simplification and automation to enhance volumetric assessment and streamline paediatric neuro-oncology workflows.
comment: A. Jakab and F. Kofler have shared last authorship
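The Dice similarity coefficient (DSC) used in the abstract above to score segmentations can be illustrated with a minimal sketch. It assumes binary masks stored as NumPy arrays; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are illustrative choices, not taken from the paper.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks.

    DSC = 2 * |pred AND target| / (|pred| + |target|)
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / total)

# Toy example: one voxel overlaps out of 2 predicted and 1 annotated.
example = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they do not overlap at all, which is why a model DSC of 0.85 against an inter-rater DSC of 0.86 reads as near-human agreement.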
LSAP: Rethinking Inversion Fidelity, Perception and Editability in GAN Latent Space
As research on image inversion advances, the process is generally divided into two stages. The first stage, Image Embedding, uses an encoder or optimization procedure to embed an image and obtain its corresponding latent code. The second stage, referred to as Result Refinement, further improves the inversion and editing outcomes. Although this refinement stage substantially enhances reconstruction fidelity, perception and editability remain largely unchanged and are highly dependent on the latent codes derived from the first stage. Therefore, a key challenge lies in obtaining latent codes that preserve reconstruction fidelity while simultaneously improving perception and editability. In this work, we first reveal that these two properties are closely related to the degree of alignment (or disalignment) between the inverted latent codes and the synthetic distribution. Based on this insight, we propose the Latent Space Alignment Inversion Paradigm (LSAP), which integrates both an evaluation metric and a unified inversion solution. Specifically, we introduce the Normalized Style Space ($\mathcal{S^N}$ space) and Normalized Style Space Cosine Distance (NSCD) to quantify the disalignment of inversion methods. Moreover, our paradigm can be optimized for both encoder-based and optimization-based embeddings, providing a consistent alignment framework. Extensive experiments across various domains demonstrate that NSCD effectively captures perceptual and editable characteristics, and that our alignment paradigm achieves state-of-the-art performance in both stages of inversion.
comment: under review
Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV AAAI 2026
Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of background distractions. Receptance Weighted Key Value (RWKV), which learns interactions between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relations degraded by frames with similar backgrounds are difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruction of temporal relations. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
comment: Accepted by AAAI 2026 Oral
Linear time small coresets for k-mean clustering of segments with applications
We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.
comment: First published in WALCOM 2026 by Springer Nature
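The segment clustering cost defined in the abstract above, $D(\mathcal{S},X) = \sum_{S} \min_{x \in X} D(S,x)$ with $D(S,x) = \int_{p \in S} |p - x| dp$, can be approximated numerically. The sketch below is only an illustration of that definition, not the paper's coreset construction: it assumes $|\cdot|$ is the Euclidean norm, represents each segment by its two endpoints, and uses a midpoint rule with a hypothetical `n_samples` parameter.

```python
import numpy as np

def segment_to_point_cost(a, b, x, n_samples=1000):
    """Approximate D(S, x): the integral of |p - x| over the segment from a to b."""
    a, b, x = (np.asarray(v, dtype=float) for v in (a, b, x))
    # Midpoints of n_samples equal sub-intervals of the parameter range [0, 1].
    t = (np.arange(n_samples) + 0.5) / n_samples
    points = a + t[:, None] * (b - a)
    length = np.linalg.norm(b - a)
    dists = np.linalg.norm(points - x, axis=1)  # Euclidean norm (assumption)
    return float(length * dists.mean())

def clustering_cost(segments, centers, n_samples=1000):
    """D(S, X): every segment is charged to its cheapest center."""
    return sum(
        min(segment_to_point_cost(a, b, x, n_samples) for x in centers)
        for a, b in segments
    )

# Two unit-length segments, each starting at one of the two centers.
segments = [([0.0, 0.0], [1.0, 0.0]), ([10.0, 0.0], [11.0, 0.0])]
centers = [[0.0, 0.0], [10.0, 0.0]]
total_cost = clustering_cost(segments, centers)
```

Evaluating this cost takes time proportional to the number of segments times the number of centers times the sample count, which is the kind of work an $\varepsilon$-coreset is meant to reduce by replacing the input with a small weighted set.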
IOR: Inversed Objects Replay for Incremental Object Detection
Existing Incremental Object Detection (IOD) methods partially alleviate catastrophic forgetting when incrementally detecting new objects in real-world scenarios. However, many of these methods rely on the assumption that unlabeled old-class objects may co-occur with labeled new-class objects in the incremental data. When unlabeled old-class objects are absent, the performance of existing methods tends to degrade. The absence can be mitigated by generating old-class samples, but it incurs high costs. This paper argues that previous generation-based IOD suffers from redundancy, both in the use of generative models, which require additional training and storage, and in the overproduction of generated samples, many of which do not contribute significantly to performance improvements. To eliminate the redundancy, we propose Inversed Objects Replay (IOR). Specifically, we generate old-class samples by inversing the original detectors, thus eliminating the necessity of training and storing additional generative models. We propose augmented replay to reuse the objects in generated samples, reducing redundant generations. Moreover, we propose high-value knowledge distillation focusing on the positions of old-class objects overwhelmed by the background, which transfers the knowledge to the incremental detector. Extensive experiments conducted on MS COCO 2017 demonstrate that our method can efficiently improve detection performance in IOD scenarios with the absence of old-class objects. The code is available at https://github.com/JiaJia075/IOR.
Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code is available at: https://github.com/gouba2333/MA-HMR.
Fusion of Multi-scale Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis
Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in pathology foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level multi-scale features from WSIs. However, current pathology FMs exhibit substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when features from different FMs are used in downstream tasks. To fully exploit the advantages of multiple FMs, we propose a novel framework for the fusion of multi-scale heterogeneous pathology FMs, called FuseCPath, yielding a model with superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs' embeddings. (ii) To effectively fuse the patch-level FMs, we devise a cluster-level re-embedding strategy to capture patch-level local features online. (iii) To effectively fuse the slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments demonstrate that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on diverse datasets.
comment: 22 pages, 9 figures
UINO-FSS: Unifying Representation Learning and Few-shot Segmentation via Hierarchical Distillation and Mamba-HyperCorrelation
Few-shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories using only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual-branch architectures that combine pre-trained encoders to leverage complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Achieving this is, however, challenging due to the misalignment between class-agnostic segmentation capabilities and fine-grained discriminative representations. To this end, we present UINO-FSS, a novel framework built on the key observation that early-stage DINOv2 features exhibit distribution consistency with SAM's output embeddings. This consistency enables the integration of both models' knowledge into a single-encoder architecture via coarse-to-fine multimodal distillation. In particular, our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta-visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross-model distillation, we effectively transfer SAM's knowledge into the segmenter, further enhanced by Mamba-based 4D correlation mining on support-query pairs. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that UINO-FSS achieves new state-of-the-art results under the 1-shot setting, with mIoU of 80.6 (+3.8%) on PASCAL-5$^i$ and 64.5 (+4.1%) on COCO-20$^i$, demonstrating the effectiveness of our unified approach.
FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (e.g., Chinese, Korean, Japanese). To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only 0.1M training examples, a 97% reduction compared to the 2.9M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.
comment: 10 pages, 5 figures
DuetMatch: Harmonizing Semi-Supervised Brain MRI Segmentation via Decoupled Branch Optimization
The limited availability of annotated data in medical imaging makes semi-supervised learning increasingly appealing for its ability to learn from imperfect supervision. Recently, teacher-student frameworks have gained popularity for their training benefits and robust performance. However, jointly optimizing the entire network can hinder convergence and stability, especially in challenging scenarios. To address this for medical image segmentation, we propose DuetMatch, a novel dual-branch semi-supervised framework with asynchronous optimization, where each branch optimizes either the encoder or decoder while keeping the other frozen. To improve consistency under noisy conditions, we introduce Decoupled Dropout Perturbation, enforcing regularization across branches. We also design Pair-wise CutMix Cross-Guidance to enhance model diversity by exchanging pseudo-labels through augmented input pairs. To mitigate confirmation bias from noisy pseudo-labels, we propose Consistency Matching, refining labels using stable predictions from frozen teacher models. Extensive experiments on benchmark brain MRI segmentation datasets, including ISLES2022 and BraTS, show that DuetMatch consistently outperforms state-of-the-art methods, demonstrating its effectiveness and robustness across diverse semi-supervised segmentation scenarios.
comment: Published in Computerized Medical Imaging and Graphics (CMIG)
Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at https://github.com/jiesihu/Medverse.
Rep-GLS: Report-Guided Generalized Label Smoothing for Robust Disease Detection
Unlike natural image classification, where the ground-truth label is explicit and unambiguous, physicians commonly qualify their interpretations of medical images with expressions of certainty such as "probable" or "likely". Existing medical image datasets typically overlook this nuance and polarise findings into binary labels. Here, we propose a novel framework that leverages a Large Language Model (LLM) to directly mine medical reports and use uncertainty-related expressions as a supervision signal. First, we collect uncertainty keywords from medical reports. Then, we use Qwen-3 4B to identify the textual uncertainty and map it to an adaptive Generalized Label Smoothing (GLS) rate. This rate allows our model to treat uncertain labels not as errors, but as informative signals, effectively incorporating expert skepticism into the training process. We establish a new clinical expert uncertainty-aware benchmark to rigorously evaluate this problem. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in medical disease detection. The curated uncertainty words database, code, and benchmark will be made publicly available upon acceptance.
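The idea of turning report wording into a per-sample smoothing rate can be sketched as follows. This is a minimal illustration using the standard label-smoothing form (a mix of the one-hot label and a uniform distribution); the keyword-to-rate table `HYPOTHETICAL_RATES` and its values are invented for the example and are not the mapping produced by the paper's LLM pipeline.

```python
import numpy as np

# Hypothetical mapping from uncertainty keywords to smoothing rates (assumption).
HYPOTHETICAL_RATES = {"definite": 0.0, "likely": 0.2, "probable": 0.2, "possible": 0.4}

def smooth_label(class_index, num_classes, rate):
    """Soft target: (1 - rate) * one_hot + rate * uniform."""
    one_hot = np.zeros(num_classes)
    one_hot[class_index] = 1.0
    uniform = np.full(num_classes, 1.0 / num_classes)
    return (1.0 - rate) * one_hot + rate * uniform

# A finding reported as "likely" positive (class 1) in a binary task.
soft_target = smooth_label(1, 2, HYPOTHETICAL_RATES["likely"])
```

With a rate of 0 the target stays one-hot, so confidently worded reports keep a hard label while hedged reports yield softer targets that can be fed to a cross-entropy loss.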
Event Stream Filtering via Probability Flux Estimation
Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. The Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination, is also presented for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.
TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding AAAI 2026
Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion approach, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, which uses spatio-temporal tubes as a condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks, VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.
comment: Accepted to AAAI 2026
One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion
Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios - for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by language modeling, where generation is guided by conditioning prompts. However, our framework differs fundamentally from LLMs in two key aspects. First, it employs a bidirectional modeling paradigm that symmetrically allows prompting either from the garment to generate try-on results or from the dressed person to recover the try-off garment. Second, it strictly adheres to Tweedie's formula, enabling faithful estimation of the underlying data distribution during the denoising process. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as input, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical solution for virtual garment synthesis.
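For readers unfamiliar with the Tweedie's formula mentioned in the abstract above, its standard statement for Gaussian corruption is given below as background. The notation ($\sigma_t$, $\bar\alpha_t$, $p_t$) is the usual diffusion-model convention and is an assumption here; the abstract does not specify how OMFA parameterizes it.

```latex
% Additive Gaussian noise: x_t = x_0 + \sigma_t \epsilon, \ \epsilon \sim \mathcal{N}(0, I)
\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2 \, \nabla_{x_t} \log p_t(x_t)

% Variance-preserving form: x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon
\mathbb{E}[x_0 \mid x_t] = \frac{x_t + (1 - \bar\alpha_t) \, \nabla_{x_t} \log p_t(x_t)}{\sqrt{\bar\alpha_t}}
```

In both forms the posterior mean of the clean sample is obtained from the noisy sample and the score of the noisy marginal, which is why a denoiser that follows the formula yields an estimate of the underlying data at every step.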
CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking AAAI 2026
3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite considerable progress, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on the KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
comment: Accepted by AAAI 2026 (Oral)
Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models
Diffusion models have recently emerged as a promising framework for Image Restoration (IR), owing to their ability to produce high-quality reconstructions and their compatibility with established methods. Existing methods for solving noisy inverse problems in IR consider pixel-wise data fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. Our model encourages images to preserve data fidelity in both the spatial and frequency domains, resulting in enhanced reconstruction quality. We comprehensively evaluate the performance of our model on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution. Our thorough evaluation demonstrates that SaFaRI achieves state-of-the-art performance on both the ImageNet and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID metrics.
Image and Video Processing
SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}\&\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}\&\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
comment: 11 pages, 4 figures
Saving Foundation Flow-Matching Priors for Inverse Problems
Foundation flow-matching (FM) models promise a universal prior for solving inverse problems (IPs), yet today they trail behind domain-specific or even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with a sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. This leads to a significant performance boost across image restoration and scientific IPs. Our results point to a path for making foundation FM models practical, reusable priors for IP solving.
Weakly Supervised Segmentation and Classification of Alpha-Synuclein Aggregates in Brightfield Midbrain Images
Parkinson's disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, which form the Lewy bodies and neuritic shapes used in pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with Deep Learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases based on weakly supervised segmentation, robust to immunohistochemical labelling variability, with a ResNet50 classifier. Our approach differentiates between major aggregate morphologies, including Lewy bodies and neurites, with a balanced accuracy of $80\%$. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.
MRI Super-Resolution with Deep Learning: A Comprehensive Survey
High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: https://github.com/mkhateri/Awesome-MRI-Super-Resolution. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.
comment: 41 pages
Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions
Generating animal faces using generative AI techniques is challenging because the available training images are limited both in quantity and variation, particularly for facial expressions across individuals. In this study, we focus on macaque monkeys, widely studied in systems neuroscience and evolutionary research, and propose a method to generate their facial expressions using a style-based generative image model (i.e., StyleGAN2). To address data limitations, we implemented: 1) data augmentation by synthesizing new facial expression images using motion transfer to animate still images with computer graphics, 2) sample selection based on the latent representation of macaque faces from an initially trained StyleGAN2 model to ensure variation and uniform sampling in the training dataset, and 3) loss function refinement to ensure the accurate reproduction of subtle movements, such as eye movements. Our results demonstrate that the proposed method enables the generation of diverse facial expressions for multiple macaque individuals, outperforming models trained solely on original still images. Additionally, we show that our model is effective for style-based image editing, where specific style parameters correspond to distinct facial movements. These findings underscore the model's potential for disentangling motion components as style parameters, providing a valuable tool for research on macaque facial expressions.
Reconfigurable, large-format D-ToF/photon-counting SPAD image sensors with embedded FPGA for scene adaptability
CMOS-compatible single-photon avalanche diodes (SPADs) have emerged in many systems as the solution of choice for cameras with photon-number resolution and photon counting capabilities. Being natively digital optical interfaces, SPADs are naturally drawn to in situ logic processing and event-driven computation; they are usually coupled to discrete FPGAs to enable reconfigurability. In this work, we propose to bring the FPGA on-chip, in direct contact with the SPADs at pixel or cluster level. To demonstrate the suitability of this approach, we created an architecture for processing timestamps and photon counts using programmable weighted sums based on an efficient use of look-up tables. The outputs are processed hierarchically, similarly to what is done in FPGAs, reducing power consumption and simplifying I/Os. Finally, we show how artificial neural networks can be designed and reprogrammed by using look-up tables in an efficient way.
comment: Presented at the International Image Sensor Workshop 2025
Introducing DEFORMISE: A deep learning framework for dementia diagnosis in the elderly using optimized MRI slice selection
Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce DEFORMISE, a novel DEep learning Framework for dementia diagnOsis of eldeRly patients using 3D brain Magnetic resonance Imaging (MRI) scans with Optimized Slice sElection. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three novel deep learning models. Tested on the Open OASIS datasets, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the ADNI dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
Behind the Screens: Uncovering Bias in AI-Driven Video Interview Assessments Using Counterfactuals
AI-enhanced personality assessments are increasingly shaping hiring decisions, using affective computing to predict traits from the Big Five (OCEAN) model. However, integrating AI into these assessments raises ethical concerns, especially around bias amplification rooted in training data. These biases can lead to discriminatory outcomes based on protected attributes like gender, ethnicity, and age. To address this, we introduce a counterfactual-based framework to systematically evaluate and quantify bias in AI-driven personality assessments. Our approach employs generative adversarial networks (GANs) to generate counterfactual representations of job applicants by altering protected attributes, enabling fairness analysis without access to the underlying model. Unlike traditional bias assessments that focus on unimodal or static data, our method supports multimodal evaluation spanning visual, audio, and textual features. This comprehensive approach is particularly important in high-stakes applications like hiring, where third-party vendors often provide AI systems as black boxes. Applied to a state-of-the-art personality prediction model, our method reveals significant disparities across demographic groups. We also validate our framework using a protected attribute classifier to confirm the effectiveness of our counterfactual generation. This work provides a scalable tool for fairness auditing of commercial AI hiring platforms, especially in black-box settings where training data and model internals are inaccessible. Our results highlight the importance of counterfactual approaches in improving ethical transparency in affective computing.
Multimedia
Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art NeurIPS 2025
Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embody the ocean's perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem.
comment: (to appear) NeurIPS 2025 Creative AI Track
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation NeurIPS 2025
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
comment: NeurIPS 2025
MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions AAAI 2025
We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
comment: Paper accepted in Artificial Intelligence for Music workshop at AAAI 2025
Recent Advances in Discrete Speech Tokens: A Review
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
comment: 26 pages, 8 figures, 3 tables. This version is a major revision of the previous one, including reorganization of the section structure, more experimental results, and extensive revisions to both text and figures
Computation and Language
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
comment: Project Page: https://think-while-gen.github.io Code: https://github.com/ZiyuGuo99/Thinking-while-Generating
Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate these approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
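The abstract above reports both absolute and relative gains in mAP@5 and nDCG@5. As a rough illustration of how such numbers are typically computed, here is a minimal Python sketch. It is not the authors' evaluation code: the function names are hypothetical, relevance is assumed to be binary, AP@k is normalized by the number of relevant items found in the top k (one of several conventions), and the ideal ranking for nDCG is assumed to be derivable from the list passed in.

```python
import math

def average_precision_at_k(relevance, k=5):
    """AP@k for one query, given binary relevance flags in ranked order.

    Assumption: normalized by the number of relevant items found in the top k.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def ndcg_at_k(relevance, k=5):
    """nDCG@k with a log2 rank discount.

    Assumption: `relevance` lists every judged item, so sorting it gives the ideal order.
    """
    def dcg(rels):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over a non-zero `baseline`."""
    return (improved - baseline) / baseline

# Hypothetical numbers: a 0.10 absolute gain over a 0.50 baseline is a 20% relative gain.
absolute = 0.60 - 0.50
relative = relative_gain(0.50, 0.60)
```

The same absolute gain maps to a larger relative gain when the baseline is lower, which is why the two figures are reported separately.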
Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time by 2.3x, showcasing its scalability and efficiency.
comment: To be presented at ASRU 2025
SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
comment: 20 pages
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
comment: Project page: https://xuboshen.github.io/TimeViper
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies AAAI 2026
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
comment: Accepted to AAAI 2026
Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation
Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
comment: 16 pages
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
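To make the two quantities in the abstract above concrete, the following Python sketch computes a word-level WER and Cohen's kappa from scratch. It is an illustrative sketch only, not the paper's pipeline: the function names, the example utterances, and the toy label lists are all invented here, and tokenization is assumed to be plain whitespace splitting.

```python
from collections import Counter

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # reference word deleted
                curr[j - 1] + 1,                       # hypothesis word inserted
                prev[j - 1] + (ref_word != hyp_word),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)

def cohens_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Invented example: a single substituted word gives a low WER but reverses the meaning.
wer = word_error_rate("the patient denies chest pain", "the patient has chest pain")

# Invented toy labels using the three impact categories named in the abstract.
clinician = ["No", "No", "Minimal", "Significant"]
judge = ["No", "Minimal", "Minimal", "Significant"]
kappa = cohens_kappa(clinician, judge)
```

In the invented utterance pair, one substitution out of five reference words yields a WER of 0.2, even though the transcript now states the opposite of what was said; this is the kind of mismatch between textual and clinical error that the abstract describes.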
The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
comment: 11 pages, 3 figures
Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks
Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
comment: 13 pages, 5 figures
TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
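The abstract above contrasts dense encoders with late-interaction models that keep token-level representations. A minimal Python sketch of ColBERT-style MaxSim scoring follows, to show what "late interaction" means in practice. The abstract does not spell out the scoring function, so this uses the standard ColBERT operator as an assumption; the function names and the tiny two-dimensional vectors are made up for illustration and are unrelated to the released checkpoints or to PyLate's API.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best dot-product
    match among the document tokens, then sum those maxima."""
    if not query_vecs or not doc_vecs:
        raise ValueError("query and document need at least one token vector each")
    return sum(
        max(sum(q_i * d_i for q_i, d_i in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )

def rank_documents(query_vecs, docs):
    """Return (doc_id, score) pairs sorted from best to worst match."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical): two query tokens and two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # has a close match for each query token
    "doc_b": [[0.5, 0.5]],              # a single averaged-looking token vector
}
ranking = rank_documents(query, docs)
```

In this toy case `doc_a` outranks `doc_b` because each query token finds its own best-matching document token, which a single pooled vector cannot provide.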
MiMo-Embodied: X-Embodied Foundation Model Technical Report
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
comment: Code: https://github.com/XiaomiMiMo/MiMo-Embodied Model: https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B
Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation
Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators. This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights, focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.
comment: Under review with the ACM Transactions on Recommender Systems (TORS)
Arctic-Extract Technical Report
Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighing only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
Anatomy of an Idiom: Tracing Non-Compositionality in Language Models
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports
We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.
comment: Workshop paper accepted at AI4DF 2025 (part of ACM ICAIF 2025). 3 pages including tables and figures
TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models AAAI 2026
Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) a need for additional training resources on the client or server side. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
comment: Accepted by AAAI 2026
Classification of worldwide news articles by perceived quality, 2018-2024
This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8\% ROUGE-N F1 compared to Trafilatura's 63.6\%, with exceptional structured element preservation (90.9\% for code blocks, 94.0\% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8\% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies AACL 2025
Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
comment: Long paper accepted to the main conference of AACL 2025. Please cite the conference proceedings when available
NLP Datasets for Idiom and Figurative Language Tasks
Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets was used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model-agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.
comment: 32 pages, 10 figures
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement AAAI 2026
Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over the internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce a self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
comment: Accepted to AAAI 2026
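The selective rewriting step described above can be sketched as a simple filter over rollout groups. This is a minimal sketch that reads "consistent correctness" as "every sampled rollout for the prompt was correct", which is an assumption; the function name and the dictionary layout are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def select_simple_prompts(rollout_correct: Dict[str, List[bool]]) -> List[str]:
    """Return ids of prompts whose sampled rollouts were all correct.

    rollout_correct maps a prompt id to one correctness flag per rollout.
    Prompts with no rollouts are skipped rather than treated as "simple".
    """
    return [
        prompt_id
        for prompt_id, flags in rollout_correct.items()
        if flags and all(flags)
    ]


# Toy batch (hypothetical): three prompts with four rollouts each.
batch = {
    "p1": [True, True, True, True],     # always solved: candidate for rewriting
    "p2": [True, False, True, True],    # mixed outcomes: left untouched
    "p3": [False, False, False, False], # never solved: left untouched
}
simple = select_simple_prompts(batch)
```

One reason such a filter plausibly leaves the original reward signal intact: with group-normalised advantages, as in GRPO, a group whose rollouts all receive the same correctness reward has zero advantage for every rollout, so those samples contribute no correctness gradient in the first place.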
SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
comment: 14 pages of main text and 10 pages of appendices
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to `read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization
System prompts are critical for guiding the behavior of Large Language Models (LLMs), yet they often contain proprietary logic or sensitive information, making them a prime target for extraction attacks. Adversarial queries can successfully elicit these hidden instructions, posing significant security and privacy risks. Existing defense mechanisms frequently rely on heuristics, incur substantial computational overhead, or are inapplicable to models accessed via black-box APIs. This paper introduces a novel framework for hardening system prompts through shield appending, a lightweight approach that adds a protective textual layer to the original prompt. Our core contribution is the formalization of prompt hardening as a utility-constrained optimization problem. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks, while simultaneously preserving task utility above a specified threshold, measured by semantic fidelity to baseline outputs. This black-box, optimization-driven methodology is lightweight and practical, requiring only API access to the target and optimizer LLMs. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks, outperforming established baseline defenses without compromising the model's intended functionality. Our work presents a paradigm for developing robust, utility-aware defenses in the escalating landscape of LLM security. The code is made public on the following link: https://github.com/psm-defense/psm
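The utility-constrained formulation mentioned in the abstract above can be written schematically as follows. The notation is an assumption introduced here for illustration: p is the original system prompt, s a candidate shield drawn from a search space S, the circled plus denotes appending the shield to the prompt, L is the leakage metric measured over the suite of adversarial attacks, U is task utility measured as semantic fidelity to baseline outputs, and tau is the required utility threshold.

```latex
\[
s^{*} \;=\; \arg\min_{s \in \mathcal{S}} \; L(p \oplus s)
\quad \text{subject to} \quad
U(p \oplus s) \;\ge\; \tau
\]
```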
SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning
Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.
comment: 21 pages, 4 figures
TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating
In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
comment: 11 pages, 3 figures
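To make the idea of a selection function S over position indices concrete, here is a minimal NumPy sketch in which a LoRA-style low-rank update is applied only at positions where the update's norm exceeds a threshold. The choice of LoRA as the PEFT method, the norm-based gate, and all names are assumptions for illustration, not the paper's definition of S. A hard threshold like this is not differentiable, so learning the threshold would need some relaxation that the sketch omits.

```python
import numpy as np


def token_selective_update(x, W, A, B, tau):
    """Apply a low-rank update only at selected sequence positions.

    x:   (seq_len, d_in) token representations
    W:   (d_in, d_out)   frozen pretrained weight
    A:   (d_in, r), B: (r, d_out) low-rank adapter factors
    tau: scalar threshold on the per-position update norm (hypothetical gate)
    """
    base = x @ W                                   # frozen path, all positions
    delta = x @ A @ B                              # adapter path, all positions
    gate = np.linalg.norm(delta, axis=1) > tau     # (seq_len,) True where update is kept
    return base + gate[:, None] * delta, gate


# Toy shapes (hypothetical): 5 tokens, 6 -> 4 projection, rank 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6))
W = rng.normal(size=(6, 4))
A = rng.normal(size=(6, 2))
B = rng.normal(size=(2, 4))

out_none, gate_none = token_selective_update(x, W, A, B, tau=np.inf)  # no position selected
out_all, gate_all = token_selective_update(x, W, A, B, tau=-1.0)      # every position selected
```

With the gate fully open the layer reduces to ordinary PEFT applied at all indices, and with it fully closed the layer reduces to the frozen model, so the threshold interpolates between the two regimes the abstract contrasts.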
ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models
The remarkable performance of Large Language Models (LLMs) relies heavily on carefully crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, achieve efficient and accurate prompt optimization to some extent. However, these studies focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO employs a voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.
Early science acceleration experiments with GPT-5
AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.
comment: 89 pages
Learning Tractable Distributions Of Language Model Continuations
Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
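The two efficiency ideas in the abstract above (scoring all next-token candidates in one batched HMM update, and reusing future probabilities because only the latent-state prior depends on the prefix) can be illustrated with a toy HMM. This is a minimal sketch under stated assumptions: the constraint is "a banned token does not appear in the next few emissions", the state prior is passed in as a plain vector rather than predicted from LM hidden states, and all names are hypothetical. It is not the LTLA implementation.

```python
import numpy as np


def avoid_token_lookahead(A, B, banned, horizon):
    """beta[z] = P(no `banned` token in the next `horizon` emissions | current state z).

    A: (K, K) transition matrix, rows sum to 1
    B: (K, V) emission matrix, rows sum to 1
    Depends only on the fixed HMM, so it can be computed once and reused.
    """
    beta = np.ones(A.shape[0])
    keep = 1.0 - B[:, banned]          # P(emitted token != banned | state)
    for _ in range(horizon):
        beta = A @ (keep * beta)       # one backward step through the HMM
    return beta


def score_all_next_tokens(state_prior, A, B, beta):
    """P(next token = v, constraint holds afterwards | prefix) for every v at once."""
    next_state = state_prior @ A       # (K,) state that emits the next token
    joint = next_state[:, None] * B    # (K, V) joint over (state, next token)
    return beta @ joint                # (V,) one batched update, no vocabulary loop


# Toy HMM (hypothetical): 2 latent states, 3 tokens, token 2 is banned.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5, 0.0], [0.1, 0.1, 0.8]])
prior = np.array([0.6, 0.4])           # stand-in for a prefix-dependent state prior

beta = avoid_token_lookahead(A, B, banned=2, horizon=3)
scores = score_all_next_tokens(prior, A, B, beta)
marginal = (prior @ A) @ B             # unconstrained next-token distribution
```

Only `prior` changes from prefix to prefix in this sketch; `beta` is a property of the fixed surrogate, which is the kind of reuse the abstract describes when it keeps the surrogate decoder fixed.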
Liars' Bench: Evaluating Lie Detectors for Language Models
Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.
comment: *Kieron Kretschmar and Walter Laurito contributed equally to this work. 10 pages, 2 figures; plus appendix. Code at https://github.com/Cadenza-Labs/liars-bench and datasets at https://huggingface.co/datasets/Cadenza-Labs/liars-bench Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
SpellForger: Prompting Custom Spell Properties In-Game using BERT supervised-trained model
Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervised-trained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.
comment: Published in Anais Estendidos do XXIV Simpósio Brasileiro de Jogos e Entretenimento Digital (SBGames 2025)
QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation
We present QueryGym, a lightweight, extensible Python toolkit that supports large language model (LLM)-based query reformulation. Such a toolkit is timely because recent work on LLM-based query reformulation has shown notable increases in retrieval effectiveness. However, while different authors have sporadically shared implementations of their methods, there is no unified toolkit that provides a consistent implementation of such methods, which hinders fair comparison, rapid experimentation, consistent benchmarking, and reliable deployment. QueryGym addresses this gap by providing a unified framework for implementing, executing, and comparing LLM-based reformulation methods. The toolkit offers: (1) a Python API for applying diverse LLM-based methods, (2) a retrieval-agnostic interface supporting integration with backends such as Pyserini and PyTerrier, (3) a centralized prompt management system with versioning and metadata tracking, (4) built-in support for benchmarks like BEIR and MS MARCO, and (5) a completely open-source, extensible implementation available to all researchers. QueryGym is publicly available at https://github.com/radinhamidi/QueryGym.
comment: 4 pages
CARE-RAG - Clinical Assessment and Reasoning in RAG
Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
comment: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance
TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.
JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparably to or even better than its larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
comment: 23 pages, 4 figures
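The Elo-based rating mentioned in the JudgeBoard abstract can be illustrated with the standard Elo update. This is only a sketch: the abstract does not say how head-to-head outcomes between judges are defined, so the pairing rule (judge A "wins" when its verdict is correct and judge B's is not), the function names, and the K-factor of 32 are all assumptions for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    The K-factor k is an assumed value, not taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's update mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

With two judges starting from the same rating, a single win moves each rating by k/2 in opposite directions, and the sum of the two ratings stays constant.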
ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers
Large language models (LLMs) frequently generate responses that are lengthy and verbose, filled with redundant or unnecessary details. This diminishes clarity and user satisfaction, and it increases costs for model developers, especially with well-known proprietary models that charge based on the number of output tokens. In this paper, we introduce a novel reference-free metric for evaluating the conciseness of responses generated by LLMs. Our method quantifies non-essential content without relying on gold standard references and averages three calculations: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. Experimental results demonstrate that our proposed metric identifies redundancy in LLM outputs, offering a practical tool for automated evaluation of response brevity in conversational AI systems without the need for ground truth human annotations.
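The three-way average described in the ConCISE abstract could be sketched roughly as follows. The abstract does not give the exact formulas, so several things here are assumptions: tokens are approximated by whitespace splitting, each component is the fraction of the original length that survives shortening, and a higher score is read as "more concise". The three shortened texts would come from LLM calls, which are deliberately left outside this sketch; all function names are hypothetical.

```python
def token_count(text: str) -> int:
    """Whitespace token count (a stand-in for a real tokenizer)."""
    return len(text.split())


def retained_fraction(original: str, shortened: str) -> float:
    """Fraction of the original length kept after shortening, capped at 1.0."""
    n = token_count(original)
    if n == 0:
        raise ValueError("original response is empty")
    return min(1.0, token_count(shortened) / n)


def conciseness_score(original: str, abstractive: str, extractive: str, pruned: str) -> float:
    """Average of three retained fractions.

    abstractive: LLM abstractive summary of the response
    extractive:  LLM extractive summary of the response
    pruned:      response after an LLM removed non-essential words
    """
    parts = [retained_fraction(original, s) for s in (abstractive, extractive, pruned)]
    return sum(parts) / len(parts)
```

Under these assumptions, a response that can barely be shortened scores near 1.0, while a heavily padded one scores lower.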
Fantastic Bugs and Where to Find Them in AI Benchmarks
Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
Cognitive BASIC: An In-Model Interpreted Reasoning Language for LLMs
Cognitive BASIC is a minimal, BASIC-style prompting language and in-model interpreter that structures large language model (LLM) reasoning into explicit, stepwise execution traces. Inspired by the simplicity of retro BASIC, we repurpose numbered lines and simple commands as an interpretable cognitive control layer. Modern LLMs can reliably simulate such short programs, enabling transparent multi-step reasoning inside the model. A natural-language interpreter file specifies command semantics, memory updates, and logging behavior. Our mental-model interpreter extracts declarative and procedural knowledge, detects contradictions, and produces resolutions when necessary. A comparison across three LLMs on a benchmark of knowledge extraction, conflict detection, and reasoning tasks shows that all models can execute Cognitive BASIC programs, with overall strong but not uniform performance.
comment: 6 pages, Submitted to ESANN 2026
The Shifting Landscape of Vaccine Discourse: Insights From a Decade of Pre- to Post-COVID-19 Vaccine Posts on Social Media
In this work, we study English-language vaccine discourse in social media posts, specifically posts on X (formerly Twitter), in the seven years before the COVID-19 outbreak (2013 to 2019) and the three years after the outbreak was first reported (2020 to 2022). Drawing on theories from social cognition and the stereotype content model in social psychology, we analyze how English speakers talk about vaccines on social media to understand the evolving narrative around vaccines. To do that, we first introduce a novel dataset comprising 18.7 million curated posts on vaccine discourse from 2013 to 2022. This extensive collection, filtered down from an initial 129 million posts through rigorous preprocessing, captures both pre-COVID and COVID-19 periods, offering valuable insights into the evolution of English-speaking X users' perceptions related to vaccines. Our analysis shows that the COVID-19 pandemic led to complex shifts in X users' sentiment and discourse around vaccines. We observe that negative emotion word usage decreased during the pandemic, with notable rises in the usage of surprise- and trust-related emotion words. Furthermore, vaccine-related language tended to use more warmth-focused words associated with trustworthiness, along with positive, competence-focused words during the early days of the pandemic, with a marked rise in negative word usage towards the end of the pandemic, possibly reflecting growing vaccine hesitancy and skepticism.
PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models
Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt, dilutes the influence of trigger tokens, and thereby achieves enhanced robustness. Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with any existing defense, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.
Interpretable dimensions support an effect of agentivity and telicity on split intransitivity
Intransitive verbs fall into two different syntactic classes, unergatives and unaccusatives. It has long been argued that verbs describing an agentive action are more likely to appear in an unergative syntax, and those describing a telic event to appear in an unaccusative syntax. However, recent work by Kim et al. (2024) found that human ratings for agentivity and telicity were a poor predictor of the syntactic behavior of intransitives. Here we revisit this question using interpretable dimensions, computed from seed words on opposite poles of the agentive and telic scales. Our findings support the link between unergativity/unaccusativity and agentivity/telicity, and demonstrate that using interpretable dimensions in conjunction with human judgments can offer valuable evidence for semantic properties that are not easily evaluated in rating tasks.
From Representation to Enactment: The ABC Framework of the Translating Mind
Building on the Extended Mind (EM) theory and radical enactivism, this article suggests an alternative to representation-based models of the mind. We lay out a novel ABC framework of the translating mind, in which translation is not the manipulation of static interlingual correspondences but an enacted activity, dynamically integrating affective, behavioral, and cognitive (ABC) processes. Drawing on Predictive Processing and (En)Active Inference, we argue that the translator's mind emerges, rather than being merely extended, through loops of brain-body-environment interactions. This non-representational account reframes translation as skillful participation in sociocultural practice, where meaning is co-created in real time through embodied interaction with texts, tools, and contexts.
NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation
This paper presents JGU Mainz's winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent-based pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. We also make our code public.
comment: BLP 2025 Shared Task 2 - Code Generation in Bangla
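The generate, test, and debug loop described in the NALA_MAINZ abstract might look roughly like the sketch below. It is not the authors' code: `generate` and `debug` are hypothetical callables standing in for the two LLM agents, tests are assumed to be strings of assert statements, and the `max_rounds` limit is an assumption. Running model-generated code with `exec` is unsafe outside a sandbox; it is used here only to keep the sketch short.

```python
import traceback


def run_tests(program: str, tests: list[str]) -> list[tuple[str, str]]:
    """Run each assert-based test against the program; return (test, error trace) for failures."""
    failures = []
    for test in tests:
        namespace: dict = {}
        try:
            exec(program, namespace)  # define the candidate solution
            exec(test, namespace)     # run one assert-style test against it
        except Exception:
            failures.append((test, traceback.format_exc()))
    return failures


def solve(instruction: str, tests: list[str], generate, debug, max_rounds: int = 3) -> str:
    """Generate a program, then hand only the failing tests and traces to the debugger."""
    program = generate(instruction)
    for _ in range(max_rounds):
        failures = run_tests(program, tests)
        if not failures:
            break  # every test passes, so no debugging round is needed
        program = debug(instruction, program, failures)
    return program
```

Passing only the failing cases keeps the debugger's context focused on the error messages and tests that matter for the revision.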
LLMInit: A Free Lunch from Large Language Models for Selective Initialization of Recommendation EMNLP 2025
Collaborative filtering (CF) is widely adopted in industrial recommender systems (RecSys) for modeling user-item interactions across numerous applications, but often struggles with cold-start and data-sparse scenarios. Recent advancements in pre-trained large language models (LLMs) with rich semantic knowledge offer promising solutions to these challenges. However, deploying LLMs at scale is hindered by their significant computational demands and latency. In this paper, we propose a novel and scalable LLM-RecSys framework, LLMInit, designed to integrate pre-trained LLM embeddings into CF models through selective initialization strategies. Specifically, we identify the embedding collapse issue observed when CF models scale to match the large embedding sizes of LLMs, and we avoid the problem by introducing efficient sampling methods, including random, uniform, and variance-based selection. Comprehensive experiments conducted on multiple real-world datasets demonstrate that LLMInit significantly improves recommendation performance while maintaining low computational costs, offering a practical and scalable solution for industrial applications. To facilitate industry adoption and promote future research, we provide open-source access to our implementation at https://github.com/DavidZWZ/LLMInit.
comment: Accepted in EMNLP 2025 Industry Track
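One plausible reading of the variance-based selection named in the LLMInit abstract is to keep only the embedding dimensions that vary most across items and use them to initialize a smaller CF embedding table. The sketch below follows that reading; the "highest variance" criterion, the function names, and the list-of-lists layout are assumptions, not details confirmed by the abstract.

```python
from statistics import pvariance


def select_dims_by_variance(embeddings: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k dimensions with the highest variance across items."""
    num_dims = len(embeddings[0])
    variances = [pvariance([row[d] for row in embeddings]) for d in range(num_dims)]
    top = sorted(range(num_dims), key=lambda d: variances[d], reverse=True)[:k]
    return sorted(top)  # keep the original dimension order


def project(embeddings: list[list[float]], dims: list[int]) -> list[list[float]]:
    """Keep only the selected dimensions of every item embedding."""
    return [[row[d] for d in dims] for row in embeddings]
```

The projected vectors would then serve as initial values for the CF model's item embeddings, which are trained as usual afterwards.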
Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs ICLR 2025
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. Popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures, which lead to incoherent or repetitive outputs. We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by using the top token's probability as a scaling factor. Our experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing show that min-p sampling improves both the quality and diversity of generated text across different model families (Mistral and Llama 3) and model sizes (1B to 123B parameters), especially at higher temperatures. Human evaluations further show a clear preference for min-p sampling, in both text quality and creativity. Min-p sampling has been adopted by popular open-source LLM frameworks, including Hugging Face Transformers, vLLM, and many others, highlighting its considerable impact on improving text generation quality.
comment: Oral presentation at ICLR 2025. Camera-ready version available at https://iclr.cc/virtual/2025/poster/30358
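The truncation rule described in the min-p abstract, a threshold scaled by the top token's probability, can be sketched in a few lines. This is a minimal illustration rather than any framework's implementation: it operates on an already temperature-scaled probability list, and the names `min_p_filter`, `sample_min_p`, and `p_base` are chosen here for illustration.

```python
import random


def min_p_filter(probs: list[float], p_base: float = 0.1) -> list[float]:
    """Zero out tokens below p_base * max(probs), then renormalize the rest."""
    threshold = p_base * max(probs)  # threshold scales with the model's confidence
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token is kept
    return [p / total for p in kept]


def sample_min_p(probs: list[float], p_base: float = 0.1, rng=random) -> int:
    """Sample a token index from the min-p truncated distribution."""
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
```

When the model is confident (a high top probability), the threshold is high and few tokens survive; when the distribution is flat, the threshold drops and more candidates remain, which is what lets the method stay coherent at higher temperatures.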
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly cover modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed into 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
comment: 39 pages, 6 figures, 6 tables
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
comment: Withdrawn due to identified errors in the experimental procedure
AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search AAAI-2026
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical MCTS strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34% over both existing automated agent search methods and manually designed agents. Our framework serves as a launchpad for researchers to rapidly discover powerful agent architectures.
comment: AAAI-2026
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference ICML25
KV cache quantization can improve large language model (LLM) inference throughput and latency in long-context and large-batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why the key cache is generally more important than the value cache for quantization error reduction. We further propose a simple yet effective framework, KVTuner, to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
comment: Accepted by ICML25. Code: https://github.com/cmd2001/KVTuner
Crowdsourcing Lexical Diversity
Lexical-semantic resources (LSRs), such as online lexicons and wordnets, are fundamental to natural language processing applications as well as to fields such as linguistic anthropology and language preservation. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, but also the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, the presence of foreign (Anglo-Saxon) concepts, as well as in the lack of an explicit indication of untranslatability, also known as cross-lingual lexical gaps, when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing platform facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.
Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement
Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across the core dimensions of coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.
comment: Preprint
Eliciting Reasoning in Language Models with Cognitive Tools
The recent advent of reasoning models like OpenAI's o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our "cognitive tools" to GPT-4.1 increases its pass@1 performance on AIME2024 from 32% to 53%, even surpassing the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.
comment: 25 pages, 2 figures
AutoJudge: Judge Decoding Without Manual Annotation NeurIPS 2025
We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to $\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$ drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
comment: Accepted at NeurIPS 2025
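The acceptance rule described in the AutoJudge abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: it uses greedy token comparison, and `accept_draft_tokens` and the `is_important` callback (a stand-in for the trained classifier) are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence


def accept_draft_tokens(
    draft: Sequence[int],
    target: Sequence[int],
    is_important: Callable[[int, int, int], bool],
) -> List[int]:
    """Return the tokens kept from one speculation cycle (hypothetical helper).

    draft[i]  -- token proposed by the draft model at position i
    target[i] -- token the target model prefers at position i, given draft[:i]
    is_important(i, draft_tok, target_tok) -- stand-in for the learned classifier
    """
    accepted: List[int] = []
    for i, (d, t) in enumerate(zip(draft, target)):
        if d == t:
            accepted.append(d)  # exact match: always accepted
        elif not is_important(i, d, t):
            accepted.append(d)  # mismatch judged harmless: keep the draft token
        else:
            accepted.append(t)  # important mismatch: take the target token and stop
            break
    return accepted
```

Standard (lossless) speculative decoding corresponds to an `is_important` that always returns True. Relaxing it lets more draft tokens survive each cycle, which is where the reported speedup would come from.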
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrates robustness to black-box attacks in the universal setting.
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation NeurIPS 2025
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
comment: NeurIPS 2025
CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples EMNLP 2025
Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
comment: Accepted at EMNLP 2025
TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
Multimodal Evaluation of Russian-language Architectures
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
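For readers unfamiliar with GRPO, the group-relative step it is named after can be sketched as follows. This is a minimal illustration of the commonly used advantage normalization, not VisPlay's training code: the function name is hypothetical, the population standard deviation is an assumption (implementations differ), and the diversity and difficulty rewards mentioned in the abstract are assumed to be already folded into each scalar reward.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    Each reward belongs to one response sampled for the same prompt. The
    advantage of a response is its reward relative to the group average,
    so no separate value network is needed.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    # Population standard deviation (an assumption; some implementations differ).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward designs that spread scores within a group matter.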
HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs to improve performance on diverse tasks, but usually require labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks by a margin of up to 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
From Confidence to Collapse in LLM Factual Robustness
Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
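The two ingredients named in the abstract, token-distribution entropy and temperature scaling, can be illustrated with a short sketch. It deliberately does not compute the FRS itself, since the abstract does not give the formula; the function names and the example logits are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_profile(logits: Sequence[float], temperatures: Sequence[float]) -> List[float]:
    """Entropy of the next-token distribution at each decoding temperature."""
    return [entropy(softmax_with_temperature(logits, t)) for t in temperatures]
```

For non-uniform logits the entropy grows with temperature, so a fact whose top token keeps its lead as the temperature rises would count as more stable under this kind of analysis.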
CRISP: Persistent Concept Unlearning via Sparse Autoencoders
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
comment: 18 pages, 5 figures
Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.
GPTopic: Dynamic and Interactive Topic Representations
Topic modeling seems to be almost synonymous with generating lists of top words to represent topics within large text corpora. However, deducing a topic from such a list of individual terms can require substantial expertise and experience, making topic modeling less accessible to people unfamiliar with the particularities and pitfalls of top-word interpretation. A topic representation limited to top words might further fall short of offering a comprehensive and easily accessible characterization of the various aspects, facets and nuances a topic might have. To address these challenges, we introduce GPTopic, a software package that leverages Large Language Models (LLMs) to create dynamic, interactive topic representations. GPTopic provides an intuitive chat interface for users to explore, analyze, and refine topics interactively, making topic modeling more accessible and comprehensive. The corresponding code is available here: https://github.com/ArikReuter/TopicGPT.
ACEBench: Who Wins the Match Point in Tool Usage?
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
Atomic Calibration of LLMs in Long-Form Generations ACL 2025
Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, as an effective indicator of hallucination, is thus essential to enhance the trustworthiness of LLMs. Prior work mainly focuses on short-form tasks using a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims. In this work, we systematically study atomic calibration, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation. More importantly, atomic calibration uncovers insightful patterns regarding the alignment of confidence methods and the changes of confidence throughout generation. This sheds light on future research directions for confidence estimation in long-form generation.
comment: ACL 2025 KnowFM Oral / AACL-IJCNLP 2025
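To make the idea of calibration at the atomic level concrete, the sketch below computes a standard expected calibration error (ECE) over individual claims. It is an illustration only: the abstract does not state which calibration metric the paper uses, and the function name, equal-width binning, and input layout (one confidence and one 0/1 factuality label per atomic claim) are assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    labels: Sequence[int],
    n_bins: int = 10,
) -> float:
    """ECE over atomic claims with equal-width confidence bins.

    confidences[i] -- confidence in [0, 1] assigned to atomic claim i
    labels[i]      -- 1 if claim i is factually correct, else 0
    """
    n = len(confidences)
    if n == 0 or n != len(labels):
        raise ValueError("need equally sized, non-empty inputs")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A response-level score would assign one confidence to the whole answer. Scoring each claim separately is what lets a long answer with mixed accurate and inaccurate statements register as miscalibrated.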
An Iterative Question-Guided Framework for Knowledge Base Question Answering ACL 2025
Large Language Models (LLMs) excel in many natural language processing tasks but often exhibit factual inconsistencies in knowledge-intensive settings. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1) maintaining coherent reasoning paths, and (2) avoiding prematurely discarding critical multi-hop connections. To tackle these challenges, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
comment: Accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Main Track
MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory
Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
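The question-selection idea can be illustrated with classical item response theory. The sketch below uses a unidimensional two-parameter logistic (2PL) model and picks the unasked item with maximal Fisher information at the current trait estimate. This is a deliberate simplification: MAQuA is described as multidimensional and operates on language responses, and the function names and item parameters here are hypothetical.

```python
import math
from typing import Optional, Sequence, Set, Tuple


def item_information(a: float, b: float, theta: float) -> float:
    """Fisher information of a 2PL item at trait level theta.

    a -- discrimination, b -- difficulty; I(theta) = a^2 * P * (1 - P).
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_next_item(
    items: Sequence[Tuple[float, float]],
    theta: float,
    asked: Set[int],
) -> Optional[int]:
    """Index of the most informative unasked item, or None if all were asked."""
    best_idx: Optional[int] = None
    best_info = -1.0
    for idx, (a, b) in enumerate(items):
        if idx in asked:
            continue
        info = item_information(a, b, theta)
        if info > best_info:
            best_idx, best_info = idx, info
    return best_idx
```

Asking the most informative item first is what allows an adaptive screener to stabilize a score with fewer questions than random ordering.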
MajinBook: An open catalogue of digital world literature with likes
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
comment: 9 pages, 5 figures, 1 table
Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions
In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi-dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from "hallucination" issues when processing structured knowledge and are difficult to update in real-time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi-dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis, providing new ideas and tools for multi-dimensional data analysis.
comment: 14 pages, 7 figures, 40 references
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
comment: 39 pages
Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models
The rapid adoption of Small Language Models (SLMs) for resource-constrained applications has outpaced our understanding of their ethical and fairness implications. To address this gap, we introduce the Vacuous Neutrality Framework (VaNeu), a multi-dimensional evaluation paradigm designed to assess SLM fairness prior to deployment. The framework examines model robustness across four stages (biases, utility, ambiguity handling, and positional bias) over diverse social bias categories. To the best of our knowledge, this work presents the first large-scale audit of SLMs in the 0.5-5B parameter range, an overlooked "middle tier" between BERT-class encoders and flagship LLMs. We evaluate nine widely used SLMs spanning four model families under both ambiguous and disambiguated contexts. Our findings show that models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. These results underscore the need for a more comprehensive understanding of fairness and reliability in SLMs, and position the proposed framework as a principled tool for responsible deployment in socially sensitive settings.
Property-guided Inverse Design of Metal-Organic Frameworks Using Quantum Natural Language Processing
In this study, we explore the potential of using quantum natural language processing (QNLP) to inverse design metal-organic frameworks (MOFs) with targeted properties. Specifically, by analyzing 450 hypothetical MOF structures consisting of 3 topologies, 10 metal nodes and 15 organic ligands, we categorize these structures into four distinct classes for pore volume and CO$_2$ Henry's constant values. We then compare various QNLP models (i.e., the bag-of-words, DisCoCat (Distributional Compositional Categorical), and sequence-based models) to identify the most effective approach to process the MOF dataset. Using a classical simulator provided by IBM Qiskit, we identify the bag-of-words model as the best-performing model, achieving validation accuracies of 88.6% and 78.0% for binary classification tasks on pore volume and CO$_2$ Henry's constant, respectively. Further, we developed multi-class classification models tailored to the probabilistic nature of quantum circuits, with average test accuracies of 92% and 80% across different classes for the pore volume and CO$_2$ Henry's constant datasets, respectively. Finally, generating MOFs with target properties achieved accuracies of 93.5% for pore volume and 87% for CO$_2$ Henry's constant. Although our investigation covers only a fraction of the vast MOF search space, it marks a promising first step towards using quantum computing for materials design, offering a new perspective through which to explore the complex landscape of MOFs.
comment: 46 pages, 7 figures, 6 supplementary figures, 1 table, 2 supplementary tables, 1 supplementary note
OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.
comment: 12 pages, 4 figures, 4 tables
LLMs as Models for Analogical Reasoning
Analogical reasoning -- the capacity to identify and map structural relationships between different domains -- is fundamental to human cognition and learning. Recent studies have shown that large language models (LLMs) can sometimes match humans in analogical reasoning tasks, opening the possibility that analogical reasoning might emerge from domain-general processes. However, it is still debated whether these emergent capacities are largely superficial and limited to simple relations seen during training or whether they encompass the flexible representational and mapping capabilities which are the focus of leading cognitive models of analogy. In this study, we introduce novel analogical reasoning tasks that require participants to map between semantically contentful words and sequences of letters and other abstract characters. This task necessitates the ability to flexibly re-represent rich semantic information -- an ability which is known to be central to human analogy but which is thus far not well captured by existing cognitive theories and models. We assess the performance of both human participants and LLMs on tasks focusing on reasoning from semantic structure and semantic content, introducing variations that test the robustness of their analogical inferences. Advanced LLMs match human performance across several conditions, though humans and LLMs respond differently to certain task variations and semantic distractors. Our results thus provide new evidence that LLMs might offer a how-possibly explanation of human analogical reasoning in contexts that are not yet well modeled by existing theories, but that even today's best models are unlikely to yield how-actually explanations.
comment: The title has been changed from Semantic Structure-Mapping in LLM and Human Analogical Reasoning to LLMs as Models for Analogical Reasoning to improve clarity and accuracy
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering
Recent studies have explored the use of Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) for Knowledge Graph Question Answering (KGQA). They typically require rewriting retrieved subgraphs into natural language formats comprehensible to LLMs. However, when tackling complex questions, the knowledge rewritten by existing methods may include irrelevant information, omit crucial details, or fail to align with the question's semantics. To address these issues, we propose a novel rewriting method CoTKR, Chain-of-Thought Enhanced Knowledge Rewriting, for generating reasoning traces and corresponding knowledge in an interleaved manner, thereby mitigating the limitations of single-step knowledge rewriting. Additionally, to bridge the preference gap between the knowledge rewriter and the question answering (QA) model, we propose a training strategy PAQAF, Preference Alignment from Question Answering Feedback, for leveraging feedback from the QA model to further optimize the knowledge rewriter. We conduct experiments using various LLMs across several KGQA benchmarks. Experimental results demonstrate that, compared with previous knowledge rewriting methods, CoTKR generates the most beneficial knowledge representation for QA models, which significantly improves the performance of LLMs in KGQA.
Interpreting the Effects of Quantization on LLMs AACL 2025
Quantization offers a practical solution to deploy LLMs in resource-constrained environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that the effect of quantization may vary by model and task; however, we did not observe any drastic change that would discourage the use of quantization as a reliable model compression technique.
comment: Accepted to AACL 2025 Main
Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning
Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs, prior works train router models or deferral mechanisms that allocate easy queries to a small, efficient model, while forwarding harder queries to larger, more expensive models. However, these trained router models often lack robustness under domain shifts and require expensive data synthesis techniques such as Monte Carlo rollouts to obtain sufficient ground-truth routing labels for training. In this work, we propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning (STEER), a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger LLMs without utilizing external models. STEER leverages confidence scores from the smaller model's logits prior to generating a reasoning step, so that the large model is invoked only when necessary. Extensive evaluations using different LLMs on a diverse set of challenging benchmarks across multiple domains such as Mathematical Reasoning, Multi-Hop QA, and Planning tasks indicate that STEER achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% fewer FLOPs compared to solely using the larger model on AIME), outperforming baselines that rely on trained external modules. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing, offering a scalable pathway for efficient LLM deployment.
comment: 7 pages, 5 figures
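To make the idea of step-level, confidence-guided routing concrete, here is a minimal sketch. It assumes the confidence score is the maximum softmax probability of the small model's logits and that a fixed threshold decides when to defer; both choices, the threshold value, and the stub model callables are hypothetical and are not taken from the STEER paper.

```python
import math
from typing import Callable, List, Sequence


def max_softmax_confidence(logits: Sequence[float]) -> float:
    """Maximum softmax probability, computed stably by subtracting the largest logit."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return max(exps) / sum(exps)


def route_steps(
    step_logits: Sequence[Sequence[float]],
    small_step: Callable[[int], str],
    large_step: Callable[[int], str],
    threshold: float = 0.7,  # assumed value, chosen only for this toy example
) -> List[str]:
    """Generate each step with the small model unless its confidence falls below the threshold."""
    trace = []
    for index, logits in enumerate(step_logits):
        confident = max_softmax_confidence(logits) >= threshold
        generate = small_step if confident else large_step
        trace.append(generate(index))
    return trace


# Toy logits: the first step is peaked (confident), the second is nearly flat (uncertain).
toy_logits = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
toy_trace = route_steps(
    toy_logits,
    small_step=lambda i: f"small:{i}",
    large_step=lambda i: f"large:{i}",
)
```

Because the decision uses only the small model's own logits, a router of this shape needs no separately trained module, which is the property the abstract emphasizes.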
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking EMNLP 2025
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and novelty and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.
comment: EMNLP 2025
CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners EMNLP 2025
Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference -- we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only a single layer or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By only leveraging a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data at https://github.com/zjunlp/CaKE.
comment: EMNLP 2025
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity NeurIPS 2025
Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We find that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths and limitations and raising questions about their reasoning capabilities.
comment: NeurIPS 2025. camera-ready version + additional discussion in the appendix
Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games
Human choice prediction in economic contexts is crucial for applications in marketing, finance, public policy, and more. This task, however, is often constrained by the difficulties in acquiring human choice data. With most experimental economics studies focusing on simple choice settings, the AI community has explored whether LLMs can substitute for humans in these predictions and examined more complex experimental economics settings. However, a key question remains: can LLMs generate training data for human choice prediction? We explore this in language-based persuasion games, a complex economic setting involving natural language in strategic interactions. Our experiments show that models trained on LLM-generated data can effectively predict human behavior in these games and even outperform models trained on actual human data. Beyond data generation, we investigate the dual role of LLMs as both data generators and predictors, introducing a comprehensive empirical study on the effectiveness of utilizing LLMs for data generation, human choice prediction, or both. We then utilize our choice prediction framework to analyze how strategic factors shape decision-making, showing that interaction history (rather than linguistic sentiment alone) plays a key role in predicting human decision-making in repeated interactions. Particularly, when LLMs capture history-dependent decision patterns similarly to humans, their predictive success improves substantially. Finally, we demonstrate the robustness of our findings across alternative persuasion-game settings, highlighting the broader potential of using LLM-generated data to model human decision-making.
ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation
Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. While recent work has focused on using LLMs to improve review efficiency, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine academic integrity. To address this issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews through a four-stage LLM-driven framework: data collection from ICLR and NeurIPS on OpenReview, GPT-4.1 annotation with human validation, synthetic data augmentation yielding 6,634 papers with 24,657 real and 46,438 synthetic reviews, and fine-tuning of encoder-based models and open-source LLMs. Feature analysis reveals that deficient reviews exhibit lower rating scores, higher self-reported confidence, reduced structural complexity, and more negative sentiment than sufficient reviews. AI-generated text detection shows dramatic increases in AI-authored reviews since ChatGPT's emergence. Mixed training with synthetic and real data substantially improves detection performance - for example, Qwen 3-8B achieves recall of 0.6653 and F1 of 0.7073, up from 0.5499 and 0.5606 respectively. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review. Code, prompts, and data are available at https://github.com/haoxuan-unt2024/ReviewGuard
comment: Accepted as a full paper at the 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025)
RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering
The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.
comment: Need to work on the methodology more
Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding EMNLP 2025
How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators in 100K+ texts spanning social networks, news and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, performing much better than average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
comment: Appears in UncertaiNLP@EMNLP 2025
Human-Computer Interaction
End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss
Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
comment: The source code is available at https://github.com/wer010/GLRBM-Mocap
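For readers unfamiliar with geodesic losses on rotations, the following sketch computes the standard geodesic distance on SO(3), theta = arccos((trace(R_pred^T R_true) - 1) / 2), and averages it over a batch. Whether the paper uses exactly this formulation is an assumption; the helper names and the pure-Python 3x3 matrices are illustrative only.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rot_z(angle: float) -> Matrix:
    """Rotation about the z-axis by `angle` radians (helper for the toy example)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_distance(r_pred: Matrix, r_true: Matrix) -> float:
    """Angle in radians of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_pred^T R_true) equals the element-wise (Frobenius) inner product.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def geodesic_loss(preds: Sequence[Matrix], targets: Sequence[Matrix]) -> float:
    """Mean geodesic distance over paired predicted and target rotations."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and of equal length")
    return sum(geodesic_distance(p, t) for p, t in zip(preds, targets)) / len(preds)


# A quarter turn about z is pi/2 radians away from the identity rotation.
quarter_turn = geodesic_distance(rot_z(0.0), rot_z(math.pi / 2))
```

Unlike a plain element-wise error on rotation parameters, this distance measures the actual angle between orientations, which is what "manifold-aware" refers to.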
Optimized User Experience for Labeling Systems for Predictive Maintenance Applications (Extended)
The maintenance of rail vehicles and infrastructure plays a critical role in reducing delays, preventing malfunctions, and ensuring the economic efficiency of rail transportation companies. Predictive maintenance systems powered by supervised machine learning offer a promising approach by detecting failures before they occur, reducing unscheduled downtime, and improving operational efficiency. However, the success of such systems depends on high-quality labeled data, necessitating user-centered labeling interfaces tailored to annotators' needs for Usability and User Experience. This study introduces a cost-effective predictive maintenance system developed in the federally funded project DigiOnTrack, which combines structure-borne noise measurement with supervised learning to provide monitoring and maintenance recommendations for rail vehicles and infrastructure in rural Germany. The system integrates wireless sensor networks, distributed ledger technology for secure data transfer, and a dockerized container infrastructure hosting the labeling interface and dashboard. Train drivers and workshop foremen labeled faults on infrastructure and vehicles to ensure accurate recommendations. The Usability and User Experience evaluation showed that the locomotive drivers' interface achieved Excellent Usability, while the workshop foremen's interface was rated as Good. These results highlight the system's potential for integration into daily workflows, particularly in labeling efficiency. However, areas such as Perspicuity require further optimization for more data-intensive scenarios. The findings offer insights into the design of predictive maintenance systems and labeling interfaces, providing a foundation for future guidelines in Industry 4.0 applications, particularly in rail transportation.
GazeInterpreter: Parsing Eye Gaze to Generate Eye-Body-Coordinated Narrations AAAI 2026
Comprehensively interpreting human behavior is a core challenge in human-aware artificial intelligence. However, prior works typically focused on body behavior, neglecting the crucial role of eye gaze and its synergy with body motion. We present GazeInterpreter - a novel large language model-based (LLM-based) approach that parses eye gaze data to generate eye-body-coordinated narrations. Specifically, our method features 1) a symbolic gaze parser that translates raw gaze signals into symbolic gaze events; 2) a hierarchical structure that first uses an LLM to generate eye gaze narration at semantic level and then integrates gaze with body motion within the same observation window to produce integrated narration; and 3) a self-correcting loop that iteratively refines the modality match, temporal coherence, and completeness of the integrated narration. This hierarchical and iterative processing can effectively align physical values and semantic text in the temporal and spatial domains. We validated the effectiveness of our eye-body-coordinated narrations on the text-driven motion generation task in the large-scale Nymeria benchmark. Moreover, we report significant performance improvements for the sample downstream tasks of action anticipation and behavior summarization. Taken together, these results reveal the significant potential of parsing eye gaze to interpret human behavior and open up a new direction for human behavior understanding.
comment: Accepted to AAAI 2026. 9 pages, 4 figures
Optimizing Predictive Maintenance: Enhanced AI and Backend Integration
Rail transportation success depends on efficient maintenance to avoid delays and malfunctions, particularly in rural areas with limited resources. We propose a cost-effective wireless monitoring system that integrates sensors and machine learning to address these challenges. We developed a secure data management system, equipping train cars and rail sections with sensors to collect structural and environmental data. This data supports Predictive Maintenance by identifying potential issues before they lead to failures. Implementing this system requires a robust backend infrastructure for secure data transfer, storage, and analysis. Designed collaboratively with stakeholders, including the railroad company and project partners, our system is tailored to meet specific requirements while ensuring data integrity and security. This article discusses the reasoning behind our design choices, including the selection of sensors, data handling protocols, and Machine Learning models. We propose a system architecture for implementing the solution, covering aspects such as network topology and data processing workflows. Our approach aims to enhance the reliability and efficiency of rail transportation through advanced technological integration.
Optimized User Experience for Labeling Systems for Predictive Maintenance Applications
This paper presents the design and implementation of a graphical labeling user interface for a monitoring and predictive maintenance system for trains and rail infrastructure in a rural area of Germany. Aiming to enhance rail transportation's economic viability and operational efficiency, our project utilizes cost-effective wireless monitoring systems that combine affordable sensors and machine learning algorithms. Given that a successful labeling phase is indispensable for training a supervised machine learning system, we emphasize the importance of a user-friendly labeling user interface, which can be optimally integrated into the daily work routines of annotators. The labeling system has been designed based on best practices in usability heuristics and will be validated for usability and user experience through a study, the protocol for which is presented here. The value of this work lies in its potential to reduce maintenance costs and improve service reliability in rail transportation, contributing to the academic literature and offering practical insights for research on effective labeling user interfaces, as well as for the development of labeling systems in the industry. Upon completion of the study, we will share the results, refine the system as necessary, and explore its scalability in other areas of infrastructure maintenance.
When Less is More: A Story of Failing Bayesian Optimization Due to Additional Expert Knowledge
The compounding of plastics with recycled material remains a practical challenge, as the properties of the processed material are not as easy to control as with completely new raw materials. For a data scientist, it makes sense to plan the necessary experiments in the development of new compounds using Bayesian Optimization, an optimization approach based on a surrogate model that is known for its data efficiency and is therefore well suited for data obtained from costly experiments. Furthermore, if historical data and expert knowledge are available, their inclusion in the surrogate model is expected to accelerate the convergence of the optimization. In this article, we describe a use case in which the addition of data and knowledge has impaired optimization. We also describe the unsuccessful methods that were used to remedy the problem before we found the reasons for the poor performance and achieved a satisfactory result. We conclude with a lesson learned: additional knowledge and data are only beneficial if they do not complicate the underlying optimization goal.
Gaze Archive: Enhancing Human Memory through Active Visual Logging on Smart Glasses
People today are overwhelmed by massive amounts of information, leading to cognitive overload and memory burden. Traditional visual memory augmentation methods are either effortful and disruptive or fail to align with user intent. To address these limitations, we propose Gaze Archive, a novel visual memory enhancement paradigm through active logging on smart glasses. It leverages human gaze as a natural attention indicator, enabling both intent-precise capture and effortless-and-unobtrusive interaction. To implement Gaze Archive, we develop GAHMA, a technical framework that enables compact yet intent-aligned memory encoding and intuitive memory recall based on natural language queries. Quantitative experiments on our newly constructed GAVER dataset show that GAHMA achieves more intent-precise logging than non-gaze baselines. Through extensive user studies in both laboratory and real-world scenarios, we compare Gaze Archive with other existing memory augmentation methods. Results demonstrate its advantages in perceived effortlessness, unobtrusiveness and overall preference, showing strong potential for real-world deployment.
Heterogeneous Stroke: Using Unique Vibration Cues to Improve the Wrist-Worn Spatiotemporal Tactile Display
Beyond a simple notification of incoming calls or messages, more complex information such as letters and digits can be delivered through spatiotemporal tactile patterns (STPs) on a wrist-worn tactile display (WTD) with multiple tactors. However, owing to the limited skin area and spatial acuity of the wrist, frequent confusions occur between closely located tactors, resulting in a low recognition accuracy. Furthermore, the accuracies reported in previous studies have mostly been measured for a specific posture and could further decrease with free arm postures in real life. Herein, we present Heterogeneous Stroke, a design concept for improving the recognition accuracy of STPs on a WTD. By assigning unique vibrotactile stimuli to each tactor, the confusion between tactors can be reduced. Through our implementation of Heterogeneous Stroke, the alphanumeric characters could be delivered with high accuracy (93.8% for 26 letters and 92.4% for 10 digits) across different arm postures.
comment: ACM CHI 2021
Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud NeurIPS 2025
While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately "lo-fi" approach. We present the "Semantic Glitch," a soft flying robotic art installation whose physical form, a 3D pixel style cloud, is a "physical glitch" derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a "narrative mind" that complements the "weak," historically loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework's robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors, from landmark-based navigation to a compelling "plan to execution" gap, and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.
comment: NeurIPS 2025 Creative AI Track, The Thirty-Ninth Annual Conference on Neural Information Processing Systems
Panel-by-Panel Souls: A Performative Workflow for Expressive Faces in AI-Assisted Manga Creation NeurIPS 2025
Current text-to-image models struggle to render the nuanced facial expressions required for compelling manga narratives, largely due to the ambiguity of language itself. To bridge this gap, we introduce an interactive system built on a novel, dual-hybrid pipeline. The first stage combines landmark-based auto-detection with a manual framing tool for robust, artist-centric face preparation. The second stage maps expressions using the LivePortrait engine, blending intuitive performative input from video for fine-grained control. Our case study analysis suggests that this integrated workflow can streamline the creative process and effectively translate narrative intent into visual expression. This work presents a practical model for human-AI co-creation, offering artists a more direct and intuitive means of "infusing souls" into their characters. Our primary contribution is not a new generative model, but a novel, interactive workflow that bridges the gap between artistic intent and AI execution.
comment: NeurIPS 2025 Creative AI Track, The Thirty-Ninth Annual Conference on Neural Information Processing Systems
Monte Carlo Expected Threat (MOCET) Scoring NeurIPS 2025
Evaluating and measuring AI Safety Level (ASL) threats are crucial for guiding stakeholders to implement safeguards that keep risks within acceptable limits. ASL-3+ models present a unique risk in their ability to uplift novice non-state actors, especially in the realm of biosecurity. Existing evaluation metrics, such as LAB-Bench, BioLP-bench, and WMDP, can reliably assess model uplift and domain knowledge. However, metrics that better contextualize "real-world risks" are needed to inform the safety case for LLMs, along with scalable, open-ended metrics to keep pace with their rapid advancements. To address both gaps, we introduce MOCET, an interpretable and doubly-scalable metric (automatable and open-ended) that can quantify real-world risks.
comment: Accepted to NeurIPS 2025 BioSafe GenAI
Stable diffusion models reveal a persisting human and AI gap in visual creativity
While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation by human participants (Visual Artists and Non-Artists) and by an image generation AI model under two prompting conditions with varying human input (high for Human-Inspired, low for Self-Guided). Human raters (N=255) and GPT4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non-Artists, then Human-Inspired generative AI, and finally Self-Guided generative AI. Increased human guidance strongly improved GenAI's creative output, bringing its productions close to those of Non-Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language-centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.
Scene Awareness While Using Multiple Navigation Aids in AR Search
Augmented reality (AR) allows virtual information to be presented in the real world, providing support for numerous tasks including search and navigation. Allowing users access to multiple navigation aids may help leverage the benefits of different navigational guidance methods, but may also have negative perceptual and cognitive impacts. In this study, users performed searches for virtual gems within a large-scale augmented environment while choosing to deploy two different navigation aids either independently or simultaneously: world-locked arrows and an on-screen radar. After completing the search, participants were asked to recall objects that may or may not have been present in the scene. The use of navigation aids impacted object recall, with impaired recall of objects in the environment when an aid was switched on. The results point to possible factors affecting object awareness in mobile AR and underscore the potential for adaptable interfaces to support users navigating the physical world.
comment: Poster Summary, 2 pages. Presented at the 2025 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
Generative Augmented Reality: Paradigms, Technologies, and Future Applications
This paper introduces Generative Augmented Reality (GAR) as a next-generation paradigm that reframes augmentation as a process of world re-synthesis rather than world composition by a conventional AR engine. GAR replaces the conventional AR engine's multi-stage modules with a unified generative backbone, where environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for continuous video generation. We formalize the computational correspondence between AR and GAR, survey the technical foundations that make real-time generative augmentation feasible, and outline prospective applications that leverage its unified inference model. We envision GAR as a future AR paradigm that delivers high-fidelity experiences in terms of realism, interactivity, and immersion, while eliciting new research challenges on technologies, content ecosystems, and the ethical and societal implications.
Trust in AI emerges from distrust in humans: A machine learning study on decision-making guidance
This study explores the dynamics of trust in artificial intelligence (AI) agents, particularly large language models (LLMs), by introducing the concept of "deferred trust", a cognitive mechanism where distrust in human agents redirects reliance toward AI perceived as more neutral or competent. Drawing on frameworks from social psychology and technology acceptance models, the research addresses gaps in user-centric factors influencing AI trust. Fifty-five undergraduate students participated in an experiment involving 30 decision-making scenarios (factual, emotional, moral), selecting from AI agents (e.g., ChatGPT), voice assistants, peers, adults, or priests as guides. Data were analyzed using K-Modes and K-Means clustering for patterns, and XGBoost models with SHAP interpretations to predict AI selection based on sociodemographic and prior trust variables. Results showed adults (35.05%) and AI (28.29%) as the most selected agents overall. Clustering revealed context-specific preferences: AI dominated factual scenarios, while humans prevailed in social/moral ones. Lower prior trust in human agents (priests, peers, adults) consistently predicted higher AI selection, supporting deferred trust as a compensatory transfer. Participant profiles with higher AI trust were distinguished by human distrust, lower technology use, and higher socioeconomic status. Models demonstrated consistent performance (e.g., average precision up to 0.863). Findings challenge traditional models like TAM/UTAUT, emphasizing relational and epistemic dimensions in AI trust. They highlight risks of over-reliance due to fluency effects and underscore the need for transparency to calibrate vigilance. Limitations include sample homogeneity and static scenarios; future work should incorporate diverse populations and multimodal data to refine deferred trust across contexts.
comment: 36 pages, 6 figures
People readily follow personal advice from AI but it does not improve their well-being
People increasingly seek personal advice from large language models (LLMs), yet whether humans follow their advice, and its consequences for their well-being, remains unknown. In a longitudinal randomised controlled trial with a representative UK sample (N = 2,302), 75% of participants who had a 20-minute discussion with GPT-4o about health, careers or relationships subsequently reported following its advice. Based on autograder evaluations of chat transcripts, LLM advice rarely violated safety best practice. When queried 2-3 weeks later, participants who had interacted with personalised AI (with access to detailed user information) followed its advice more often in the real world and reported higher well-being than those advised by non-personalised AI. However, while receiving personal advice from AI temporarily reduced well-being, no differential long-term effects compared to a control emerged. Our results suggest that humans readily follow LLM advice about personal issues but doing so shows no additional well-being benefit over casual conversations.
Automatically Detecting Online Deceptive Patterns
Deceptive patterns in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous on various digital platforms. While efforts to mitigate deceptive patterns have emerged from legal and technical perspectives, a significant gap remains in creating usable and scalable solutions. We introduce our AutoBot framework to address this gap and help web stakeholders navigate and mitigate online deceptive patterns. AutoBot accurately identifies and localizes deceptive patterns from a screenshot of a website without relying on the underlying HTML code. AutoBot employs a two-stage pipeline that first leverages the capabilities of specialized vision models to analyze website screenshots, identify interactive elements, and extract textual features. Next, using a large language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We also use AutoBot to create a synthetic dataset to distill knowledge from 'teacher' LLMs to smaller language models. Through extensive evaluation, we demonstrate AutoBot's effectiveness in detecting deceptive patterns on the web, where it achieves an F1-score of 0.93, underscoring its potential as an essential tool for mitigating online deceptive patterns. We implement AutoBot across three downstream applications targeting different web stakeholders: (1) a local browser extension providing users with real-time feedback, (2) a Lighthouse audit to inform developers of potential deceptive patterns on their sites, and (3) a measurement tool designed for researchers and regulators.
When concept-based XAI is imprecise: Do people distinguish between generalisations and misrepresentations?
Concept-based explainable artificial intelligence (C-XAI) can let people see which representations an AI model has learned. This is particularly important when high-level semantic information (e.g., actions and relations) is used to make decisions about abstract categories (e.g., danger). In such tasks, AI models need to generalise beyond situation-specific details, and this ability can be reflected in C-XAI outputs that randomise over irrelevant features. However, it is unclear whether people appreciate such generalisation and can distinguish it from other, less desirable forms of imprecision in C-XAI outputs. Therefore, the present study investigated how the generality and relevance of C-XAI outputs affect people's evaluation of AI. In an experimental railway safety evaluation scenario, participants rated the performance of a simulated AI that classified traffic scenes involving people as dangerous or not. These classification decisions were explained via concepts in the form of similar image snippets. The latter differed in their match with the classified image, either regarding a highly relevant feature (i.e., people's relation to tracks) or a less relevant feature (i.e., people's action). Contrary to the hypotheses, concepts that generalised over less relevant features were rated lower than concepts that matched the classified image precisely. Moreover, their ratings were no better than those for systematic misrepresentations of the less relevant feature. Conversely, participants were highly sensitive to imprecisions in relevant features. These findings cast doubts on the assumption that people can easily infer from C-XAI outputs whether AI models have gained a deeper understanding of complex situations.
Design of a visual environment for programming by direct data manipulation
The use of applications on computers, smartphones, and tablets has been considerably simplified thanks to interactive and dynamic graphical interfaces coupled with the mouse and touch screens. It is no longer necessary to be a computer specialist to use them. Paradoxically, the development of computer programs generally requires writing lines of code in a programming language whose syntax is particularly strict. This process poses many difficulties for programmers. We propose an original tool in which arbitrary programs (Turing-complete) can be developed in a completely visual manner by direct manipulation of the data, without writing a line of code. The user can thus develop an algorithm by directly visualizing the result of actions taken on the data. A method for constructing iterations is associated with the tool. It proposes to create each part, including the loop body, in a non-linear manner under visual control of the state of the data. In addition, the tool supports the production of code that corresponds to the actions performed, where the language can be Python, C, or Java. In this article, we present the tool, the design choices, the problems solved, and the limits and contributions of the direct-data-manipulation approach.
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking EMNLP 2025
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and novelty and to suffer from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.
comment: EMNLP 2025
Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games
Human choice prediction in economic contexts is crucial for applications in marketing, finance, public policy, and more. This task, however, is often constrained by the difficulties in acquiring human choice data. With most experimental economics studies focusing on simple choice settings, the AI community has explored whether LLMs can substitute for humans in these predictions and examined more complex experimental economics settings. However, a key question remains: can LLMs generate training data for human choice prediction? We explore this in language-based persuasion games, a complex economic setting involving natural language in strategic interactions. Our experiments show that models trained on LLM-generated data can effectively predict human behavior in these games and even outperform models trained on actual human data. Beyond data generation, we investigate the dual role of LLMs as both data generators and predictors, introducing a comprehensive empirical study on the effectiveness of utilizing LLMs for data generation, human choice prediction, or both. We then utilize our choice prediction framework to analyze how strategic factors shape decision-making, showing that interaction history (rather than linguistic sentiment alone) plays a key role in predicting human decision-making in repeated interactions. Particularly, when LLMs capture history-dependent decision patterns similarly to humans, their predictive success improves substantially. Finally, we demonstrate the robustness of our findings across alternative persuasion-game settings, highlighting the broader potential of using LLM-generated data to model human decision-making.
Behind the Screens: Uncovering Bias in AI-Driven Video Interview Assessments Using Counterfactuals
AI-enhanced personality assessments are increasingly shaping hiring decisions, using affective computing to predict traits from the Big Five (OCEAN) model. However, integrating AI into these assessments raises ethical concerns, especially around bias amplification rooted in training data. These biases can lead to discriminatory outcomes based on protected attributes like gender, ethnicity, and age. To address this, we introduce a counterfactual-based framework to systematically evaluate and quantify bias in AI-driven personality assessments. Our approach employs generative adversarial networks (GANs) to generate counterfactual representations of job applicants by altering protected attributes, enabling fairness analysis without access to the underlying model. Unlike traditional bias assessments that focus on unimodal or static data, our method supports multimodal evaluation spanning visual, audio, and textual features. This comprehensive approach is particularly important in high-stakes applications like hiring, where third-party vendors often provide AI systems as black boxes. Applied to a state-of-the-art personality prediction model, our method reveals significant disparities across demographic groups. We also validate our framework using a protected attribute classifier to confirm the effectiveness of our counterfactual generation. This work provides a scalable tool for fairness auditing of commercial AI hiring platforms, especially in black-box settings where training data and model internals are inaccessible. Our results highlight the importance of counterfactual approaches in improving ethical transparency in affective computing.
From Checking to Sensemaking: A Caregiver-in-the-Loop Framework for AI-Assisted Task Verification in Dementia Care
Informal caregivers play a central role in enabling people living with dementia (PLwD) to remain at home, yet they face persistent challenges verifying whether daily tasks have been completed. Existing digital reminder systems prompt actions but rarely confirm outcomes, leaving caregivers to double-check tasks manually. This study explores how generative artificial intelligence (AI) might support caregiver-led task verification without displacing human judgment. We combined qualitative interviews with ten caregivers and one PLwD with a speculative simulation probe using a generative large language model to generate follow-up questions and flag responses for verification. Using template analysis, we identified three interrelated patterns of reasoning: detecting anomalies, constructing trustworthy evidence, and calibrating trust and control. These insights informed the Caregiver-in-the-Loop Task Verification (CLTV) framework, which models verification as a collaborative cycle of anomaly detection, evidence triangulation, AI-assisted summarization, and accountability circulation centered on caregiver oversight. CLTV advances human-AI collaboration theory by situating interpretability, trust, and control within the relational and emotional realities of dementia care and by offering design principles for transparent, adjustable, and context-aware AI support. We contribute a care-centered extension of human-AI collaboration theory, demonstrating how interpretability and trust can be operationalized through caregiver oversight.
Generative AI as a Learning Buddy and Teaching Assistant: Pre-service Teachers' Uses and Attitudes
This cross-sectional study investigates how preservice teachers in the Global South engage with Generative Artificial Intelligence across academic and instructional tasks while navigating infrastructural barriers such as limited internet access and high data costs. The study surveyed 167 preservice teachers from four teacher education institutions in Ghana. Descriptive statistics and inferential analyses, including multiple and ordinal logistic regressions, were used to examine patterns of GenAI use. Findings show that preservice teachers rely on GenAI as a learning companion for locating reading materials, accessing detailed content explanations, and identifying practical examples. They also use GenAI as a teaching assistant for tasks related to lesson preparation, including generating instructional resources, identifying assessment strategies, and developing lesson objectives. Usage patterns indicate that students in their third and fourth years have significantly higher frequencies of GenAI use compared to those in earlier years. Gender was not a significant predictor of GenAI adoption, in contrast to class level and age. Participants reported positive attitudes toward GenAI, noting that it supports autonomous learning and reduces dependence on peers and instructors for routine academic and teaching activities. However, challenges such as high data costs, occasional inaccuracies in GenAI outputs, and concerns about academic dishonesty were identified as factors that limit more frequent use. The study recommends the integration of GenAI literacy in teacher education programs, with a focus on ethical and responsible AI use to support equitable adoption in the Global South.
Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding EMNLP 2025
How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators in 100K+ texts spanning social networks, news and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, performing much better than average in balanced accuracy. Importantly, we find that AI models produce far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
comment: Appears in UncertaiNLP@EMNLP 2025
Computer Vision and Pattern Recognition
RoMa v2: Harder Better Faster Denser Feature Matching
Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold-standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data
Egocentric videos are a valuable and scalable data source to learn manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by categorizing human data into two categories, in-the-wild and on-task, along with a systematic analysis of how to use the data. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties from scaling human data, including language following of instructions from only human data, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/
comment: Project webpage: https://xiongyicai.github.io/In-N-On/
Think Visually, Reason Textually: Vision-Language Synergy in ARC
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
First Frame Is the Place to Go for Video Content Customization
What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
comment: Project Website: https://firstframego.github.io/
Hyperspectral Image Classification using Spectral-Spatial Mixer Network SP
This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model's effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet
comment: Accepted for WHISPERS2025
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they are computationally inefficient. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, which were originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method, which processes tokens from each modality separately, is then applied to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16× and the decoding time by 1.26×.
comment: Code will be released upon acceptance
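The gating and thresholding steps described in the MoDES abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the global layer importance is folded into the routing probabilities by a plain product, each modality has one scalar threshold, and the top-scoring expert is always kept. The names `expert_importance`, `select_experts`, `layer_weight`, and `thresholds` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def expert_importance(routing_probs: List[float], layer_weight: float) -> List[float]:
    """Scale one token's local routing probabilities by a global layer weight.

    Assumption: a simple product; the paper's actual GMLG formula may differ.
    """
    return [layer_weight * p for p in routing_probs]


def select_experts(
    routing_probs: List[float],
    layer_weight: float,
    modality: str,
    thresholds: Dict[str, float],
) -> List[int]:
    """Return indices of experts kept for one token; the rest are skipped."""
    scores = expert_importance(routing_probs, layer_weight)
    tau = thresholds[modality]  # hypothetical per-modality threshold
    kept = [i for i, s in enumerate(scores) if s >= tau]
    if not kept:
        # Assumption: never skip every expert, fall back to the best one.
        kept = [max(range(len(scores)), key=scores.__getitem__)]
    return kept


if __name__ == "__main__":
    thresholds = {"text": 0.2, "vision": 0.1}
    probs = [0.6, 0.3, 0.1]
    print(select_experts(probs, 0.5, "text", thresholds))
    print(select_experts(probs, 0.5, "vision", thresholds))
```

With a stricter threshold for one modality, the same routing probabilities keep fewer experts for tokens of that modality, which is the behavior a per-modality skipping schedule relies on.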
MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features
Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non-depressed) classification, the model achieved a sensitivity of 0.96 and an F2 score of 0.94. For the 3-class (no depression, mild to moderate depression, and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and a specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and an F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.
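For readers unfamiliar with the F2 score reported above, the following sketch shows the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall (sensitivity). With beta = 2, recall is weighted more heavily than precision, which suits screening tasks where missed cases are costly. The function name and the example values are illustrative and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta=2 gives the F2 score."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + b2) * precision * recall / denom


if __name__ == "__main__":
    # Illustrative values only: high recall, moderate precision.
    print(round(f_beta(0.5, 1.0), 4))            # F2
    print(round(f_beta(0.5, 1.0, beta=1.0), 4))  # F1 for comparison
```

When recall exceeds precision, F2 is larger than F1 for the same pair of values, so the two scores are not directly comparable across papers.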
VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
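The group-relative idea behind GRPO, which the VisPlay abstract names as its training algorithm, can be sketched as follows: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group instead of a learned value baseline. This is a minimal sketch of that normalization step only. The use of the population standard deviation and the `eps` constant are assumptions, and the diversity and difficulty rewards mentioned in the abstract are not modeled.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers to one question.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason question difficulty matters in this kind of setup.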
GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI
Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
INQUIRE-Search: A Framework for Interactive Discovery in Large-Scale Biodiversity Databases
Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and habitat. Yet most ecological workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible at scale. We introduce INQUIRE-Search, an open-source system that enables scientists to rapidly and interactively search within an ecological image database for specific concepts using natural language, verify and export relevant observations, and utilize this discovered data for novel scientific analysis. Compared to traditional methods, INQUIRE-Search takes a fraction of the time, opening up new possibilities for scientific questions that can be explored. Through five case studies, we show the diversity of scientific applications that a tool like INQUIRE-Search can support, from seasonal variation in behavior across species to forest regrowth after wildfires. These examples demonstrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we emphasize that using such AI-enabled discovery tools for science calls for experts to reframe the priorities of the scientific process and develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.
comment: EV, JC, RKV contributed equally
MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling
Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, IMU measurements in IO have been processed under two coordinate system paradigms: the body coordinate frame and the global coordinate frame, with the latter being widely adopted. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global coordinate frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the inertial odometry task.
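The frequency decoupling described above can be sketched as a single-level split of one IMU channel into a smooth low-frequency band and a high-frequency residual. This is a simplification under stated assumptions: a full Laplacian pyramid also downsamples between levels, and the moving-average filter, the `radius` parameter, and the sample values are illustrative choices rather than MambaIO's actual design.

```python
def smooth(signal, radius=2):
    """Moving-average low-pass filter; windows are truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def split_bands(signal, radius=2):
    """Return (low, high) so that low[i] + high[i] reconstructs signal[i]."""
    low = smooth(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# Hypothetical accelerometer samples along one axis.
accel = [0.0, 0.1, 0.9, 0.2, 0.1, 0.0, -0.8, -0.1]
low_band, high_band = split_bands(accel)
```

In the setup the abstract describes, the low band would go to the sequence model and the high band to the convolutional branch; because the residual is defined by subtraction, the two bands together lose no information.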
Multi-Stage Residual-Aware Unsupervised Deep Learning Framework for Consistent Ultrasound Strain Elastography
Ultrasound Strain Elastography (USE) is a powerful non-invasive imaging technique for assessing tissue mechanical properties, offering crucial diagnostic value across diverse clinical applications. However, its clinical application remains limited by tissue decorrelation noise, scarcity of ground truth, and inconsistent strain estimation under different deformation conditions. To overcome these barriers, we propose MUSSE-Net, a residual-aware, multi-stage unsupervised sequential deep learning framework designed for robust and consistent strain estimation. At its backbone lies our proposed USSE-Net, an end-to-end multi-stream encoder-decoder architecture that processes pre- and post-deformation RF sequences in parallel to estimate displacement fields and axial strains. The novel architecture incorporates a Context-Aware Complementary Feature Fusion (CACFF)-based encoder, a Tri-Cross Attention (TCA) bottleneck, and a Cross-Attentive Fusion (CAF)-based sequential decoder. To ensure temporal coherence and strain stability across varying deformation levels, this architecture leverages a tailored consistency loss. Finally, within the MUSSE-Net framework, a secondary residual refinement stage further enhances accuracy and suppresses noise. Extensive validation on simulation, in vivo, and private clinical datasets from Bangladesh University of Engineering and Technology (BUET) medical center demonstrates that MUSSE-Net outperforms existing unsupervised approaches. On simulation data, MUSSE-Net achieves state-of-the-art performance with a target SNR of 24.54, background SNR of 132.76, CNR of 59.81, and elastographic SNR of 9.73. In particular, on the BUET dataset, MUSSE-Net produces strain maps with enhanced lesion-to-background contrast and significant noise suppression, yielding clinically interpretable strain patterns.
comment: 13 pages, 9 figures
Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning
Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like "dog" subsumes fine-grained categories such as "Labrador" and "Golden Retriever," and each category entails its images. Yet existing CLIP-based CIL methods fail to explicitly capture this inherent hierarchy, causing fine-grained class features to drift during incremental updates and ultimately leading to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring), which anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
When to Think and When to Look: Uncertainty-Guided Lookback
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision-language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
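The 103% figure quoted above follows directly from the two success rates in the abstract. The helper name below is illustrative; the numbers are taken from the text.

```python
def relative_improvement(baseline, improved):
    """Relative gain over the baseline, expressed in percent."""
    return (improved - baseline) / baseline * 100.0


# Success rates reported in the abstract: 48.9% (supervised) -> 99.2% (SRPO).
gain = relative_improvement(48.9, 99.2)
print(round(gain))  # 103
```

The absolute gain is 50.3 percentage points; dividing by the 48.9% baseline gives roughly 102.9%, which rounds to the reported 103%.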
MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation
Medical image segmentation typically adopts a point-wise convolutional segmentation head to predict dense labels, where each output channel is heuristically tied to a specific class. This rigid design limits both feature sharing and semantic generalization. In this work, we propose a unified decoupled segmentation head that separates multi-class prediction into class-agnostic mask prediction and class label prediction using shared object queries. Furthermore, we introduce a Full-Scale Aware Deformable Transformer module that enables low-resolution encoder features to attend across full-resolution encoder features via deformable attention, achieving memory-efficient and spatially aligned full-scale fusion. Our proposed method, named MaskMed, achieves state-of-the-art performance, surpassing nnUNet by +2.0% Dice on AMOS 2022 and +6.9% Dice on BTCV.
US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery MICCAI 2025
Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step toward future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete
comment: Accepted at the Workshop on Shape in Medical Imaging at MICCAI 2025
Learning from Mistakes: Loss-Aware Memory Enhanced Continual Learning for LiDAR Place Recognition
LiDAR place recognition plays a crucial role in SLAM, robot navigation, and autonomous driving. However, existing LiDAR place recognition methods often struggle to adapt to new environments without forgetting previously learned knowledge, a challenge widely known as catastrophic forgetting. To address this issue, we propose KDF+, a novel continual learning framework for LiDAR place recognition that extends the KDF paradigm with a loss-aware sampling strategy and a rehearsal enhancement mechanism. The proposed sampling strategy estimates the learning difficulty of each sample via its loss value and selects samples for replay according to their estimated difficulty. Harder samples, which tend to encode more discriminative information, are sampled with higher probability while maintaining distributional coverage across the dataset. In addition, the rehearsal enhancement mechanism encourages memory samples to be further refined during new-task training by slightly reducing their loss relative to previous tasks, thereby reinforcing long-term knowledge retention. Extensive experiments across multiple benchmarks demonstrate that KDF+ consistently outperforms existing continual learning methods and can be seamlessly integrated into state-of-the-art continual learning frameworks for LiDAR place recognition to yield significant and stable performance gains. The code will be available at https://github.com/repo/KDF-plus.
comment: 8 pages, 4 figures
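One way to read the loss-aware sampling in the KDF+ abstract above is as a loss-proportional distribution blended with a uniform one, so harder samples are favored while every sample keeps a nonzero chance of replay. The sketch below is a guess at that shape only: the blend, the `uniform_mix` parameter, and the loss values are assumptions, not the rule used by KDF+.

```python
import random


def replay_probabilities(losses, uniform_mix=0.2):
    """Blend loss-proportional and uniform sampling probabilities.

    `uniform_mix` is a hypothetical knob that keeps coverage of easy samples.
    """
    n = len(losses)
    total = sum(losses)
    if total == 0:
        return [1.0 / n] * n
    return [(1.0 - uniform_mix) * loss / total + uniform_mix / n for loss in losses]


# Hypothetical per-sample losses from the previous task.
losses = [0.05, 0.40, 0.10, 1.20]
probs = replay_probabilities(losses)

# Draw a small replay batch (with replacement) using a seeded generator.
rng = random.Random(0)
replay_indices = rng.choices(range(len(losses)), weights=probs, k=3)
```

The uniform term sets a floor of `uniform_mix / n` on every probability, which is one simple way to keep distributional coverage while still favoring high-loss samples.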
MHR: Momentum Human Rig
We present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. Our model enables expressive, anatomically plausible human animation, supporting non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines.
CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking AAAI 2026
3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite considerable progress, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
comment: Accepted by AAAI 2026 (Oral)
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
With the increasing prevalence of video content, effectively understanding and answering questions about long-form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
comment: Accepted in the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)
From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only $121/61/34/14$ dimensions suffice to capture $99\%/95\%/90\%/80\%$ of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student's last block to the teacher's width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$ when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
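The "dimensions needed to capture a given share of the energy" statistic in the abstract above can be read off a singular value spectrum as follows. The sketch assumes energy means the sum of squared singular values, a common convention that the abstract does not state, and the spectrum used here is a toy example rather than CaiT-S24 data.

```python
def rank_for_energy(singular_values, fraction):
    """Smallest k whose top-k squared singular values reach `fraction` of the total."""
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return k
    return len(energies)


# Toy spectrum (hypothetical); squared values are 9, 4, 1, 0.25.
spectrum = [3.0, 2.0, 1.0, 0.5]
ranks = {f: rank_for_energy(spectrum, f) for f in (0.80, 0.90, 0.95, 0.99)}
```

This measures a global property of the whole feature matrix. The token-level analysis in the abstract asks a different question, namely how each individual token spreads its energy across channels, which this statistic does not capture.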
Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector
Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean conditions. From an anti-forensics perspective, it is important to develop advanced adversarial attacks for evaluating the security of such detectors, a direction that remains insufficiently explored. This letter proposes a Dual-domain Feature Importance Attack (DuFIA) scheme to invalidate AIGI detectors to some extent. Forensically important features are captured by the spatially interpolated gradient and frequency-aware perturbation. The adversarial transferability is enhanced by jointly modeling spatial and frequency-domain feature importances, which are fused to guide the optimization-based adversarial example generation. Extensive experiments across various AIGI detectors verify the cross-model transferability, transparency and robustness of DuFIA.
Computer-Use Agents as Judges for Generative User Interface
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUAs act as judges to assist Coders in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
comment: Project: https://showlab.github.io/AUI Github: https://github.com/showlab/AUI
Scriboora: Rethinking Human Pose Forecasting
Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving or human-robot interaction. This paper evaluates a wide range of pose forecasting algorithms in the task of absolute pose forecasting, revealing many reproducibility issues, and provides a unified training and evaluation pipeline. After drawing a high-level analogy to the task of speech understanding, it is shown that recent speech models can be efficiently adapted to the task of pose forecasting, and improve current state-of-the-art performance. Finally, the robustness of the models is evaluated using noisy joint coordinates obtained from a pose estimator model, which reflect a realistic type of noise that is closer to real-world applications. For this purpose, a new dataset variation is introduced, and it is shown that estimated poses result in a substantial performance degradation, as well as how much of it can be recovered by unsupervised finetuning.
Multimodal Evaluation of Russian-language Architectures
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
A Hybrid CNN-ViT-GNN Framework with GAN-Based Augmentation for Intelligent Weed Detection in Precision Agriculture
Weed detection is an essential element of precision agriculture, since accurate species identification allows a farmer to apply herbicides selectively and supports sustainable crop management. This paper proposes a hybrid deep learning framework for weed detection that combines Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to remain robust across multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method is applied to balance class distributions and improve the model's generalization. Further, a self-supervised contrastive pre-training method helps to learn richer features from limited annotated data. Experiments show superior performance, with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment to edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
Multi-Text Guided Few-Shot Semantic Segmentation
Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variations. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i, with notable improvements in folds exhibiting high intra-class variations.
Learning to Expand Images for Efficient Visual Autoregressive Modeling
Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation quality by aligning the generation order with perceptual relevance. Extensive experiments on ImageNet demonstrate that EAR achieves state-of-the-art trade-offs between fidelity and efficiency on single-scale autoregressive models, setting a new direction for scalable and cognitively aligned autoregressive image generation.
comment: 16 pages, 18 figures, includes appendix with additional visualizations, submitted as arXiv preprint
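The center-outward spiral ordering described in the EAR abstract above can be illustrated with a short sketch. This is not the authors' implementation: the function names `spiral_order` and `expanding_groups`, the clockwise direction, and the doubling group schedule are all assumptions made for illustration. The abstract states only that tokens unfold in a spiral from the center and that the number of tokens predicted per step is adjusted dynamically.

```python
def spiral_order(height, width):
    """Return (row, col) grid positions in a center-outward square spiral.

    Assumption: the spiral starts at the (upper-left) center cell and
    turns right, down, left, up; the paper may use a different convention.
    """
    r, c = (height - 1) // 2, (width - 1) // 2
    order = [(r, c)]
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    step, d = 1, 0
    while len(order) < height * width:
        # Each arm length is used twice before it grows by one.
        for _ in range(2):
            dr, dc = directions[d % 4]
            for _ in range(step):
                r, c = r + dr, c + dc
                # Positions outside the grid are skipped, not emitted.
                if 0 <= r < height and 0 <= c < width:
                    order.append((r, c))
            d += 1
        step += 1
    return order


def expanding_groups(order, growth=2):
    """Split an ordering into groups of growing size (1, 2, 4, ...).

    Hypothetical schedule standing in for length-adaptive decoding:
    each group would be predicted in parallel in one decoding step.
    """
    groups, size, i = [], 1, 0
    while i < len(order):
        groups.append(order[i:i + size])
        i += size
        size *= growth
    return groups


if __name__ == "__main__":
    for step_index, group in enumerate(expanding_groups(spiral_order(3, 3))):
        print(step_index, group)
```

Under this schedule a 3x3 token grid is covered in four decoding steps rather than nine, which is the kind of saving parallel decoding targets; the actual step sizes in EAR are chosen adaptively rather than by a fixed doubling rule.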
Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels
Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement methods rely on paired training data captured under a single low-light condition and a well-lit reference. The lack of radiance diversity limits our understanding of how enhancement techniques perform across varying illumination intensities. We introduce the Multi-Illumination Low-Light (MILL) dataset, containing images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. MILL enables comprehensive evaluation of enhancement algorithms across variable lighting conditions. We benchmark several state-of-the-art methods and reveal significant performance variations across intensity levels. Leveraging the unique multi-illumination structure of our dataset, we propose improvements that enhance robustness across diverse illumination scenarios. Our modifications achieve up to 10 dB PSNR improvement for DSLR and 2 dB for the smartphone on Full HD images.
NTK-Guided Implicit Neural Teaching
Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address this, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.
comment: Preprint
A Novel CustNetGC Boosted Model with Spectral Features for Parkinson's Disease Prediction
Parkinson's disease (PD) is a neurodegenerative disorder that is difficult to diagnose and treat. Early symptoms can include tremors, wheezy breathing, and changes in voice quality, which are critical indicators of neural damage. Notably, there has been growing interest in using changes in vocal attributes as markers for early detection of PD. Based on this understanding, this paper focuses on acoustic feature analysis of voice recordings from patients diagnosed with PD and healthy controls (HC). We introduce a novel classification and visualization model, CustNetGC, which combines a Convolutional Neural Network (CNN) with Custom Network Grad-CAM and CatBoost to enhance the efficiency of PD diagnosis. We use a publicly available dataset from Figshare, comprising voice recordings of 81 participants: 40 patients with PD and 41 healthy controls. From these recordings, we extract two key spectral features: L-mHP and Spectral Slopes. The L-mHP feature combines three spectrogram representations: Log-Mel spectrogram, harmonic spectrogram, and percussive spectrogram, which are derived using Harmonic-Percussive Source Separation (HPSS). Grad-CAM was used to highlight the important regions in the data, thus making the PD predictions interpretable and effective. Our proposed CustNetGC model achieved an accuracy of 99.06% and a precision of 95.83%, with the area under the ROC curve (AUC) recorded at 0.90 for the PD class and 0.89 for the HC class. Additionally, incorporating CatBoost, a gradient boosting algorithm, enhanced robustness and prediction performance in classifying PD and non-PD samples. These results indicate the potential of CustNetGC to improve both the diagnostic accuracy and the interpretability of Parkinson's disease prediction.
FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI
Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
RS-CA-HSICT: A Residual and Spatial Channel Augmented CNN Transformer Framework for Monkeypox Detection
This work proposes a hybrid deep learning approach, namely the Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNNs and Transformers for enhanced Mpox detection. The proposed RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and a CA, which together enhance the diverse feature space, detailed lesion information, and long-range dependencies. The new HSICT module first integrates an abstract representation of the stem CNN with customized ICT blocks for efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and perform local texture extraction. Additionally, the H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Moreover, inverse residual learning mitigates the vanishing-gradient problem, and stage-wise resolution reduction ensures scale invariance. Furthermore, the RS-CA-HSICT framework augments the learned HSICT channels with the TL-driven Residual and Spatial CNN maps for an enhanced multiscale feature space capturing global and localized structural cues, subtle texture, and contrast variations. These channels, preceding augmentation, are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation. Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in Mpox. Experimental results on both the Kaggle benchmark and a diverse Mpox dataset show classification accuracy as high as 98.30% and an F1-score of 98.13%, outperforming existing CNNs and ViTs.
comment: 33 pages, 12 figures, 4 tables
Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners
Purse seiners play a crucial role in tuna fishing, as approximately 69% of the world's tropical tuna is caught using this gear. All tuna Regional Fisheries Management Organizations have established minimum standards to use electronic monitoring (EM) in fisheries in addition to traditional observers. The EM systems produce a massive amount of video data that human analysts must process. Integrating artificial intelligence (AI) into their workflow can decrease that workload and improve the accuracy of the reports. However, species identification still poses significant challenges for AI, as achieving balanced performance across all species requires appropriate training data. Here, we quantify the difficulty experts face in distinguishing bigeye tuna (BET, Thunnus obesus) from yellowfin tuna (YFT, Thunnus albacares) using images captured by EM systems. We found inter-expert agreements of 42.9% $\pm$ 35.6% for BET and 57.1% $\pm$ 35.6% for YFT. We then present a multi-stage pipeline to estimate the species composition of the catches using a reliable ground-truth dataset based on identifications made by observers on board. Three segmentation approaches are compared: Mask R-CNN, a combination of DINOv2 with SAM2, and an integration of YOLOv9 with SAM2. We found that the last of these performs best, with a validation mean average precision of 0.66 $\pm$ 0.03 and a recall of 0.88 $\pm$ 0.03. Segmented individuals are tracked using ByteTrack. For classification, we evaluate a standard multiclass classification model and a hierarchical approach, finding that the hierarchical approach generalizes better. All our models were cross-validated during training and tested on fishing operations with fully known catch composition. Combining YOLOv9-SAM2 with the hierarchical classification produced the best estimations, with 84.8% of the individuals being segmented and classified with a mean average error of 4.5%.
comment: 23 pages, 5 figures
SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome
Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an average improvement of 9.78% in the gene-expression prediction task and 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
Driving in Spikes: An Entropy-Guided Object Detector for Spike Cameras
Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras offer microsecond latency and ultra-high dynamic range for object detection by using per-pixel asynchronous integrate-and-fire sampling. However, their sparse, discrete output cannot be processed by standard image-based detectors, posing a critical challenge for end-to-end spike-stream detection. We propose EASD, an end-to-end spike camera detector with a dual-branch design: a Temporal Based Texture plus Feature Fusion branch for global cross-slice semantics, and an Entropy Selective Attention branch for object-centric details. To close the data gap, we introduce DSEC Spike, the first driving-oriented simulated spike detection benchmark.
A Dataset and Baseline for Deep Learning-Based Visual Quality Inspection in Remanufacturing
Remanufacturing is a process in which worn products are restored to like-new condition, and it offers vast ecological and economic potential. A key step is the quality inspection of disassembled components, which is mostly done manually due to the high variety of parts and defect patterns. Deep neural networks show great potential to automate such visual inspection tasks but struggle to generalize to new product variants, components, or defect patterns. To tackle this challenge, we propose a novel image dataset depicting typical gearbox components in good and defective condition from two automotive transmissions. Depending on the train-test split of the data, different distribution shifts are generated to benchmark the generalization ability of a classification model. We evaluate different models using the dataset and propose a contrastive regularization loss to enhance model robustness. The results demonstrate that the loss improves generalization to unseen types of components.
HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation
Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: visual attack of MRAG by solely adding imperceptible perturbations at the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG's generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs BLIP-2 and LLaVA as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.
Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection
Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion strategies, most neglect fusion degradation, and none provide a theoretical analysis of its underlying causes. To fill this gap, this paper presents a systematic theoretical investigation of fusion degradation in multimodal detection and identifies two key optimization deficiencies: (1) the gradients of unimodal branch backbones are severely suppressed under multimodal architectures, resulting in under-optimization of the unimodal branches; (2) disparities in modality quality cause weaker modalities to experience stronger gradient suppression, which in turn results in imbalanced modality learning. To address these issues, this paper proposes a Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which consists of two modules. The RSC module and the MD module are designed to respectively amplify the suppressed gradients and eliminate inter-modality coupling interference as well as modality imbalance, thereby enabling the comprehensive optimization of each modality-specific backbone. Extensive experiments conducted on the FLIR, LLVIP, M3FD, and MFAD datasets demonstrate that the proposed method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks. The code and training procedures will be released at https://github.com/yikangshao/RSC-MD.
comment: This work has been submitted to the IEEE for possible publication
WarNav: An Autonomous Driving Benchmark for Segmentation of Navigable Zones in War Scenes
We introduce WarNav, a novel real-world dataset constructed from images of the open-source DATTALION repository, specifically tailored to enable the development and benchmarking of semantic segmentation models for autonomous ground vehicle navigation in unstructured, conflict-affected environments. This dataset addresses a critical gap between conventional urban driving resources and the unique operational scenarios encountered by unmanned systems in hazardous and damaged war-zones. We detail the methodological challenges encountered, ranging from data heterogeneity to ethical considerations, providing guidance for future efforts that target extreme operational contexts. To establish performance references, we report baseline results on WarNav using several state-of-the-art semantic segmentation models trained on structured urban scenes. We further analyse the impact of training data environments and propose a first step towards effective navigability in challenging environments with the constraint of having no annotation of the targeted images. Our goal is to foster impactful research that enhances the robustness and safety of autonomous vehicles in high-risk scenarios while being frugal in annotated data.
comment: Accepted at CAID (Conference on Artificial Intelligence for Defence)
D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models
Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
IPR-1: Interactive Physical Reasoner
Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
comment: 11 pages, 5 figures
Controlling False Positives in Image Segmentation via Conformal Prediction
Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post-hoc framework that constructs confidence masks with distribution-free, image-level control of false-positive predictions. Given any pretrained segmentation model, we define a nested family of shrunken masks obtained either by increasing the score threshold or by applying morphological erosion. A labeled calibration set is used to select a single shrink parameter via conformal prediction, ensuring that, for new images that are exchangeable with the calibration data, the proportion of false positives retained in the confidence mask stays below a user-specified tolerance with high probability. The method is model-agnostic, requires no retraining, and provides finite-sample guarantees regardless of the underlying predictor. Experiments on a polyp-segmentation benchmark demonstrate target-level empirical validity. Our framework enables practical, risk-aware segmentation in settings where over-segmentation can have clinical consequences. Code at https://github.com/deel-ai-papers/conseco.
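The calibration step described in the abstract above (choosing a single shrink parameter from a labeled calibration set) can be sketched as a split-conformal quantile over per-image thresholds. This is a minimal illustration under stated assumptions, not the authors' code: the shrink parameter is taken to be a score threshold, images are represented as flat lists of pixel scores and 0/1 labels, and the helper names `false_positive_proportion`, `minimal_safe_threshold`, and `calibrate_threshold` are hypothetical. The released repository may define the false-positive quantity and the guarantee differently.

```python
import math


def false_positive_proportion(scores, labels, threshold):
    """Fraction of retained pixels (score >= threshold) that are background."""
    kept = [y for s, y in zip(scores, labels) if s >= threshold]
    if not kept:
        return 0.0  # an empty mask contains no false positives
    return sum(1 for y in kept if y == 0) / len(kept)


def minimal_safe_threshold(scores, labels, grid, tolerance):
    """Smallest grid threshold t such that every grid threshold >= t is safe.

    Scanning from the strictest threshold downward makes the per-image
    score monotone even if the false-positive proportion itself is not.
    math.inf stands for the empty mask, which is always safe.
    """
    safe = math.inf
    for t in sorted(grid, reverse=True):
        if false_positive_proportion(scores, labels, t) > tolerance:
            break
        safe = t
    return safe


def calibrate_threshold(calibration, grid, tolerance, alpha):
    """Split-conformal choice of one threshold for all future images.

    calibration: list of (scores, labels) pairs, one per image.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest per-image safe
    threshold, or math.inf when n is too small for the requested alpha.
    """
    n = len(calibration)
    lambdas = sorted(
        minimal_safe_threshold(s, y, grid, tolerance) for s, y in calibration
    )
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else lambdas[k - 1]


if __name__ == "__main__":
    toy_calibration = [
        ([0.9, 0.8, 0.3], [1, 0, 0]),
        ([0.9, 0.6, 0.2], [1, 1, 0]),
        ([0.7, 0.6], [1, 1]),
    ]
    print(calibrate_threshold(toy_calibration, [0.1, 0.5, 0.85], 0.25, 0.5))
```

If a new image is exchangeable with the calibration images, its own safe threshold falls at or below the returned value with probability at least 1 - alpha, so the mask thresholded at that value keeps its false-positive proportion within the tolerance with the same probability. Replacing the threshold grid with a grid of erosion radii would give the morphological variant mentioned in the abstract.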
ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, thereby incurring huge labor costs and requiring expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind the dramatic performance degradation under high pruning ratios, which is caused by uniform sparsity; this raises an additional concern about how to design adaptive pruning sparsity suited to LLMs. Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called AutoPrune, which overcomes expert knowledge limits by leveraging LLMs to automatically design optimal pruning algorithms for themselves without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI, and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency in motion grounding, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. Significant improvements also carry over to downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers WACV 2026
Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present two approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at https://github.com/gihwan-kim/IPTQ-ViT.git.
comment: accepted in WACV 2026 (10 pages)
Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration
Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
Fast Post-Hoc Confidence Fusion for 3-Class Open-Set Aerial Object Detection
Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector's performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored. Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
Adaptive thresholding pattern for fingerprint forgery detection
Fingerprint liveness detection systems are vulnerable to spoofing, which is a severe threat to fingerprint-based biometric systems. It is therefore crucial to develop techniques that distinguish fake fingerprints from real ones. Software-based techniques can detect fingerprint forgery automatically. The scheme should also be resistant to various distortions such as noise contamination, pixel missing, and block missing, so that forgers cannot deceive the detector by adding distortions to the faked fingerprint. In this paper, we propose a fingerprint forgery detection algorithm based on a suggested adaptive thresholding pattern. The anisotropic diffusion of the input image is passed through three levels of the wavelet transform. The coefficients of the different layers are adaptively thresholded and concatenated to produce the feature vector, which is classified using an SVM classifier. Another contribution of the paper is an investigation of the effect of various distortions such as pixel missing, block missing, and noise contamination. Our suggested approach exhibits improved resistance against a range of distortions caused by environmental phenomena or manipulations by malicious users. In quantitative comparisons, our proposed method outperforms its counterparts by approximately 8% and 5% in accuracy for pixel missing scenarios of 90% and block missing scenarios of size 70x70, respectively. This highlights the novelty of the approach in addressing such challenges.
comment: 25 pages, 10 figures, Journal paper
What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.
A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data
Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer's self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
comment: 23 pages, 7 figures
Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models AAAI 2026
3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
comment: Accepted by AAAI 2026. 7 pages, 4 figures
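The entropy-weighted aggregation step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact weighting rule, so the `exp(-entropy)` weights and the function names below are assumptions made purely for illustration, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(model_logits, cache_logits):
    """Fuse two predictions, trusting the lower-entropy one more.

    Assumption: each source is weighted by exp(-entropy); the paper
    may use a different weighting function.
    """
    p_model = softmax(model_logits)
    p_cache = softmax(cache_logits)
    w_model = math.exp(-entropy(p_model))
    w_cache = math.exp(-entropy(p_cache))
    z = w_model + w_cache
    return [(w_model * a + w_cache * b) / z for a, b in zip(p_model, p_cache)]


if __name__ == "__main__":
    # A confident model prediction fused with an uninformative cache prediction.
    print(entropy_weighted_fusion([5.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

In this sketch a uniform (maximally uncertain) source still contributes, but with a smaller weight than a confident one, so the fused distribution stays closer to the confident prediction.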
Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language CVPR 2024
We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
comment: This paper builds upon and extends our earlier conference paper Text2Loc presented at CVPR 2024
Taming Generative Synthetic Data for X-ray Prohibited Item Detection
Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.
Edge-Centric Relational Reasoning for 3D Scene Graph Prediction
3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-centric graph neural networks, where relation edge features are iteratively updated by aggregating messages from connected object nodes. However, this design inherently restricts relation representations to pairwise object context, making it difficult to capture high-order relational dependencies that are essential for accurate relation prediction. To address this limitation, we propose a Link-guided Edge-centric relational reasoning framework with Object-aware fusion, namely LEO, which enables progressive reasoning from relation-level context to object-level understanding. Specifically, LEO first predicts potential links between object pairs to suppress irrelevant edges, and then transforms the original scene graph into a line graph where each relation is treated as a node. A line graph neural network is applied to perform edge-centric relational reasoning to capture inter-relation context. The enriched relation features are subsequently integrated into the original object-centric graph to enhance object-level reasoning and improve relation prediction. Our framework is model-agnostic and can be integrated with any existing object-centric method. Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, highlighting the effectiveness of our edge-to-object reasoning paradigm.
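The line-graph transformation described in the abstract above, where each relation becomes a node, can be sketched as follows. This is a minimal illustration that assumes two relations are connected when they share an object; the function name and the toy scene are hypothetical, and the link-prediction and neural-network stages of LEO are not modeled.

```python
from itertools import combinations


def line_graph(edges):
    """Build a line graph from a list of (subject, object) relation edges.

    Each original edge becomes a node (identified by its index), and two
    such nodes are connected when the original edges share an endpoint.
    """
    adjacency = {i: set() for i in range(len(edges))}
    for i, j in combinations(range(len(edges)), 2):
        if set(edges[i]) & set(edges[j]):
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


if __name__ == "__main__":
    # Toy scene: object names are illustrative only.
    relations = [
        ("chair", "floor"),
        ("table", "floor"),
        ("lamp", "table"),
        ("cup", "shelf"),
    ]
    for idx, neighbors in line_graph(relations).items():
        print(relations[idx], "->", sorted(neighbors))
```

Message passing over this adjacency lets a relation such as `("lamp", "table")` see the context of `("table", "floor")`, which is the kind of inter-relation context the abstract refers to.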
Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception
In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
Graph Query Networks for Object Detection with Automotive Radar WACV 2026
Object detection with 3D radar is essential for 360-degree automotive perception, but radar's long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird's-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.
comment: Accepted in WACV 2026 Main Conference
SplitFlux: Learning to Decouple Content and Style from a Single Image
Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux model fails to achieve effective content-style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) single stream blocks are essential for image generation; and (2) early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single stream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.
GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning
Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. This raises the question of whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of a GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Code and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
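To make the compression-ratio definition in the abstract above concrete (text tokens divided by latent tokens), here is a small arithmetic sketch. The helper names are hypothetical, and the token counts are derived only from the latent lengths (32, 64) and ratios (20x, 40x) quoted in the abstract.

```python
def compression_ratio(num_text_tokens: int, num_latent_tokens: int) -> float:
    """Ratio of text tokens to latent tokens, as defined in the abstract."""
    if num_latent_tokens <= 0:
        raise ValueError("num_latent_tokens must be positive")
    return num_text_tokens / num_latent_tokens


def text_tokens_at_ratio(num_latent_tokens: int, ratio: float) -> int:
    """Number of text tokens represented by a latent budget at a given ratio."""
    return int(num_latent_tokens * ratio)


if __name__ == "__main__":
    for latents in (32, 64):
        for ratio in (20, 40):
            tokens = text_tokens_at_ratio(latents, ratio)
            print(f"{latents} latent tokens at {ratio}x cover {tokens} text tokens")
```

Under these figures, 32 latent tokens at 20x correspond to 640 text tokens, and 64 latent tokens at 40x correspond to 2560 text tokens.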
SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning
We present SkinGPT-R1, a dermatology focused vision language model that makes diagnostic chain of thought reasoning explicit, step by step, and verifiable. To support skin specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain of thought narratives that combines 10,000 DermEval filtered training cases with 3,000 dermatologist scored certified cases, and we define DermEval as a physician aligned six dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain of thought quality. On DermBench, across 14 general, reasoning, and medical vision language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician defined dimensions, ranks 1st among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision language models. Ablation results further show that DermCoT based chain of thought supervision provides substantial improvements over the base model and that adding dermatology aware visual distillation yields consistent additional gains in both narrative quality and recognition.
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore are often unable to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To overcome these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The architecture comprises three main stages: (1) extraction of spatial and semantic information as multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval
This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source describing the visual appearance of a dish when learning representations, as existing approaches do, creates bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests that ingredients are one of the confounder sources and that a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle retrieval performance on the Recipe1M dataset is MedR=1 across testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. We report new state-of-the-art search performance on the Recipe1M dataset.
Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition
Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI
Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson $r=0.98$, Spearman $ρ=0.97$, $R^2=0.95$) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer's disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.
MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction WACV2026
This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. (a) Coverage: motions should be distributed among multiple motion modes to cover diverse possibilities. (b) Validity: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply appreciate widely distributed motions even if these motions are observed in a single mode and kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: https://github.com/placerkyo/MMCM
comment: Accepted to WACV2026
Data-driven Prediction of Species-Specific Plant Responses to Spectral-Shifting Films from Leaf Phenotypic and Photosynthetic Traits
The application of spectral-shifting films (SF) in greenhouses to shift green light to red light has shown variable growth responses across crop species. However, the yield enhancement of crops under altered light quality is related to the collective effects of the specific biophysical characteristics of each species. Considering only one attribute of a crop has limitations in understanding the relationship between sunlight quality adjustments and crop growth performance. Therefore, this study aims to use artificial intelligence to comprehensively link multiple plant phenotypic traits and the daily light integral, which reflect the physiological responses of crops, to their growth outcomes under SF. Between 2021 and 2024, various leafy, fruiting, and root crops were grown in greenhouses covered with either PEF or SF, and leaf reflectance, leaf mass per area, chlorophyll content, daily light integral, and light saturation point were measured from the plants cultivated in each condition. A total of 210 data points were collected; because this was insufficient to train deep learning models, a variational autoencoder was used for data augmentation. Most crop yields showed an average increase of 22.5% under SF. These data were used to train several models, including logistic regression, decision tree, random forest, XGBoost, and feedforward neural network (FFNN), to perform binary classification of whether SF application had a significant effect on yield. The FFNN achieved a high classification accuracy of 91.4% on a test dataset that was not used for training. This study provides insight into the complex interactions between leaf phenotypic and photosynthetic traits, environmental conditions, and solar spectral components by improving the ability to predict solar spectral shift effects using SF.
Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation
Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for self-supervised robust depth estimation tasks. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy of latency models for the depth estimation task to capture optimization states across training stages. To effectively leverage latency models, we introduce a self-evolution contrastive loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the need for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
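The gradient-guidance idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `p_guidance` parameter, and the way the Bernoulli draw gates the surrogate gradient are all assumptions made for illustration, since the abstract does not specify them. The sketch uses the vector from the previously optimal parameters to the current parameters as a stand-in for the missing old-task gradient, so that a descent step pulls the model back toward the old optimum.

```python
import random


def guided_gradient(current, old_optimum, replay_grad=None, p_guidance=0.5, rng=random):
    """Return a surrogate for the old-task gradient (hypothetical helper).

    current, old_optimum: lists of floats (flattened parameters).
    replay_grad: optional real gradient computed on a small replay buffer.
    p_guidance: assumed Bernoulli probability of applying the directional guidance.
    """
    # Direction from the old optimum to the current point; descending along it
    # moves the parameters back toward the previously optimal solution.
    direction = [c - o for c, o in zip(current, old_optimum)]
    use_guidance = rng.random() < p_guidance  # Bernoulli(p_guidance) draw
    if not use_guidance:
        # Plasticity-favouring step: only the replay gradient (if any) is used.
        return [0.0] * len(current) if replay_grad is None else list(replay_grad)
    if replay_grad is None:
        return direction
    # Stability-favouring step: combine guidance with the real replay gradient.
    return [d + g for d, g in zip(direction, replay_grad)]


def sgd_step(params, new_task_grad, old_task_grad, lr=0.1):
    """Plain gradient-descent update using new-task plus surrogate old-task gradients."""
    return [p - lr * (gn + go) for p, gn, go in zip(params, new_task_grad, old_task_grad)]
```

Setting `p_guidance` closer to 1 favours stability (staying near the old optimum), while values closer to 0 favour plasticity; how the actual method schedules this probability is not stated in the abstract.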
Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation ML4H 2025
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
comment: Accepted as proceedings paper for ML4H 2025
SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection WACV 2026
Accurate, up-to-date High-Definition (HD) maps are critical for urban planning, infrastructure monitoring, and autonomous navigation. However, these maps quickly become outdated as environments evolve, creating a need for robust methods that not only detect changes but also incorporate them into updated 3D representations. While change detection techniques have advanced significantly, there remains a clear gap between detecting changes and actually updating 3D maps, particularly when relying on 2D image-based change detection. To address this gap, we introduce SceneEdited, the first city-scale dataset explicitly designed to support research on HD map maintenance through 3D point cloud updating. SceneEdited contains over 800 up-to-date scenes covering 73 km of driving and approximately 3 $\text{km}^2$ of urban area, with more than 23,000 synthesized object changes created both manually and automatically across 2000+ out-of-date versions, simulating realistic urban modifications such as missing roadside infrastructure, buildings, overpasses, and utility poles. Each scene includes calibrated RGB images, LiDAR scans, and detailed change masks for training and evaluation. We also provide baseline methods using a foundational image-based structure-from-motion pipeline for updating outdated scenes, as well as a comprehensive toolkit supporting scalability, trackability, and portability for future dataset expansion and unification of out-of-date object annotations. Both the dataset and the toolkit are publicly available at https://github.com/ChadLin9596/ScenePoint-ETK, establishing a standardized benchmark for 3D map updating research.
comment: accepted by WACV 2026
DCL-SE: Dynamic Curriculum Learning for Spatiotemporal Encoding of Brain Imaging
High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer's disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
WaveFuse-AL: Cyclical and Performance-Adaptive Multi-Strategy Active Learning for Medical Images
Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inconsistent behavior across different stages of the active learning cycle. We propose Cyclical and Performance-Adaptive Multi-Strategy Active Learning (WaveFuse-AL), a novel framework that adaptively fuses multiple established acquisition strategies (BALD, BADGE, Entropy, and CoreSet) throughout the learning process. WaveFuse-AL integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust strategy importance over time. We evaluate WaveFuse-AL on three medical imaging benchmarks: APTOS-2019 (multi-class classification), RSNA Pneumonia Detection (binary classification), and ISIC-2018 (skin lesion segmentation). Experimental results demonstrate that WaveFuse-AL consistently outperforms both single-strategy and alternating-strategy baselines, achieving statistically significant performance improvements (on ten out of twelve metric measurements) while maximizing the utility of limited annotation budgets.
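As a rough illustration of fusing acquisition strategies with a cyclical prior and a performance term, the sketch below computes per-strategy weights for one active learning round. The abstract does not give the actual weighting rule, so the phase offsets, the additive combination, the softmax normalisation, and the `period` and `temperature` parameters are all assumptions; only the four strategy names come from the text.

```python
import math

STRATEGIES = ["BALD", "BADGE", "Entropy", "CoreSet"]


def fusion_weights(round_idx, recent_gain, period=8, temperature=1.0):
    """Return normalised strategy weights for one round (illustrative only).

    round_idx: index of the current active learning round.
    recent_gain: dict mapping strategy name to a recent validation gain
                 (hypothetical performance signal; missing entries count as 0).
    """
    n = len(STRATEGIES)
    scores = []
    for k, name in enumerate(STRATEGIES):
        # Sinusoidal prior in [0, 1]; each strategy gets its own phase offset
        # so that the strategies peak at different points of the cycle.
        phase = 2 * math.pi * k / n
        prior = 0.5 * (1 + math.sin(2 * math.pi * round_idx / period + phase))
        scores.append(prior + recent_gain.get(name, 0.0))
    # Numerically stable softmax over the combined scores.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return {name: e / total for name, e in zip(STRATEGIES, exps)}


def fused_score(per_strategy_scores, weights):
    """Weighted sum of one sample's acquisition scores across strategies."""
    return sum(weights[name] * per_strategy_scores[name] for name in weights)
```

Samples would then be ranked by `fused_score` and the top-ranked ones sent for annotation; in practice each strategy's raw scores would need to be put on a comparable scale first.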
Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation
Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of SAM, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM relies heavily on accurate and explicit prompts, so previous approaches mainly focus on extracting prompts from the support set, which is insufficient to activate the generalization ability of SAM, and this design easily results in a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features: a global supplement at the image level, which provides a generalized category indication from the support image, and local guidance at the pixel level, which provides a useful target location from the query image. In addition, to generate target-focused prompt embeddings, we propose a learnable visual-text target prompt generator that lets target text embeddings interact with CLIP visual features. Without requiring re-training of the vision foundation models, the features with semantic discrimination draw attention to the target region through the guidance of prompts with rich target information.
An Event-triggered System for Social Persuasion and Danger Alert in Elder Home Monitoring
In this study, both the physical and mental states of elders are considered, and an event-triggered system has been developed to detect three events: watch dog, danger notice, and photo link. By adopting GMM background modeling, the motion behavior of visitors and elders can be detected in the watch dog event and the danger notice event, respectively. Experiments were set in home scenarios, and 5 families participated in detecting and recording the three types of events from their life activities. In addition, the captured images were analyzed using SVM machine learning. Because elders often lack technical experience, an intuitive operation resembling a normal life activity was designed to create communication between elders and relatives via social media.
comment: Accepted in the 35th IPPR Conference on Computer Vision, Graphics, and Image Processing (CVGIP2022)
Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting AAAI 2026
The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still exhibit noticeable visual discrepancies when synthesizing views at sampling rates unseen during training. Specifically, they suffer from (i) erosion-induced blurring artifacts when zooming in and (ii) dilation-induced staircase artifacts when zooming out. We speculate that these artifacts arise from the fundamental limitation of the alpha blending adopted in 3DGS methods. Instead of the conventional alpha blending that computes alpha and transmittance as scalar quantities over a pixel, we propose to replace it with our novel Gaussian Blending that treats alpha and transmittance as spatially varying distributions. Thus, transmittances can be updated considering the spatial distribution of alpha values across the pixel area, allowing nearby background splats to contribute to the final rendering. Our Gaussian Blending maintains real-time rendering speed and requires no additional memory cost, while being easily integrated as a drop-in replacement into existing 3DGS-based or other NVS frameworks. Extensive experiments demonstrate that Gaussian Blending effectively captures fine details at various sampling rates unseen during training, consistently outperforming existing novel view synthesis models across both unseen and seen sampling rates.
comment: AAAI 2026
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
comment: 14 pages, 2 figures
Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis
Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present the jointly conditioned diffusion model (JCDM), a diffusion framework that exploits multi-view priors. The appearance prior module (APM) infers a holistic, identity-preserving prior from incomplete references, and the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones with minimal and targeted architectural modifications. Experiments demonstrate state-of-the-art fidelity and cross-view consistency.
BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer
Document Visual Question Answering (DocVQA) is a fundamental task for multimodal document understanding and a key testbed for vision-language reasoning. However, most existing DocVQA datasets are limited to the page level and lack fine-grained spatial grounding, constraining the interpretability and reasoning capability of Vision Language Models (VLMs). To address this gap, we introduce BBox DocVQA, a large-scale, bounding-box-grounded dataset designed to enhance spatial reasoning and evidence localization in visual documents. We further present an automated construction pipeline, Segment Judge and Generate, which integrates a segment model for region segmentation, a VLM for semantic judgment, and another advanced VLM for question-answer generation, followed by human verification for quality assurance. The resulting dataset contains 3.6 K diverse documents and 32 K QA pairs, encompassing single- and multi-region as well as single- and multi-page scenarios. Each QA instance is grounded on explicit bounding boxes, enabling fine-grained evaluation of spatial semantic alignment. Benchmarking multiple state-of-the-art VLMs (e.g., GPT 5, Qwen2.5 VL, and InternVL) on BBox DocVQA reveals persistent challenges in spatial grounding and reasoning accuracy. Furthermore, fine-tuning on BBox DocVQA substantially improves both bounding box localization and answer generation, validating its effectiveness for enhancing the reasoning ability of VLMs. Our dataset and code will be publicly released to advance research on interpretable and spatially grounded vision-language reasoning.
comment: 22 pages, 4 figures
DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions
Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are unavoidably affected by image degradations such as noise and blur. Many such degradations also contribute to a loss of image contrast, which becomes especially pronounced in deeper regions of thick samples. Today, the best-performing methods to increase the quality of images are based on Deep Learning approaches, which typically require ground truth (GT) data during training. Our inability to counteract blurring and contrast loss when imaging deep into samples prevents the acquisition of such clean GT data. The fact that the forward process of blurring and contrast loss deep into tissue can be modeled allowed us to propose a new method that can circumvent the problem of unobtainable GT data. To this end, we first synthetically degraded the quality of microscopy images even further by using an approximate forward model for deep tissue image degradations. Then we trained a neural network that learned the inverse of this degradation function from our generated pairs of raw and degraded images. We demonstrated that networks trained in this way can be used out-of-distribution (OOD) to improve the quality of less severely degraded images, e.g. the raw data imaged in a microscope. Since the absolute level of degradation in such microscopy images can be stronger than the additional degradation introduced by our forward model, we also explored the effect of iterative predictions. Here, we observed that in each iteration the measured image contrast kept improving while detailed structures in the images were increasingly removed. Therefore, depending on the desired downstream analysis, a balance between contrast improvement and retention of image details has to be found.
comment: 8 pages, 7 figures, 1 table
Measuring the (Un)Faithfulness of Concept-Based Explanations
Deep vision models perform input-output computations that are hard to interpret. Concept-based explanation methods (CBEMs) increase interpretability by re-expressing parts of the model with human-understandable semantic units, or concepts. Checking if the derived explanations are faithful -- that is, they represent the model's internal computation -- requires a surrogate that combines concepts to compute the output. Simplifications made for interpretability inevitably reduce faithfulness, resulting in a tradeoff between the two. State-of-the-art unsupervised CBEMs (U-CBEMs) have reported increasingly interpretable concepts, while also being more faithful to the model. However, we observe that the reported improvement in faithfulness artificially results from either (1) using overly complex surrogates, which introduces an unmeasured cost to the explanation's interpretability, or (2) relying on deletion-based approaches that, as we demonstrate, do not properly measure faithfulness. We propose Surrogate Faithfulness (SURF), which (1) replaces prior complex surrogates with a simple, linear surrogate that measures faithfulness without changing the explanation's interpretability and (2) introduces well-motivated metrics that assess loss across all output classes, not just the predicted class. We validate SURF with a measure-over-measure study by proposing a simple sanity check -- explanations with random concepts should be less faithful -- which prior surrogates fail. SURF enables the first reliable faithfulness benchmark of U-CBEMs, revealing that many visually compelling U-CBEMs are not faithful. Code to be released.
comment: Pre-print
TrackStudio: An Integrated Toolkit for Markerless Tracking
Markerless motion tracking has advanced rapidly in the past 10 years and currently offers powerful opportunities for behavioural, clinical, and biomechanical research. While several specialised toolkits provide high performance for specific tasks, using existing tools still requires substantial technical expertise. There remains a gap in accessible, integrated solutions that deliver sufficient tracking for non-experts across diverse settings. TrackStudio was developed to address this gap by combining established open-source tools into a single, modular, GUI-based pipeline that works out of the box. It provides automatic 2D and 3D tracking, calibration, preprocessing, feature extraction, and visualisation without requiring any programming skills. We supply a user guide with practical advice for video acquisition, synchronisation, and setup, alongside documentation of common pitfalls and how to avoid them. To validate the toolkit, we tested its performance across three environments using either low-cost webcams or high-resolution cameras, including challenging conditions for body position, lighting, and space and obstructions. Across 76 participants, average inter-frame correlations exceeded 0.98 and average triangulation errors remained low (<13.6mm for hand tracking), demonstrating stable and consistent tracking. We further show that the same pipeline can be extended beyond hand tracking to other body and face regions. TrackStudio provides a practical, accessible route into markerless tracking for researchers or laypeople who need reliable performance without specialist expertise.
comment: 26 pages, 5 main text figures, 5 supplementary figures
DINOv3 as a Frozen Encoder for CRPS-Oriented Probabilistic Rainfall Nowcasting
This paper proposes a competitive and computationally efficient approach to probabilistic rainfall nowcasting. A video projector (V-JEPA Vision Transformer) paired with a lightweight probabilistic head is attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF (eCDF) over 4-hour accumulated rainfall. The projector-head is optimized end-to-end over the Ranked Probability Score (RPS). As an alternative, 3D-UNET baselines trained with an aggregate Ranked Probability Score and a per-pixel Gamma-Hurdle objective are used. On the Weather4Cast 2025 benchmark, the proposed method achieved promising performance, with a CRPS of 3.5102, which represents an effectiveness gain of $\approx$ 26% over the best 3D-UNET.
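For readers unfamiliar with the Ranked Probability Score, the sketch below computes it for a single forecast expressed as a discrete CDF over rainfall thresholds. It uses the common unnormalised definition (sum of squared differences between the forecast CDF and the observation's step CDF); the threshold values and function name are illustrative assumptions, and the paper's exact binning and normalisation are not given in the abstract.

```python
def ranked_probability_score(pred_cdf, thresholds, observed):
    """Unnormalised RPS for one forecast.

    pred_cdf: predicted P(rainfall <= t) for each threshold t, non-decreasing.
    thresholds: increasing rainfall thresholds (hypothetical bin edges).
    observed: the observed accumulated rainfall.
    """
    # The observation corresponds to a step-function CDF: 0 below it, 1 at or above it.
    obs_cdf = [1.0 if observed <= t else 0.0 for t in thresholds]
    return sum((f - o) ** 2 for f, o in zip(pred_cdf, obs_cdf))


# Example with four illustrative thresholds (units are an assumption).
thresholds = [0.0, 1.0, 2.0, 4.0]
sharp = ranked_probability_score([0.0, 0.0, 1.0, 1.0], thresholds, observed=1.5)
vague = ranked_probability_score([0.5, 0.5, 0.5, 1.0], thresholds, observed=1.5)
```

Lower is better: the forecast that places all mass in the correct bin scores 0, while the vaguer forecast is penalised at every threshold where its CDF disagrees with the observation. Weighting each term by the width of its bin would turn this sum into a discretised approximation of the CRPS integral.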
Distribution Matching Distillation Meets Reinforcement Learning
Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model into a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that incorporates Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularizer than traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. Together, these allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR achieves leading visual quality and prompt coherence among few-step methods, and even exceeds the performance of the multi-step teacher.
comment: The synergy of reinforcement learning and distribution matching distillation. See more: https://github.com/vvvvvjdy/dmdr
Deep Spectral Prior
We introduce the Deep Spectral Prior (DSP), a new framework for unsupervised image reconstruction that operates entirely in the complex frequency domain. Unlike the Deep Image Prior (DIP), which optimises pixel-level errors and is highly sensitive to overfitting, DSP performs joint learning of amplitude and phase to capture the full spectral structure of images. We derive a rigorous theoretical characterisation of DSP's optimisation dynamics, proving that it follows frequency-dependent descent trajectories that separate informative low-frequency modes from stochastic high-frequency noise. This spectral mode separation explains DSP's self-regularising behaviour and, for the first time, formally establishes the elimination of DIP's major limitation: its reliance on manual early stopping. Moreover, DSP induces an implicit projection onto a frequency-consistent manifold, ensuring convergence to stable, physically plausible reconstructions without explicit priors or supervision. Extensive experiments on denoising, inpainting, and deblurring demonstrate that DSP consistently surpasses DIP and other unsupervised baselines, achieving superior fidelity, robustness, and theoretical interpretability within a unified, unsupervised data-free framework.
Alpha Divergence Losses for Biometric Verification
Performance in face and speaker verification is largely driven by margin-based softmax losses like CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M, caused by the interplay of penalized logits and sparsity, with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount.
comment: Found something suboptimal in results
Interpretable Retinal Disease Prediction Using Biology-Informed Heterogeneous Graph Representations
Interpretability is crucial to enhance trust in machine learning models for medical diagnostics. However, most state-of-the-art image classifiers based on neural networks are not interpretable. As a result, clinicians often resort to known biomarkers for diagnosis, although biomarker-based classification typically performs worse than large neural networks. This work proposes a method that surpasses the performance of established machine learning models while simultaneously improving prediction interpretability for diabetic retinopathy staging from optical coherence tomography angiography (OCTA) images. Our method is based on a novel biology-informed heterogeneous graph representation that models retinal vessel segments, intercapillary areas, and the foveal avascular zone (FAZ) in a human-interpretable way. This graph representation allows us to frame diabetic retinopathy staging as a graph-level classification task, which we solve using an efficient graph neural network. We benchmark our method against well-established baselines, including classical biomarker-based classifiers, convolutional neural networks (CNNs), and vision transformers. Our model outperforms all baselines on two datasets. Crucially, we use our biology-informed graph to provide explanations of unprecedented detail. Our approach surpasses existing methods in precisely localizing and identifying critical vessels or intercapillary areas. In addition, we give informative and human-interpretable attributions to critical characteristics. Our work contributes to the development of clinical decision-support tools in ophthalmology.
The Role of Radiographic Knee Alignment in Total Knee Replacement Outcomes and Opportunities for Artificial Intelligence-Driven Assessment
Knee osteoarthritis (OA) is one of the most widespread and burdensome health problems [1-4]. Total knee replacement (TKR) may be offered as treatment for end-stage knee OA. Nevertheless, TKR is an invasive procedure involving prosthesis implantation at the knee joint, and around 10% of patients are dissatisfied following TKR [5,6]. Dissatisfaction is often assessed through patient-reported outcome measures (PROMs) [7], which are usually completed by patients and assessed by health professionals to evaluate the condition of TKR patients. In clinical practice, predicting poor TKR outcomes in advance could help optimise patient selection and improve management strategies. Radiographic knee alignment is an important biomarker for predicting TKR outcomes and long-term joint health. Abnormalities such as femoral or tibial deformities can directly influence surgical planning, implant selection, and postoperative recovery [8,9]. Traditional alignment measurement is manual, time-consuming, and requires long-leg radiographs, which are not always undertaken in clinical practice. Instead, standard anteroposterior (AP) knee radiographs are often the main imaging modality. Automated methods for alignment assessment in standard knee radiographs are potentially clinically valuable for improving efficiency in the knee OA treatment pathway.
Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation
Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. However, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.
One Latent Space to Rule All Degradations: Unifying Restoration Knowledge for Image Fusion
All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, aim to address complex scenes by mitigating degradations from source images and generating high-quality fused images. Mainstream ADFMs rely on end-to-end learning and heavily synthesized datasets to achieve degradation awareness and fusion. This coarse learning strategy and the dependence on non-real-world datasets often limit their upper-bound performance, leading to low-quality results. To address these limitations, we present LURE, a degradation-aware Learning-driven Unified REpresentation model for infrared and visible image fusion. LURE learns a Unified Latent Feature Space (ULFS) to avoid the dependency on complex data formats inherent in previous end-to-end learning pipelines. It further improves image fusion quality by leveraging the intrinsic relationships between modalities. A novel loss function is also proposed to make the learning of unified latent representations more stable. More importantly, LURE seamlessly incorporates existing high-quality real-world image restoration datasets. To further enhance the model's representation capability, we design a simple yet effective structure, termed internal residual block, to facilitate the learning of latent features. Experiments show our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code is available in the supplementary materials.
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference ICML
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.
comment: @inproceedings{zhang2025spargeattn, title={Spargeattn: Accurate sparse attention accelerating any model inference}, author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei}, booktitle={International Conference on Machine Learning (ICML)}, year={2025} }
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B. The code is available at https://github.com/thu-ml/SLA.
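The three-way split described above can be sketched as a per-row ranking of block scores. This is only a toy illustration: the function `classify_blocks`, the label encoding, and the 5%/10% fractions are hypothetical choices, and the actual SLA selection rule and fused kernel are not reproduced here. The last line checks that the two figures quoted in the abstract agree, since keeping 5% of the quadratic work corresponds to a 20x reduction.

```python
import numpy as np


def classify_blocks(block_scores, critical_frac=0.05, negligible_frac=0.10):
    """Label each block in every row: 1 = critical, 0 = marginal, -1 = negligible.

    block_scores: (rows, blocks) array of importance scores per attention block.
    The fractions are illustrative placeholders, not values from the paper.
    """
    n = block_scores.shape[-1]
    n_crit = max(1, int(round(critical_frac * n)))
    n_negl = int(round(negligible_frac * n))
    order = np.argsort(-block_scores, axis=-1)        # block indices, best first
    labels = np.zeros(block_scores.shape, dtype=int)  # default: marginal
    rows = np.arange(block_scores.shape[0])[:, None]
    labels[rows, order[:, :n_crit]] = 1               # top blocks: O(N^2) attention
    if n_negl > 0:
        labels[rows, order[:, n - n_negl:]] = -1      # bottom blocks: skipped
    return labels


rng = np.random.default_rng(0)
labels = classify_blocks(rng.random((4, 20)))

# A 95% reduction in attention computation equals a 1 / (1 - 0.95) = 20x reduction.
reduction_factor = 1.0 / (1.0 - 0.95)
```

In this picture the marginal blocks (label 0) would be routed to a linear-attention branch, which is what keeps their cost at O(N) instead of dropping them outright.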
Fairness-Aware Deepfake Detection: Leveraging Dual-Mechanism Optimization
Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3-72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), encouraging the models to identify shapes, count and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2%), and the mean MindCube accuracy rose from 31.4% to 38.1% (+6.7%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer broadly transferable spatial skills on vision-language models. Code and the Euclid30K dataset can be found at https://zgca-ai4edu.github.io/Euclids_Gift.
SymGS: Leveraging Local Symmetries for 3D Gaussian Splatting Compression
3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, SymGS, which introduces learnable mirrors into the scene, thereby eliminating local and global reflective redundancies. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods (e.g., HAC) to achieve further compression. Compared to HAC, we achieve $1.66\times$ compression across benchmark datasets (up to $3\times$ on large-scale scenes). On average, SymGS enables $108\times$ compression of a 3DGS scene while preserving rendering quality. The project page and supplementary material can be found at symgs.github.io
comment: Project Page: https://symgs.github.io/
Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models AAAI-26
Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, captioning, $\textit{etc}$. However, hallucination remains a persistent issue. While numerous methods have been proposed to mitigate hallucinations, achieving notable improvements, these methods primarily focus on mitigating hallucinations about $\textbf{object/noun-related}$ concepts. Verb concepts, crucial for understanding human actions, have been largely overlooked. In this paper, to the best of our knowledge, we are the $\textbf{first}$ to investigate the $\textbf{verb hallucination}$ phenomenon of MLLMs from various perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from severe verb hallucination. To assess the effectiveness of existing mitigation methods for object concept hallucination on verb hallucination, we evaluated these methods and found that they do not effectively address verb hallucination. To address this issue, we propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination. The experiment results demonstrate that our method significantly reduces hallucinations related to verbs.
comment: Accepted by AAAI-26
Class-Aware PillarMix: Can Mixed Sample Data Augmentation Enhance 3D Object Detection with Radar Point Clouds? IROS 2025
Due to the significant effort required for data collection and annotation in 3D perception tasks, mixed sample data augmentation (MSDA) has been widely studied to generate diverse training samples by mixing existing data. Recently, many MSDA techniques have been developed for point clouds, but they mainly target LiDAR data, leaving their application to radar point clouds largely unexplored. In this paper, we examine the feasibility of applying existing MSDA methods to radar point clouds and identify several challenges in adapting these techniques. These obstacles stem from the radar's irregular angular distribution, deviations from a single-sensor polar layout in multi-radar setups, and point sparsity. To address these issues, we propose Class-Aware PillarMix (CAPMix), a novel MSDA approach that applies MixUp at the pillar level in 3D point clouds, guided by class labels. Unlike methods that apply a single mix ratio to the entire sample, CAPMix assigns an independent ratio to each pillar, boosting sample diversity. To account for the density of different classes, we use class-specific distributions: for dense objects (e.g., large vehicles), we skew ratios to favor points from another sample, while for sparse objects (e.g., pedestrians), we sample more points from the original. This class-aware mixing retains critical details and enriches each sample with new information, ultimately generating more diverse training data. Experimental results demonstrate that our method not only significantly boosts performance but also outperforms existing MSDA approaches across two datasets (Bosch Street and K-Radar). We believe that this straightforward yet effective approach will spark further investigation into MSDA techniques for radar data.
comment: 8 pages, 6 figures, 4 tables, accepted to 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). Code: https://github.com/boschresearch/CAPMIX
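The per-pillar, class-aware mixing described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the released CAPMix code: the dictionary-of-pillars layout, the `CLASS_BETA` parameters, the two class names, and the point-wise Bernoulli selection are all hypothetical choices made here to show one ratio being drawn independently for each pillar.

```python
import numpy as np

# Hypothetical Beta(a, b) parameters per class. The mean a / (a + b) is the expected
# share of points kept from the original sample: low for dense objects, high for sparse.
CLASS_BETA = {"dense": (2.0, 8.0), "sparse": (8.0, 2.0)}


def mix_pillars(pillars_a, pillars_b, pillar_class, rng, n_features=4):
    """Mix two samples pillar by pillar, with an independent ratio per pillar.

    pillars_a, pillars_b: dict mapping pillar id -> (N, n_features) array of points
    pillar_class:         dict mapping pillar id -> "dense" or "sparse"
    """
    empty = np.empty((0, n_features))
    mixed, ratios = {}, {}
    for pid in sorted(set(pillars_a) | set(pillars_b)):
        a, b = CLASS_BETA[pillar_class.get(pid, "sparse")]
        lam = rng.beta(a, b)                          # this pillar's own mix ratio
        pts_a = pillars_a.get(pid, empty)
        pts_b = pillars_b.get(pid, empty)
        keep_a = rng.random(len(pts_a)) < lam         # keep original points w.p. lam
        keep_b = rng.random(len(pts_b)) < 1.0 - lam   # take foreign points w.p. 1 - lam
        mixed[pid] = np.concatenate([pts_a[keep_a], pts_b[keep_b]], axis=0)
        ratios[pid] = lam
    return mixed, ratios
```

Drawing the ratio inside the loop, instead of once per sample, is what distinguishes this from sample-level MixUp; how pillars are assigned a class (for example from annotated boxes) is left outside the sketch.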
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1\%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
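For readers unfamiliar with the FPR95 metric reported above, the following sketch shows its usual definition: the false positive rate on OOD samples at the score threshold that retains 95% of ID samples. The function name and the convention that a higher score means "more in-distribution" are assumptions of this example; it is not the evaluation code used by the authors.

```python
import numpy as np


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID when 95% of ID samples are accepted.

    Assumes a higher score means the sample looks more in-distribution.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # 95% of ID scores lie at or above the 5th percentile of the ID scores.
    threshold = np.percentile(id_scores, 5)
    return float(np.mean(ood_scores >= threshold))


# Toy scores: ID scores spread over [0, 1]; two of four OOD scores exceed the threshold.
example_fpr = fpr_at_95_tpr(np.linspace(0.0, 1.0, 101), [0.0, 0.01, 0.5, 0.9])
```

Lower is better: a detector that separates the two score distributions perfectly reaches an FPR95 of 0.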
Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion AAAI
Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce a novel framework for streaming gesture generation that extends Rolling Diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. Our framework is universally compatible with existing diffusion-based gesture generation models, transforming them into streaming methods capable of continuous generation without requiring post-processing. We evaluate our framework on ZEGGS and BEAT, strong benchmarks for real-world applicability. Applied to state-of-the-art baselines on both datasets, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time co-speech gesture synthesis. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that employs a ladder-based noise scheduling strategy to simultaneously denoise multiple frames. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 4x speedup with high visual fidelity and temporal coherence in our experiments. Comprehensive user studies further validate our framework's ability to generate realistic, diverse gestures closely synchronized with the audio input.
comment: Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI-26) Main Track
Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery
Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address these two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the proposed GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.
ViewBridge: Revisiting Cross-View Localization from Image Matching
Cross-view localization aims to estimate the 3-DoF pose of a ground-view image by aligning it with aerial or satellite imagery. Existing methods typically address this task through direct regression or feature alignment in a shared bird's-eye view (BEV) space. Although effective for coarse alignment, these methods fail to establish fine-grained and geometrically reliable correspondences under large viewpoint variations, thereby limiting both the accuracy and interpretability of localization results. Consequently, we revisit cross-view localization from the perspective of image matching and propose a unified framework that enhances both matching and localization. Specifically, we introduce a Surface Model that constrains BEV feature projection to physically valid regions for geometric consistency, and a SimRefiner that adaptively refines similarity distributions to enhance match reliability. To further support research in this area, we present CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach achieves geometry-consistent and fine-grained correspondences across extreme viewpoints and further improves the accuracy and stability of cross-view localization.
Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection
The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model's internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.
Causal Representation Learning with Observational Grouping for CXR Classification MICCAI
Identifiable causal representation learning seeks to uncover the true causal relationships underlying a data generation process. In medical imaging, this presents opportunities to improve the generalisability and robustness of task-specific latent features. This work introduces the concept of grouping observations to learn identifiable representations for disease classification in chest X-rays via an end-to-end framework. Our experiments demonstrate that these causal representations improve generalisability and robustness across multiple classification tasks when grouping is used to enforce invariance with respect to race, sex, and imaging views.
comment: Proceedings of the 3rd FAIMI Workshop at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025, Daejeon, South Korea
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
comment: Accepted to TMLR 2025. First two authors contributed equally
IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/SIGMME/IWR-Bench.
Beacon2Science: Enhancing STEREO/HI beacon data with machine learning for efficient CME tracking
Observing and forecasting coronal mass ejections (CMEs) in real time is crucial because the strong geomagnetic storms they generate can damage, for example, satellites and electrical devices. With its near-real-time availability, STEREO/HI beacon data is the perfect candidate for early forecasting of CMEs. However, previous work concluded that CME arrival prediction based on beacon data could not achieve the same accuracy as with high-resolution science data due to data gaps and lower quality. We present our novel machine-learning pipeline, "Beacon2Science", which bridges the gap between beacon and science data to improve CME tracking. Through this pipeline, we first enhance the quality (signal-to-noise ratio and spatial resolution) of beacon data. We then increase the time resolution of enhanced beacon images through learned interpolation to match science data's 40-minute resolution. We maximize information coherence between consecutive frames with adapted model architecture and loss functions through the different steps. The improved beacon images are comparable to science data, showing better CME visibility than the original beacon data. Furthermore, we compare CMEs tracked in beacon, enhanced beacon, and science images. The tracks extracted from enhanced beacon data are closer to those from science images, with a mean average error of $\sim 0.5^\circ$ of elongation compared to $1^\circ$ with original beacon data. The work presented in this paper paves the way for its application to forthcoming missions such as Vigil and PUNCH.
comment: 25 pages, 11 figures, 1 table, submitted to AGU Space Weather on 14th March 2025, accepted 05 June 2025, published 15 July 2025
Survival Modeling from Whole Slide Images via Patch-Level Graph Clustering and Mixture Density Experts
We propose a modular framework for predicting cancer-specific survival directly from whole slide pathology images (WSIs). The framework consists of four key stages designed to capture prognostic and morphological heterogeneity. First, a Quantile-Based Patch Filtering module selects prognostically informative tissue regions through quantile thresholding. Second, Graph-Regularized Patch Clustering models phenotype-level variations using a k-nearest-neighbor graph that enforces spatial and morphological coherence. Third, Hierarchical Feature Aggregation learns both intra- and inter-cluster dependencies to represent multiscale tumor organization. Finally, an Expert-Guided Mixture Density Model estimates complex survival distributions via Gaussian mixtures, enabling fine-grained risk prediction. Evaluated on the TCGA-LUAD, TCGA-KIRC, and TCGA-BRCA cohorts, our model achieves concordance indices of 0.653, 0.719, and 0.733, respectively, surpassing existing state-of-the-art approaches in survival prediction from WSIs.
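The concordance index quoted above measures how well predicted risks rank patients by survival time. The sketch below implements the common Harrell's C definition with right-censoring; whether the authors use exactly this variant, and how a scalar risk is derived from their mixture density output, is not stated in the abstract, so both are assumptions here.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up time for each subject
    events: 1 if the event (death) was observed, 0 if the subject was censored
    risks:  predicted risk scores (higher = expected to fail sooner)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: 5 comparable pairs, 4 of them ranked correctly.
c_index = concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1],
                            risks=[0.9, 0.7, 0.8, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.653 to 0.733 range in context.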
Event Stream Filtering via Probability Flux Estimation
Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. We also present the Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination, for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.
ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones using techniques inspired by Graph Neural Network pooling. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. The model is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 57% and 87% for RMSE Rotation and Translation, respectively.
Learning from the Right Patches: A Two-Stage Wavelet-Driven Masked Autoencoder for Histopathology Representation Learning
Whole-slide images are central to digital pathology, yet their extreme size and scarce annotations make self-supervised learning essential. Masked Autoencoders (MAEs) with Vision Transformer backbones have recently shown strong potential for histopathology representation learning. However, conventional random patch sampling during MAE pretraining often includes irrelevant or noisy regions, limiting the model's ability to capture meaningful tissue patterns. In this paper, we present a lightweight and domain-adapted framework that brings structure and biological relevance into MAE-based learning through a wavelet-informed patch selection strategy. WISE-MAE applies a two-step coarse-to-fine process: wavelet-based screening at low magnification to locate structurally rich regions, followed by high-resolution extraction for detailed modeling. This approach mirrors the diagnostic workflow of pathologists and improves the quality of learned representations. Evaluations across multiple cancer datasets, including lung, renal, and colorectal tissues, show that WISE-MAE achieves competitive representation quality and downstream classification performance while maintaining efficiency under weak supervision.
UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework to solve TAL, SED and AVEL tasks together to facilitate holistic video understanding. However, this is challenging since different tasks emphasize distinct event characteristics and there are substantial disparities in existing task-specific datasets (size/domain/duration), which leads to unsatisfactory results when applying a naive multi-task strategy. To tackle the problem, we introduce UniAV, a Unified Audio-Visual perception network to effectively learn and share mutually beneficial knowledge across tasks and modalities. Concretely, we propose a unified audio-visual encoder to derive generic representations from multiple temporal scales for videos from all tasks. Meanwhile, task-specific experts are designed to capture the unique knowledge specific to each task. In addition, instead of using separate prediction heads, we develop a novel unified language-aware classifier by utilizing semantic-aligned task prompts, enabling our model to flexibly localize various instances across tasks with an impressive open-set ability to localize novel categories. Extensive experiments demonstrate that UniAV, with its unified architecture, significantly outperforms both single-task models and the naive multi-task baseline across all three tasks. It achieves superior or on-par performance compared to the state-of-the-art task-specific methods on ActivityNet 1.3, DESED and UnAV-100 benchmarks.
comment: Published on IEEE TPAMI
UINO-FSS: Unifying Representation Learning and Few-shot Segmentation via Hierarchical Distillation and Mamba-HyperCorrelation
Few-shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories using only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual-branch architectures that combine pre-trained encoders to leverage complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Achieving this is, however, challenging due to the misalignment between class-agnostic segmentation capabilities and fine-grained discriminative representations. To this end, we present UINO-FSS, a novel framework built on the key observation that early-stage DINOv2 features exhibit distribution consistency with SAM's output embeddings. This consistency enables the integration of both models' knowledge into a single-encoder architecture via coarse-to-fine multimodal distillation. In particular, our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta-visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross-model distillation, we effectively transfer SAM's knowledge into the segmenter, further enhanced by Mamba-based 4D correlation mining on support-query pairs. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that UINO-FSS achieves new state-of-the-art results under the 1-shot setting, with mIoU of 80.6 (+3.8%) on PASCAL-5$^i$ and 64.5 (+4.1%) on COCO-20$^i$, demonstrating the effectiveness of our unified approach.
Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI MICCAI 2025
Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results demonstrate the effectiveness of the proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.
comment: Accepted at CARE @ MICCAI 2025
Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery
Accurate automatic screening of minors in unconstrained images requires models robust to distribution shift and resilient to the under-representation of children in public datasets. To address these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). This design focuses on the legally critical age range while keeping the backbone frozen. Class imbalance is mitigated through an $α$-reweighted focal loss and age-balanced mini-batch sampling, while an age gap removes ambiguous samples near thresholds. Evaluation is conducted on our new Overall Underage Benchmark (303k cleaned training images, 110k test images), defining both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild-shifts test "ASWIFT-20k" of 20k images, stressing extreme poses ($>$45°), expressions, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" reduces the mean absolute error on ASORES-39k from 4.175 y (age-only baseline) to 4.068 y and improves under-18 detection from an F2 score of 0.801 to 0.857 at 1% false-adult rate. On ASWIFT-20k, the same configuration nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, demonstrating robustness to domain shift.
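As a rough illustration of the two imbalance-handling ingredients named in the abstract above, the sketch below shows a standard α-weighted binary focal loss and a simple age-gap filter. The values of `alpha`, `gamma`, and `gap`, the function names, and the hard-removal rule for the gap are assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard alpha-weighted binary focal loss for a single sample.

    p: predicted probability of the positive class (e.g. "underage").
    y: ground-truth label, 0 or 1.
    alpha, gamma: illustrative defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    # (1 - p_t)^gamma down-weights samples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def keep_sample(age, threshold, gap=1.0):
    """Hypothetical age-gap rule: drop samples within `gap` years of a threshold."""
    return abs(age - threshold) >= gap


if __name__ == "__main__":
    print(focal_loss(0.9, 1), focal_loss(0.6, 1))
    print(keep_sample(17.5, 18), keep_sample(15.0, 18))
```

With `gamma=0` and `alpha=0.5` the loss reduces to half of the ordinary cross-entropy, which is a convenient sanity check.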
Capture Stage Matting: Challenges, Approaches, and Solutions for Offline and Real-Time Processing
Capture stages are high-end sources of state-of-the-art recordings for downstream applications in movies, games, and other media. One crucial step in almost all pipelines is matting, i.e., separating captured performances from the background. While common matting algorithms deliver remarkable performance in other applications like teleconferencing and mobile entertainment, we found that they struggle significantly with the peculiarities of capture stage content. The goal of our work is to share insights into those challenges as a curated list of these characteristics along with a constructive discussion for proactive intervention and present a guideline to practitioners for an improved workflow to mitigate unresolved challenges. To this end, we also demonstrate an efficient pipeline to adapt state-of-the-art approaches to such custom setups without the need for extensive annotations, both offline and real-time. For an objective evaluation, we introduce a validation methodology using a state-of-the-art diffusion model to demonstrate the benefits of our approach.
Visual Odometry with Transformers
Despite the rapid development of large 3D models, classical optimization-based approaches still dominate the field of visual odometry (VO). Thus, current approaches to VO heavily rely on camera parameters and many handcrafted components, most of which involve complex bundle adjustment and feature-matching processes. Although this reliance is largely overlooked in the literature, we find it problematic in terms of both (1) speed, as performing bundle adjustment requires a significant amount of time, and (2) scalability, as handcrafted components struggle to learn from large-scale training data. In this work, we introduce a simple yet efficient architecture, Visual Odometry Transformer (VoT), that formulates monocular visual odometry as a direct relative pose regression problem. Our approach streamlines the monocular visual odometry pipeline in an end-to-end manner, effectively eliminating the need for handcrafted components such as bundle adjustment, feature matching, or camera calibration. We show that VoT is up to 4 times faster than traditional approaches, yet with competitive or better performance. Compared to recent 3D foundation models, VoT runs 10 times faster with strong scaling behavior in terms of both model sizes and training data. Moreover, VoT generalizes well in both low-data regimes and previously unseen scenarios, reducing the gap between optimization-based and end-to-end approaches.
Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling AAAI 2026
Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
comment: 10 pages, 6 figures, accepted by AAAI 2026
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is the most popular choice of method for policy optimization. While effective in terms of performance, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some computational complexities such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high variance and sample inefficiency. While the variance of REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
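The two ingredients the LOOP abstract combines can be sketched in a few lines: a leave-one-out baseline over several samples drawn for the same prompt, and the PPO clipped surrogate applied to an importance ratio. This is a minimal sketch under assumptions, not the authors' implementation; the function names and the `clip_eps` default are illustrative, and a real diffusion setup would compute the ratio per denoising step.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each sample relative to the mean reward of the others.

    rewards: rewards of K >= 2 samples generated for the same prompt.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out baseline needs at least two samples")
    total = sum(rewards)
    # Baseline for sample i is the mean of the remaining K - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (to be maximized).

    ratio: new-policy probability divided by old-policy probability.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    advantages = leave_one_out_advantages([1.0, 2.0, 3.0])
    print(advantages)
    print([clipped_surrogate(1.5, a) for a in advantages])
```

The leave-one-out advantages always sum to zero within a prompt group, so the baseline only shifts rewards relative to sibling samples.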
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
comment: Code, data and leaderboard: https://github.com/PKU-YuanGroup/WISE
FireCastNet: Earth-as-a-Graph for Seasonal Fire Prediction
With climate change intensifying fire weather conditions globally, accurate seasonal wildfire forecasting has become critical for disaster preparedness and ecosystem management. We introduce FireCastNet, a novel deep learning architecture that combines 3D convolutional encoding with GraphCast-based Graph Neural Networks (GNNs) to model complex spatio-temporal dependencies for global wildfire prediction. Our approach leverages the SeasFire dataset, a comprehensive multivariate Earth system datacube containing climate, vegetation, and human-related variables, to forecast burned area patterns up to six months in advance. FireCastNet treats the Earth as an interconnected graph, enabling it to capture both local fire dynamics and long-range teleconnections that influence wildfire behavior across different spatial and temporal scales. Through comprehensive benchmarking against state-of-the-art models including GRU, Conv-GRU, Conv-LSTM, U-TAE, and TeleViT, we demonstrate that FireCastNet achieves superior performance in global burned area forecasting, with particularly strong results in fire-prone regions such as Africa, South America, and Southeast Asia. Our analysis reveals that longer input time-series significantly improve prediction robustness, while spatial context integration enhances model performance across extended forecasting horizons. Additionally, we implement local area modeling techniques that provide enhanced spatial resolution and accuracy for region-specific predictions. These findings highlight the importance of modeling Earth system interactions for long-term wildfire prediction.
GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction
Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://sobeymil.github.io/GeoMVD.com.
Gaussian Mapping for Evolving Scenes
Mapping systems with novel view synthesis (NVS) capabilities, most notably 3D Gaussian Splatting (3DGS), are widely used in computer vision and across various applications, including augmented reality, robotics, and autonomous driving. However, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene-adaptation mechanism that continuously updates 3DGS to reflect the latest changes. Since maintaining consistency remains challenging due to stale observations that disrupt the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We thoroughly evaluate Gaussian Mapping for Evolving Scenes on both synthetic and real-world datasets, achieving a 29.7% improvement in PSNR and a 3 times improvement in L1 depth error over the most competitive baseline.
Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving
Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It offers significantly greater stealth and lethality than existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.
Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation AAAI2026
Despite the remarkable advancements of Large Vision-Language Models (LVLMs), their mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development of downstream tasks, such as hallucination mitigation. To address this limitation, we introduce the Fine-grained Cross-modal Causal Tracing (FCCT) framework, which systematically quantifies the causal effects on visual object perception. FCCT conducts fine-grained analysis covering the full range of visual and textual tokens, three core model components including multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states, across all decoder layers. Our analysis is the first to demonstrate that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs exhibit a three-stage hierarchical progression for the storage and transfer of visual object representations. Building on these insights, we propose Intermediate Representation Injection (IRI), a training-free inference-time technique that reinforces visual object information flow by precisely intervening on cross-modal representations at specific components and layers, thereby enhancing perception and mitigating hallucination. Consistent improvements across five widely used benchmarks and LVLMs demonstrate that IRI achieves state-of-the-art performance, while preserving inference speed and other foundational performance.
comment: AAAI2026 Oral
Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis
Surgical video segmentation is critical for AI to interpret spatial-temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero-shot capability of SAM2 on 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (points, boxes, mask) and finetuning (dense, sparse) strategies, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero-shot adaptability in structured scenarios (e.g., instrument segmentation, multi-organ segmentation, and scene segmentation), its performance varies under dynamic surgical conditions, highlighting gaps in handling temporal coherence and domain-specific artifacts. These results highlight future pathways to adaptive data-efficient solutions for the surgical data science field.
UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification
Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. At present, multi-modal object ReID faces two core challenges: (1) learning robust features under fine-grained local noise caused by occlusion, frame loss, and other disruptions; and (2) effectively integrating heterogeneous modalities to enhance multi-modal representation. To address the above challenges, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code is available at https://github.com/wanxixi11/UGG-ReID.
DoGCLR: Dominance-Game Contrastive Learning Network for Skeleton-Based Action Recognition
Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly, and adopt a first-in-first-out (FIFO) queue to store negative samples, which leads to motion information loss and non-optimal negative sample selection. To address these challenges, this paper proposes Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, where both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while maintaining semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set, DoGCLR achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs comparably to the state-of-the-art methods and achieves a 1.9% higher accuracy on Part II, highlighting its strong robustness on more challenging scenarios.
comment: 14 pages, 7 figures, journal
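The DoGCLR abstract above contrasts a FIFO negative queue with a memory bank that keeps high-entropy (hard) negatives and replaces low-entropy (weak) ones. The sketch below shows one simple way such a replacement policy could be organized with a min-heap. It is only an illustration under assumptions: the class name `EntropyBank`, the `offer` method, and the use of Shannon entropy over a per-sample probability vector are hypothetical, since the abstract does not say how entropy is computed or how replacement is scheduled.

```python
import heapq
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyBank:
    """Fixed-size bank that evicts the lowest-entropy entry first (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # min-heap of (entropy, insertion_id, sample)
        self._next_id = 0  # tie-breaker so samples themselves are never compared

    def offer(self, sample, probs):
        """Try to store `sample`; return True if it was kept."""
        entry = (entropy(probs), self._next_id, sample)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if entry[0] > self._heap[0][0]:
            # New sample is harder than the weakest stored one: replace it.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def samples(self):
        return [sample for _, _, sample in self._heap]


if __name__ == "__main__":
    bank = EntropyBank(capacity=2)
    for name, probs in [("a", [0.5, 0.5]), ("b", [0.9, 0.1]),
                        ("c", [0.99, 0.01]), ("d", [0.6, 0.4])]:
        print(name, bank.offer(name, probs))
    print(sorted(bank.samples()))
```

Unlike a FIFO queue, eviction here depends on how informative an entry is rather than on how long it has been stored.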
Point Cloud Quantization through Multimodal Prompting for 3D Understanding AAAI 2026
Vector quantization has emerged as a powerful tool in large-scale multimodal models, unifying heterogeneous representations through discrete token encoding. However, its effectiveness hinges on robust codebook design. Current prototype-based approaches relying on trainable vectors or clustered centroids fall short in representativeness and interpretability, even as multimodal alignment demonstrates its promise in vision-language models. To address these limitations, we propose a simple multimodal prompting-driven quantization framework for point cloud analysis. Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics through many-to-one contrastive alignment, naturally serving as robust prototype priors; and 2) Multimodal prompts enable adaptive refinement of these prototypes, effectively mitigating vision-language semantic gaps. The framework introduces a dual-constrained quantization space, enforced by compactness and separation regularization, which seamlessly integrates visual and prototype features, resulting in hybrid representations that jointly encode geometric and semantic information. Furthermore, we employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity. Extensive experiments on the ModelNet40 and ScanObjectNN datasets clearly demonstrate the superior effectiveness of the proposed method.
comment: Accepted by AAAI 2026. 11 pages, 7 figures
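The point cloud quantization abstract above relies on Gumbel-Softmax relaxation to make a discrete codebook choice differentiable. A dependency-free sketch of the sampling step follows; the function names, the temperature `tau`, and the example logits are assumptions for illustration. In a deep learning framework the hard one-hot vector would typically be paired with a straight-through gradient, which plain Python cannot show.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample over the given logits.

    tau: temperature; smaller values push the sample closer to one-hot.
    rng: any object with a random() method returning floats in [0, 1).
    """
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # guard against log(0)
        g = -math.log(-math.log(u))       # Gumbel(0, 1) noise
        perturbed.append((logit + g) / tau)
    m = max(perturbed)                    # subtract max for numerical stability
    exps = [math.exp(x - m) for x in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def hard_one_hot(soft):
    """Discretize a soft sample by placing all mass on its largest entry."""
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


if __name__ == "__main__":
    y = gumbel_softmax([2.0, 0.5, 0.1], tau=0.5, rng=random.Random(0))
    print(y, hard_one_hot(y))
```

Lowering `tau` makes the soft sample sparser, which matches the abstract's stated goal of keeping the quantization sparse while remaining differentiable.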
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials AAAI 2026
Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10% on multiple benchmarks, showing the effectiveness of the GUI-Net dataset and underscoring the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.
comment: AAAI 2026
Uni-Hema: Unified Model for Digital Hematopathology
Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Existing approaches, whether single-task, vision-language, WSI-optimized, or single-cell hematology models, share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, masked language modeling and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to models trained on a single task and a single dataset, across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.
GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation AAAI'26
Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
comment: Accepted at AAAI'26
Multi-source-free Domain Adaptation via Uncertainty-aware Adaptive Distillation
Source-free domain adaptation (SFDA) alleviates the domain discrepancy among data obtained from different domains without accessing the source data, out of concern for data privacy. However, existing conventional SFDA methods face inherent limitations in medical contexts, where medical data are typically collected from multiple institutions using various equipment. To address this problem, we propose a simple yet effective method, named Uncertainty-aware Adaptive Distillation (UAD) for the multi-source-free unsupervised domain adaptation (MSFDA) setting. UAD aims to perform well-calibrated knowledge distillation from (i) model level to deliver coordinated and reliable base model initialisation and (ii) instance level via model adaptation guided by high-quality pseudo-labels, thereby obtaining a high-performance target domain model. To verify its general applicability, we evaluate UAD on two image-based diagnosis benchmarks among two multi-centre datasets, where our method shows a significant performance gain compared with existing works. The code is available at https://github.com/YXSong000/UAD.
comment: Accepted by ISBI 2024. Code available at https://github.com/YXSong000/UAD
RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration
Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of Mamba, which effectively reduces artifacts and restores intricate details. FIA, on the other hand, ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context it provides. Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning AAAI2026
Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
comment: AAAI2026 Oral, Webpage: https://garygutc.github.io/UniME-v2/
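The soft-label idea in the abstract above (aligning the model's similarity scores with the MLLM judge's soft matching scores) can be illustrated with a minimal sketch. The abstract does not name the divergence used, so the KL divergence, the softmax normalisation, the `temperature` parameter, and both function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_alignment_loss(similarities, judge_scores, temperature=1.0):
    """KL(p || q) for one query over its candidates.

    p: distribution from the judge's soft matching scores (hypothetical input).
    q: distribution from the embedding model's query-candidate similarities.
    """
    p = softmax(judge_scores, temperature)
    q = softmax(similarities, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the two score vectors induce the same distribution over candidates and grows as the embedding similarities disagree with the judge, which is what replaces a rigid one-to-one positive label in this sketch.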
Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images WACV2026
Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modeling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and processed datasets are available at https://github.com/YXSong000/Gene-DML.
comment: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision 2026 (WACV2026). Code and data available at https://github.com/YXSong000/Gene-DML
HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology AAAI 2026
Spatial transcriptomics (ST) bridges gene expression and tissue morphology but faces clinical adoption barriers due to technical complexity and prohibitive costs. While computational methods predict gene expression from H&E-stained whole-slide images (WSIs), existing approaches often fail to capture the intricate biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual information from surrounding tissue. To overcome these limitations, we propose HiFusion, a novel deep learning framework that integrates two complementary components. First, we introduce the Hierarchical Intra-Spot Modeling module that extracts fine-grained morphological representations through multi-resolution sub-patch decomposition, guided by a feature alignment loss to ensure semantic consistency across scales. Concurrently, we present the Context-aware Cross-scale Fusion module, which employs cross-attention to selectively incorporate biologically relevant regional context, thereby enhancing representational capacity. This architecture enables comprehensive modeling of both cellular-level features and tissue microenvironmental cues, which are essential for accurate gene expression prediction. Extensive experiments on two benchmark ST datasets demonstrate that HiFusion achieves state-of-the-art performance across both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion's potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.
comment: Accepted to AAAI 2026. 7 pages (main text), 12 pages total including references and supplementary material. 6 figures
Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch
Quantization of neural networks enables inference with lower compute and memory requirements. Previous work in quantization lacks two important aspects that this work provides. First, almost all previous work in quantization uses a non-differentiable approach, and for learning the derivative is usually set manually in backpropagation, which makes the learning ability of the algorithm questionable. Our approach is not just differentiable; we also provide a proof of convergence of our approach to the optimal neural network. Second, previous work in shift/logarithmic quantization has either avoided activation quantization alongside weight quantization or achieved lower accuracy. Learning logarithmically quantized values of the form $2^n$ requires a quantization function that can scale to more than 1-bit quantization, which is another benefit of our quantization: it provides $n$-bit quantization as well. When tested on an image classification task using the ImageNet dataset and ResNet18 with weight-only quantization, our approach loses less than 1 percent accuracy relative to full precision while taking only 15 epochs to train with shift-bit quantization. It also achieves accuracy comparable to SOTA approaches for both weight and activation quantization using shift-bit quantization in 15 training epochs, with a slightly higher inference cost (only more CPU instructions) than 1-bit quantization (without logarithmic quantization) and without requiring any higher-precision multiplication.
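As background for the shift/logarithmic quantization discussed above, the sketch below shows why weights of the form $2^n$ remove the need for multiplication: an integer activation can be scaled with a bit shift. It uses plain round-to-nearest in the log2 domain, which is not the differentiable quantizer the abstract describes; the function names and the exponent range (`min_exp`, `max_exp`) are hypothetical.

```python
import math


def quantize_to_power_of_two(w, min_exp=-7, max_exp=0):
    """Map a weight to a signed power of two, rounding in the log2 domain.

    Zero stays zero; exponents are clipped to [min_exp, max_exp].
    """
    if w == 0.0:
        return 0.0
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return math.copysign(2.0 ** exp, w)


def shift_multiply(x, exp):
    """Multiply the integer x by 2**exp using bit shifts only."""
    return x << exp if exp >= 0 else x >> -exp
```

For instance, a weight of 0.3 rounds to 2^-2, so multiplying an integer activation by it reduces to a right shift by two bits.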
H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction
Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment study. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action
Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes (covering 500 objects and containers) with collisions, formulates tidying as an "Action (Object, Container)" list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k navigation trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.
Clothing agnostic Pre-inpainting Virtual Try-ON
With the development of deep learning technology, virtual try-on technology has developed important application value in the fields of e-commerce, fashion, and entertainment. The recently proposed Leffa technology has addressed the texture distortion problem of diffusion-based models, but there are limitations in that the bottom detection inaccuracy and the existing clothing silhouette persist in the synthesis results. To solve this problem, this study proposes CaP-VTON (Clothing Agnostic Pre-Inpainting Virtual Try-On). CaP-VTON integrates DressCode-based multi-category masking and Stable Diffusion-based skin inflation preprocessing; in particular, a generated skin module was introduced to solve skin restoration problems that occur when long-sleeved images are converted to short-sleeved or sleeveless ones, introducing a preprocessing structure that improves the naturalness and consistency of full-body clothing synthesis, and allowing the implementation of high-quality restoration considering human posture and color. As a result, CaP-VTON achieved 92.5%, which is 15.4% better than Leffa, in short-sleeved synthesis accuracy, and consistently reproduced the style and shape of the reference clothing in visual evaluation. These structures maintain model-agnostic properties and are applicable to various diffusion-based virtual inspection systems; they can also contribute to applications that require high-precision virtual wearing, such as e-commerce, custom styling, and avatar creation.
comment: GitHub: https://github.com/DevChoco/CAP-VTON
PointVDP: Learning View-Dependent Projection by Fireworks Rays for 3D Point Cloud Segmentation
In this paper, we propose view-dependent projection (VDP) to facilitate point cloud segmentation, designing efficient 3D-to-2D mapping that dynamically adapts to the spatial geometry from view variations. Existing projection-based methods leverage view-independent projection in complex scenes, relying on straight lines to generate direct rays or upward curves to reduce occlusions. However, their view independence provides projection rays that are limited to pre-defined parameters by human settings, restricting point awareness and failing to capture sufficient projection diversity across different view planes. Although multiple projections per view plane are commonly used to enhance spatial variety, the projected redundancy leads to excessive computational overhead and inefficiency in image processing. To address these limitations, we design a framework of VDP to generate data-driven projections from 3D point distributions, producing highly informative single-image inputs by predicting rays inspired by the adaptive behavior of fireworks. In addition, we construct color regularization to optimize the framework, which emphasizes essential features within semantic pixels and suppresses the non-semantic features within black pixels, thereby maximizing 2D space utilization in a projected image. As a result, our approach, PointVDP, develops lightweight projections at marginal computational cost. Experiments on S3DIS and ScanNet benchmarks show that our approach achieves competitive results, offering a resource-efficient solution for semantic understanding.
comment: This version needs major revision
UNIV: Unified Foundation Model for Infrared and Visible Modalities
Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding WSDM 2026
With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.
comment: Accepted by WSDM 2026. 11 pages, 9 figures
Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image
In this work, we introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that drives high-quality surfaces from the multi-view 2D representations in only about 3 minutes in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.
comment: 21 pages, 19 figures, accepted by TPAMI
Image and Video Processing
Event-based Data Format Standard (EVT+)
Event-based Sensing (EBS) hardware is quickly proliferating while finding a foothold in many commercial, industrial, and defense applications. At present, there are a handful of technologically mature systems which produce data streams with diverse output formats. In the near future it is anticipated that vendors will offer new sensor hardware which could also yield unique data schemas that are not aligned with past efforts. Thus, due to the relatively nascent nature of the technology and its potential for widespread use in a variety of applications, it is an opportune time to define a standard for this class of sensors' output data. The intent of this document is to identify and provide a standard for the collected EBS streaming data. The main objective of the standard is to be sensor agnostic, incorporate some of the current sensor configurations and modalities, and account for the developing configurations and modalities. The intent is also to leave enough placeholders and space in the standard for future variations that may develop as EBS technology matures.
comment: 22 pages
Multimodal Optical Imaging Platform for Quantitative Burn Assessment
Accurate assessment of burn severity at injury onset remains a major clinical challenge due to the lack of objective methods for detecting subsurface tissue damage. This limitation is critical in battlefield and mass-casualty settings, where rapid and reliable evaluation of burn depth is essential for triage and surgical decision-making. We present a multimodal optical imaging framework that establishes the foundation for a compact, low-size, weight, and power (low-SWaP) field-deployable device for quantitative burn assessment. The system integrates broadband hyperspectral imaging (VSWIR, 400 -- 2100 nm) with laser speckle contrast imaging to jointly evaluate biochemical composition and microvascular perfusion. Using short-wave infrared (SWIR, >1000 nm) wavelengths, we developed and validated novel deep-tissue parameters linked to water, lipid, and collagen absorption features that enhance burn-tissue separability and burn severity classification. We implemented and validated unsupervised learning methods for spectral feature extraction, band down-selection, and clustering against histology, establishing a foundation for a rugged, data-driven device for early quantitative burn evaluation in austere environments.
Data-driven Prediction of Species-Specific Plant Responses to Spectral-Shifting Films from Leaf Phenotypic and Photosynthetic Traits
The application of spectral-shifting films (SF) in greenhouses to shift green light to red light has shown variable growth responses across crop species. However, the yield enhancement of crops under altered light quality is related to the collective effects of the specific biophysical characteristics of each species. Considering only one attribute of a crop has limitations in understanding the relationship between sunlight quality adjustments and crop growth performance. Therefore, this study aims to comprehensively link multiple plant phenotypic traits and daily light integral, considering the physiological responses of crops, to their growth outcomes under SF using artificial intelligence. Between 2021 and 2024, various leafy, fruiting, and root crops were grown in greenhouses covered with either PEF or SF, and leaf reflectance, leaf mass per area, chlorophyll content, daily light integral, and light saturation point were measured from the plants cultivated in each condition. 210 data points were collected, but there was insufficient data to train deep learning models, so a variational autoencoder was used for data augmentation. Most crop yields showed an average increase of 22.5% under SF. These data were used to train several models, including logistic regression, decision tree, random forest, XGBoost, and feedforward neural network (FFNN), aiming to classify whether SF application had a significant effect on yield. The FFNN achieved a high classification accuracy of 91.4% on a test dataset that was not used for training. This study provides insight into the complex interactions between leaf phenotypic and photosynthetic traits, environmental conditions, and solar spectral components by improving the ability to predict solar spectral shift effects using SF.
Image Denoising Using Transformed L1 (TL1) Regularization via ADMM
Total variation (TV) regularization is a classical tool for image denoising, but its convex $\ell_1$ formulation often leads to staircase artifacts and loss of contrast. To address these issues, we introduce the Transformed $\ell_1$ (TL1) regularizer applied to image gradients. In particular, we develop a TL1-regularized denoising model and solve it using the Alternating Direction Method of Multipliers (ADMM), featuring a closed-form TL1 proximal operator and an FFT-based image update under periodic boundary conditions. Experimental results demonstrate that our approach achieves superior denoising performance, effectively suppressing noise while preserving edges and enhancing image contrast.
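For readers unfamiliar with the penalty named above, the standard Transformed $\ell_1$ function and one plausible ADMM splitting are sketched below. The abstract does not state the exact objective, so the fidelity weight $\lambda$, the penalty parameter $\beta$, the shape parameter $a$, the auxiliary variable $d$, and the scaled dual variable $b$ are assumptions for illustration; the closed-form proximal operator itself is not reproduced here.

```latex
% Standard TL1 penalty with shape parameter a > 0
\rho_a(x) = \frac{(a+1)\,|x|}{a + |x|},
\qquad
\lim_{a \to 0^+} \rho_a(x) = \mathbf{1}_{\{x \neq 0\}},
\qquad
\lim_{a \to \infty} \rho_a(x) = |x| .

% Assumed denoising model for a noisy image f, with splitting d = \nabla u
\min_{u,\,d} \; \frac{\lambda}{2}\,\|u - f\|_2^2 + \sum_i \rho_a(d_i)
\quad \text{subject to} \quad d = \nabla u .

% Scaled-form ADMM iterations
(\lambda I + \beta \nabla^{\top}\nabla)\, u^{k+1} = \lambda f + \beta \nabla^{\top}(d^{k} - b^{k}),
\qquad
d^{k+1} = \operatorname{prox}_{\rho_a/\beta}\!\left(\nabla u^{k+1} + b^{k}\right),
\qquad
b^{k+1} = b^{k} + \nabla u^{k+1} - d^{k+1} .
```

Under periodic boundary conditions, $\nabla^{\top}\nabla$ is diagonalised by the discrete Fourier transform, which is why the $u$-update can be solved with FFTs as the abstract notes.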
UniUltra: Interactive Parameter-Efficient SAM2 for Universal Ultrasound Segmentation
The Segment Anything Model 2 (SAM2) demonstrates remarkable universal segmentation capabilities on natural images. However, its performance on ultrasound images is significantly degraded due to domain disparities. This limitation raises two critical challenges: how to efficiently adapt SAM2 to ultrasound imaging while maintaining parameter efficiency, and how to deploy the adapted model effectively in resource-constrained clinical environments. To address these issues, we propose UniUltra for universal ultrasound segmentation. Specifically, we first introduce a novel context-edge hybrid adapter (CH-Adapter) that enhances fine-grained perception across diverse ultrasound imaging modalities while achieving parameter-efficient fine-tuning. To further improve clinical applicability, we develop a deep-supervised knowledge distillation (DSKD) technique that transfers knowledge from the large image encoder of the fine-tuned SAM2 to a super lightweight encoder, substantially reducing computational requirements without compromising performance. Extensive experiments demonstrate that UniUltra outperforms state-of-the-art methods with superior generalization capabilities. Notably, our framework achieves competitive performance using only 8.91% of SAM2's parameters during fine-tuning, and the final compressed model reduces the parameter count by 94.08% compared to the original SAM2, making it highly suitable for practical clinical deployment. The source code is available at https://github.com/xq141839/UniUltra.
DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions
Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are unavoidably affected by image degradations such as noise and blur. Many such degradations also contribute to a loss of image contrast, which becomes especially pronounced in deeper regions of thick samples. Today, the best-performing methods for increasing image quality are based on Deep Learning approaches, which typically require ground truth (GT) data during training. Our inability to counteract blurring and contrast loss when imaging deep into samples prevents the acquisition of such clean GT data. The fact that the forward process of blurring and contrast loss deep into tissue can be modeled allowed us to propose a new method that circumvents the problem of unobtainable GT data. To this end, we first synthetically degraded the quality of microscopy images even further by using an approximate forward model for deep tissue image degradations. Then we trained a neural network that learned the inverse of this degradation function from our generated pairs of raw and degraded images. We demonstrated that networks trained in this way can be used out-of-distribution (OOD) to improve the quality of less severely degraded images, e.g. the raw data imaged in a microscope. Since the absolute level of degradation in such microscopy images can be stronger than the additional degradation introduced by our forward model, we also explored the effect of iterative predictions. Here, we observed that in each iteration the measured image contrast kept improving while detailed structures in the images were increasingly removed. Therefore, depending on the desired downstream analysis, a balance between contrast improvement and retention of image details has to be found.
comment: 8 pages, 7 figures, 1 table
The Role of Radiographic Knee Alignment in Total Knee Replacement Outcomes and Opportunities for Artificial Intelligence-Driven Assessment
Knee osteoarthritis (OA) is one of the most widespread and burdensome health problems [1-4]. Total knee replacement (TKR) may be offered as treatment for end-stage knee OA. Nevertheless, TKR is an invasive procedure involving prosthesis implantation at the knee joint, and around 10% of patients are dissatisfied following TKR [5,6]. Dissatisfaction is often assessed through patient-reported outcome measures (PROMs) [7], which are usually completed by patients and assessed by health professionals to evaluate the condition of TKR patients. In clinical practice, predicting poor TKR outcomes in advance could help optimise patient selection and improve management strategies. Radiographic knee alignment is an important biomarker for predicting TKR outcomes and long-term joint health. Abnormalities such as femoral or tibial deformities can directly influence surgical planning, implant selection, and postoperative recovery [8,9]. Traditional alignment measurement is manual, time-consuming, and requires long-leg radiographs, which are not always undertaken in clinical practice. Instead, standard anteroposterior (AP) knee radiographs are often the main imaging modality. Automated methods for alignment assessment in standard knee radiographs are potentially clinically valuable for improving efficiency in the knee OA treatment pathway.
TVC: Tokenized Video Compression with Ultra-Low Bit Rate
Tokenized visual representations have shown promise in image compression, yet their extension to video remains underexplored due to the challenges posed by complex temporal dynamics and stringent bit rate constraints. In this paper, we present tokenized video compression (TVC), a token-based dual-stream framework designed to operate effectively at ultra-low bit rates. TVC leverages the Cosmos video tokenizer to extract both discrete and continuous token streams. The discrete tokens are partially masked using a strategic masking scheme and then compressed losslessly with a discrete checkerboard context model to reduce transmission overhead. The masked tokens are reconstructed by a decoder-only Transformer with spatiotemporal token prediction. In parallel, the continuous tokens are quantized and compressed using a continuous checkerboard context model, providing complementary continuous information at ultra-low bit rates. At the decoder side, the two streams are fused with a ControlNet-based multi-scale integration module, ensuring high perceptual quality alongside stable fidelity in reconstruction. Overall, this work illustrates the practicality of tokenized video compression and points to new directions for semantics-aware, token-native approaches.
Constructed Realities? Technical and Contextual Anomalies in a High-Profile Image
This study offers a forensic assessment of a widely circulated photograph featuring Andrew Mountbatten-Windsor, Virginia Giuffre, and Ghislaine Maxwell, an image that has played a pivotal role in public discourse and legal narratives. By comparing multiple published versions, many inconsistencies emerge, including irregularities in lighting, posture, and physical interaction, which are more compatible with digital compositing than with an unaltered snapshot. Because no original print is available and, crucially, because a verifiable audit trail cannot be demanded for a potentially fabricated image, definitive conclusions remain unattainable. Even so, the technical and contextual anomalies indicate that the photograph may have been deliberately constructed. In the absence of further evidence, it remains an unresolved yet symbolically charged artifact within a complex story of abuse, memory, and contested truth.
comment: 41 pages, 9 figures, 39 references
Electromagnetic Quantitative Inversion for Translationally Moving Targets via Phase Correlation Registration of Back-Projection Images
A novel electromagnetic quantitative inversion scheme for translationally moving targets via phase correlation registration of back-projection (BP) images is proposed. Based on a time division multiplexing multiple-input multiple-output (TDM-MIMO) radar architecture, the scheme first achieves high-precision relative positioning of the target, then applies relative motion compensation to perform iterative inversion on multi-cycle MIMO measurement data, thereby reconstructing the target's electromagnetic parameters. As a general framework compatible with other mainstream inversion algorithms, we exemplify our approach by incorporating the classical cross-correlated contrast source inversion (CC-CSI) into the iterative optimization step of the scheme, resulting in a new algorithm termed RMC-CC-CSI. Numerical and experimental results demonstrate that RMC-CC-CSI offers accelerated convergence, enhanced reconstruction fidelity, and improved noise immunity over conventional CC-CSI for stationary targets, despite increased computational cost.
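The phase correlation principle behind the registration step above can be shown in one dimension: a circular shift in space becomes a linear phase ramp in frequency, so the inverse transform of the normalised cross-power spectrum peaks at the shift. This is a toy sketch using a naive DFT on a 1D signal, not the authors' 2D BP-image pipeline; all function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def phase_correlation_shift(ref, moved):
    """Estimate the circular shift s such that moved[i] == ref[(i - s) % n]."""
    spec_ref, spec_moved = dft(ref), dft(moved)
    cross = []
    for f, g in zip(spec_ref, spec_moved):
        c = g * f.conjugate()
        mag = abs(c)
        # Keep only the phase; skip bins with (near) zero energy.
        cross.append(c / mag if mag > 1e-12 else 0j)
    corr = dft(cross, inverse=True)
    # The correlation surface peaks at the index equal to the shift.
    return max(range(len(corr)), key=lambda i: corr[i].real)
```

In 2D the same steps apply with a 2D FFT, and the location of the peak gives the translation between two images, which is the quantity a motion-compensation step would need.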
Style Content Decomposition-based Data Augmentation for Domain Generalizable Medical Image Segmentation
Due to domain shifts across diverse medical imaging modalities, learned segmentation models often suffer significant performance degradation during deployment. We posit that these domain shifts can generally be categorized into two main components: 1) "style" shifts, referring to global disparities in image properties such as illumination, contrast, and color; and 2) "content" shifts, which involve local discrepancies in anatomical structures. To address the domain shifts in medical image segmentation, we first factorize an image into style codes and content maps, explicitly modeling the "style" and "content" components. Building on this, we introduce a Style-Content decomposition-based data augmentation algorithm (StyCona), which performs augmentation on both the global style and local content of source-domain images, enabling the training of a well-generalized model for domain generalizable medical image segmentation. StyCona is a simple yet effective plug-and-play module that substantially improves model generalization without requiring additional training parameters or modifications to segmentation model architectures. Experiments on cardiac magnetic resonance imaging and fundus photography segmentation tasks, with single and multiple target domains respectively, demonstrate the effectiveness of StyCona and its superiority over state-of-the-art domain generalization methods. The code is available at https://github.com/Senyh/StyCona.
Neuromorphic Split Computing via Optical Inter-Satellite Links
We present a neuromorphic split-computing framework for energy-efficient low-latency inference over optical inter-satellite links. The system partitions a spiking neural network (SNN) between edge and core nodes. To transmit sparse spiking features efficiently, we introduce a lossless channel-block-sparse event representation that exploits inter- and intra-channel sparsity. We employ hierarchical error protection using multi-level forward error correction and cyclic redundancy checks to ensure reliable communication without retransmission. The framework uses end-to-end training with sparsity and clustering regularizers, combined with channel-aware stochastic masking to optimize feature compression and channel robustness jointly. In a proof-of-concept implementation on remote sensing imagery, the framework achieves over $10 \times$ reduction in both computational energy and transmission load compared to conventional dense split systems, with less than 1% accuracy loss. The proposed approach also outperforms address-event-based split SNNs by $3.7 \times$ in transmission efficiency and shows superior resilience to optical pointing jitter.
LEARNER: Contrastive Pretraining for Learning Fine-Grained Patient Progression from Coarse Inter-Patient Labels
Predicting whether a treatment leads to meaningful improvement is a central challenge in personalized medicine, particularly when disease progression manifests as subtle visual changes over time. While data-driven deep learning (DL) offers a promising route to automate such predictions, acquiring large-scale longitudinal data for each individual patient remains impractical. To address this limitation, we explore whether inter-patient variability can serve as a proxy for learning intra-patient progression. We propose LEARNER, a contrastive pretraining framework that leverages coarsely labeled inter-patient data to learn fine-grained, patient-specific representations. Using lung ultrasound (LUS) and brain MRI datasets, we demonstrate that contrastive objectives trained on coarse inter-patient differences enable models to capture subtle intra-patient changes associated with treatment response. Across both modalities, our approach improves downstream classification accuracy and F1-score compared to standard MSE pretraining, highlighting the potential of inter-patient contrastive learning for individualized outcome prediction.
comment: Under review at ISBI 2026 conference
Multimedia
ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing AAAI 2026
Chart editing reduces manual effort in visualization design. Typical benchmarks are limited in data diversity and assume access to complete chart code, which is seldom available in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart code. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. In addition, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size, and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms models of similar and larger scale on chart editing tasks.
comment: Accepted to AAAI 2026 Main Track
Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval
This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. Because the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source describing the visual appearance of a dish, as existing approaches do, introduces a bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one of the confounder sources, and a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show the oracle performance of retrieval on the Recipe1M dataset to be MedR=1 across testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module for debiasing, which is essentially a multi-label ingredient classifier. New state-of-the-art retrieval performance is reported on the Recipe1M dataset.
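For reference, the backdoor adjustment mentioned in the entry above has the following standard form. The symbols are illustrative assumptions rather than the paper's notation: $I$ is the food image, $R$ the recipe relevance judgment, and $z$ ranges over ingredient configurations treated as the confounder.

```latex
P\bigl(R \mid \mathrm{do}(I)\bigr) \;=\; \sum_{z} P\bigl(R \mid I, z\bigr)\, P(z)
```

Compared with the observational quantity $P(R \mid I) = \sum_{z} P(R \mid I, z)\, P(z \mid I)$, the adjustment replaces $P(z \mid I)$ with the prior $P(z)$, which removes the dependence of the confounder on the image.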
An Event-triggered System for Social Persuasion and Danger Alert in Elder Home Monitoring
In this study, both the physical and mental states of elders are considered, and an event-triggered system has been developed to detect three types of events: watch dog, danger notice, and photo link. By adopting GMM background modeling, the motion behavior of visitors and elders can be detected in the watch dog event and danger notice event, respectively. Experiments were set in home scenarios, and 5 families participated in detecting and recording the three types of events from their daily activities. In addition, the captured images were analyzed using an SVM. Because elders often lack technical experience, an intuitive operation resembling a normal daily activity was designed to create communication between elders and relatives via social media.
comment: Accepted in the 35th IPPR Conference on Computer Vision, Graphics, and Image Processing (CVGIP2022)
U2UData+: A Scalable Swarm UAVs Autonomous Flight Dataset for Embodied Long-horizon Tasks AAAI26
Swarm UAV autonomous flight for Embodied Long-Horizon (ELH) tasks is crucial for advancing the low-altitude economy. However, existing methods focus only on specific basic tasks due to dataset limitations, failing in real-world deployment for ELH tasks. ELH tasks are not mere concatenations of basic tasks, requiring handling long-term dependencies, maintaining embodied persistent states, and adapting to dynamic goal shifts. This paper presents U2UData+, the first large-scale swarm UAV autonomous flight dataset for ELH tasks and the first scalable swarm UAV data online collection and algorithm closed-loop verification platform. The dataset is captured by 15 UAVs in autonomous collaborative flights for ELH tasks, comprising 12 scenes, 720 traces, 120 hours, 600 seconds per trajectory, 4.32M LiDAR frames, and 12.96M RGB frames. This dataset also includes brightness, temperature, humidity, smoke, and airflow values covering all flight routes. The platform supports the customization of simulators, UAVs, sensors, flight algorithms, formation modes, and ELH tasks. Through a visual control window, this platform allows users to collect customized datasets through one-click deployment online and to verify algorithms by closed-loop simulation. U2UData+ also introduces an ELH task for wildlife conservation and provides comprehensive benchmarks with 9 SOTA models. U2UData+ can be found at https://fengtt42.github.io/U2UData-2/.
comment: Accepted by AAAI26
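The dataset figures quoted in the U2UData+ entry above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the frame counts are totals over the whole dataset and that every trace lasts exactly 600 seconds; the resulting frame rates are implied aggregates, not values stated in the abstract.

```python
# Figures quoted in the abstract.
traces = 720
seconds_per_trace = 600
lidar_frames = 4_320_000
rgb_frames = 12_960_000

total_seconds = traces * seconds_per_trace   # 432,000 s of flight
total_hours = total_seconds / 3600           # matches the stated 120 hours
lidar_fps = lidar_frames / total_seconds     # implied aggregate LiDAR rate
rgb_fps = rgb_frames / total_seconds         # implied aggregate RGB rate

print(total_hours, lidar_fps, rgb_fps)
```

Under these assumptions the numbers are mutually consistent: 120 hours in total, with implied rates of 10 LiDAR frames and 30 RGB frames per second.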
UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework to solve TAL, SED and AVEL tasks together to facilitate holistic video understanding. However, this is challenging because different tasks emphasize distinct event characteristics and there are substantial disparities in existing task-specific datasets (size/domain/duration). As a result, a naive multi-task strategy leads to unsatisfactory results. To tackle the problem, we introduce UniAV, a Unified Audio-Visual perception network to effectively learn and share mutually beneficial knowledge across tasks and modalities. Concretely, we propose a unified audio-visual encoder to derive generic representations from multiple temporal scales for videos from all tasks. Meanwhile, task-specific experts are designed to capture the unique knowledge specific to each task. In addition, instead of using separate prediction heads, we develop a novel unified language-aware classifier by utilizing semantic-aligned task prompts, enabling our model to flexibly localize various instances across tasks with an impressive open-set ability to localize novel categories. Extensive experiments demonstrate that UniAV, with its unified architecture, significantly outperforms both single-task models and the naive multi-task baseline across all three tasks. It achieves superior or on-par performance compared to the state-of-the-art task-specific methods on the ActivityNet 1.3, DESED and UnAV-100 benchmarks.
comment: Published in IEEE TPAMI
Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling AAAI 2026
Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
comment: 10 pages, 6 figures, accepted by AAAI 2026
To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance
Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.
Computation and Language
Tokenisation over Bounded Alphabets is Hard
Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and point toward approximation algorithms being an important path going forward for tokenisation research.
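To illustrate what "selecting a sequence of merge operations" means in the bottom-up variant described above, the following minimal sketch applies greedy, BPE-style merges to a string over a binary alphabet. It is a heuristic only: by the hardness results in the abstract, finding the optimal merge sequence is NP-complete, and this sketch makes no optimality claim. The helper names (`apply_merge`, `greedy_merges`) and the lexicographic tie-break are assumptions made for the example.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def greedy_merges(text, num_merges):
    """Pick the most frequent adjacent pair at each step (BPE-style heuristic)."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        # Sorting first makes tie-breaking deterministic (lexicographic).
        pair = max(sorted(counts), key=counts.get)
        merges.append(pair)
        tokens = apply_merge(tokens, pair)
    return merges, tokens


merges, tokens = greedy_merges("01010101", 2)
print(merges, tokens)
```

On the input `01010101`, two merges compress eight symbols into the two tokens `0101` and `0101`.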
Think Visually, Reason Textually: Vision-Language Synergy in ARC
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
comment: Code will be released upon acceptance
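The two mechanisms named in the MoDES entry above can be pictured with a small sketch. Everything specific here is an assumption made for illustration, since the abstract does not give formulas: the global layer importance is combined with the routing probability by multiplication, each modality has one scalar threshold, and a top-1 fallback keeps at least one expert per token. The function name `select_experts` is hypothetical.

```python
def select_experts(routing_probs, layer_importance, modality, thresholds):
    """Return the indices of experts kept for one token in one MoE layer.

    routing_probs: local router probabilities for the candidate experts.
    layer_importance: global weight of this layer (assumed scalar).
    modality: "text" or "vision"; selects the modality-specific threshold.
    """
    tau = thresholds[modality]
    # Assumed combination rule: global importance scales local probabilities.
    scores = [layer_importance * p for p in routing_probs]
    kept = [i for i, s in enumerate(scores) if s >= tau]
    if not kept:
        # Assumed fallback: never skip every expert for a token.
        kept = [max(range(len(scores)), key=scores.__getitem__)]
    return kept


thresholds = {"text": 0.2, "vision": 0.1}  # hypothetical values
probs = [0.6, 0.3, 0.1]
print(select_experts(probs, 0.5, "text", thresholds))
print(select_experts(probs, 0.5, "vision", thresholds))
```

With these hypothetical numbers, the text token keeps only expert 0 while the vision token keeps experts 0 and 1, showing how separate thresholds produce different skipping schedules for the two modalities.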
VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
When to Think and When to Look: Uncertainty-Guided Lookback
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
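One way to picture the trigger behind the uncertainty-guided lookback described above is an entropy test on the next-token distribution. The sketch below is an assumption-laden illustration, not the paper's method: the abstract does not specify the uncertainty signal, so Shannon entropy, the threshold value, and the lookback phrase are all hypothetical, and the breadth search component is not shown.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def maybe_lookback(probs, threshold, phrase="Looking back at the image, "):
    """Return a lookback prompt when the model is uncertain, else an empty string."""
    return phrase if token_entropy(probs) > threshold else ""


confident = [0.97, 0.01, 0.01, 0.01]   # hypothetical peaked distribution
uncertain = [0.25, 0.25, 0.25, 0.25]   # hypothetical flat distribution
print(repr(maybe_lookback(confident, threshold=1.0)))
print(repr(maybe_lookback(uncertain, threshold=1.0)))
```

Only the flat distribution exceeds the threshold, so only that decoding step would have the lookback phrase injected.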
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
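The relative improvement quoted in the SRPO entry above follows directly from the two success rates in the abstract. A short check, using a hypothetical helper name `relative_improvement`:

```python
def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


gain = relative_improvement(48.9, 99.2)
print(f"{gain:.1f}%")
```

This gives roughly 102.9%, which rounds to the 103% relative improvement stated in the abstract; the absolute gain is 50.3 percentage points.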
HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning AAAI-2026
Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs achieve writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLM interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
comment: Accepted by AAAI-2026
Computer-Use Agents as Judges for Generative User Interface
Computer-Use Agents (CUAs) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUIs). Yet most GUIs remain designed primarily for humans, prioritizing aesthetics and usability, which forces agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coders) have transformed automatic GUI design. This raises a fundamental question: Can CUAs act as judges to assist Coders in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
comment: Project: https://showlab.github.io/AUI Github: https://github.com/showlab/AUI
Multimodal Evaluation of Russian-language Architectures
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis
The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.
comment: 26 pages, 3 figures
CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search AAAI-2026
Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
comment: AAAI-2026, Oral
LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
Building Robust and Scalable Multilingual ASR for Indian Languages
This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems focus on adapting ASR models to better predict the language and dialect of an utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restrict the use of additional data, and developed multilingual systems from scratch. We present a novel training approach using a Multi-Decoder architecture with a phonemic Common Label Set (CLS) as an intermediate representation. It improved performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space while converting the outputs back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).
NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework
Trained on diverse human-authored texts, Large Language Models (LLMs) have unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG remains difficult due to two main challenges: (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
comment: 13 pages, 9 figures. This work has been submitted to the IEEE for possible publication
DEPO: Dual-Efficiency Preference Optimization for LLM Agents AAAI 2026
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of a longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Moreover, a systematic definition of LLM agent efficiency is still lacking, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
comment: Accepted to AAAI 2026
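The two efficiency notions defined in the DEPO abstract above (tokens per step, and number of steps per task) can be illustrated with a minimal sketch. The trajectory format (a list of per-step token counts), the function names, and the numbers are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean

def step_level_tokens(trajectory):
    """Average number of generated tokens per step (lower is more efficient)."""
    return mean(trajectory)

def trajectory_level_steps(trajectory):
    """Number of steps used to complete the task (lower is more efficient)."""
    return len(trajectory)

def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before

# Hypothetical trajectories: each entry is the token count of one agent step.
verbose = [120, 95, 110, 130, 105]
concise = [60, 50, 55, 45]

token_cut = relative_reduction(step_level_tokens(verbose), step_level_tokens(concise))
step_cut = relative_reduction(trajectory_level_steps(verbose), trajectory_level_steps(concise))
```

Keeping the two quantities separate makes it visible whether a gain comes from shorter responses, fewer steps, or both.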
A Compliance-Preserving Retrieval System for Aircraft MRO Task Search
Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
The Empowerment of Science of Science by Large Language Models: New Tools and Methods
Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new LLM-based methods for research front detection and knowledge graph building.
comment: The manuscript is currently under review at the Journal of Information Science
HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
comment: Preprint. 12 pages
SkyEgg: Joint Implementation Selection and Scheduling for Hardware Synthesis using E-graphs
Hardware synthesis from high-level descriptions remains fundamentally limited by the sequential optimization of interdependent design decisions. Current methodologies, including state-of-the-art high-level synthesis (HLS) tools, artificially separate implementation selection from scheduling, leading to suboptimal designs that cannot fully exploit modern FPGA heterogeneous architectures. Implementation selection is typically performed by ad-hoc pattern matching on operations, a process that does not consider the impact on scheduling. Subsequently, scheduling algorithms operate on fixed selection solutions with inaccurate delay estimates, which misses critical optimization opportunities from appropriately configured FPGA blocks like DSP slices. We present SkyEgg, a novel hardware synthesis framework that jointly optimizes implementation selection and scheduling using the e-graph data structure. Our key insight is that both algebraic transformations and hardware implementation choices can be uniformly represented as rewrite rules within an e-graph, modeling the complete design space of implementation candidates to be selected and scheduled together. First, SkyEgg constructs an e-graph from the input program. It then applies both algebraic and implementation rewrites through equality saturation. Finally, it formulates the joint optimization as a mixed-integer linear programming (MILP) problem on the saturated e-graph. We provide both exact MILP solving and an efficient ASAP heuristic for scalable synthesis. Our evaluation on benchmarks from diverse applications targeting Xilinx Kintex UltraScale+ FPGAs demonstrates that SkyEgg achieves an average speedup of 3.01x over Vitis HLS, with improvements up to 5.22x for complex expressions.
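For readers unfamiliar with the ASAP heuristic mentioned in the SkyEgg abstract above, the following is a minimal sketch of generic as-soon-as-possible scheduling on a fixed dependency graph. It does not model e-graphs or implementation selection, and the operation names and latencies are made up for illustration.

```python
from graphlib import TopologicalSorter

def asap_schedule(deps, latency):
    """Return the earliest start cycle of each operation.

    deps maps an operation to the set of operations it depends on;
    latency maps an operation to its delay in cycles.
    """
    start = {}
    for op in TopologicalSorter(deps).static_order():
        # An operation may start once all of its predecessors have finished.
        start[op] = max((start[p] + latency[p] for p in deps.get(op, ())), default=0)
    return start

# Hypothetical dataflow graph for (a * b) + (c * d).
deps = {"mul1": set(), "mul2": set(), "add": {"mul1", "mul2"}}
latency = {"mul1": 3, "mul2": 3, "add": 1}

schedule = asap_schedule(deps, latency)
total_cycles = max(schedule[op] + latency[op] for op in schedule)
```

In this plain form the latencies are fixed inputs; the abstract's point is that choosing a different implementation for an operation changes those latencies, which is why selection and scheduling interact.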
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
MAPROC at AHaSIS Shared Task: Few-Shot and Sentence Transformer for Sentiment Analysis of Arabic Hotel Reviews
Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared task, which focuses on sentiment analysis of Arabic dialects in the hospitality domain. The dataset comprises hotel reviews written in Moroccan and Saudi dialects, and the objective is to classify the reviewer's sentiment as positive, negative, or neutral. We employed the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique. On the official evaluation set, our system achieved an F1 of 73%, ranking 12th among 26 participants. This work highlights the potential of few-shot learning to address data scarcity in processing nuanced dialectal Arabic text within specialized domains like hotel reviews.
ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing AAAI 2026
Chart editing reduces manual effort in visualization design. Typical benchmarks are limited in data diversity and assume access to complete chart code, which is seldom available in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart code. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. In addition, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on top of ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms models of similar and larger scale on chart editing tasks.
comment: Accepted to AAAI 2026 Main Track
IndicGEC: Powerful Models, or a Measurement Mirage?
In this paper, we report the results of TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task https://github.com/BHASHA-Workshop/IndicGEC2025/ for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models), achieved Rank 4 in Telugu and Rank 2 in Hindi with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task (Tamil, Malayalam and Bangla) and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.
comment: Technical report
M, Toolchain and Language for Reusable Model Compilation
Complex software-driven systems often interleave distributed, concurrent computation processes with physical interactions with the environment. Developing these systems more efficiently and safely can be achieved by employing actionable, software-based models. From a high-level system model, engineers often need to derive multiple specialized models for different purposes, including simulation, deployment, and formal verification. Each of these target models usually relies on its own formalism, specification language, and execution platform. Traditionally, a compiler analyzes a program written in a programming language and generates executable code. In contrast, a model compiler processes a source model written in a modeling language and should ideally support the generation of multiple heterogeneous targets. However, most existing modeling languages are designed with a narrow focus, typically targeting only simulation or implementation. Multi-target compilation, when not considered during the language's early design, becomes significantly harder to achieve. In this paper, we introduce our initiative: a toolchain and modeling language called M, designed to support system modeling and multi-target compilation for model-driven engineering of complex, concurrent, and time-aware systems. M is a textual, grammar-driven language based on the actor model and extended with discrete-event scheduling semantics. It provides constructs for modeling system entities, message-based interactions, and time- or state-triggered reactions. From such models, M enables the systematic generation of diverse target artifacts while preserving semantic conformance to the original model. Moreover, M can serve as a middle language to which other modeling languages may anchor, thereby allowing them to benefit from its compilation framework.
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Code and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
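The compression ratio used in the C3 abstract above is the number of text tokens divided by the number of latent tokens. The short sketch below works through that arithmetic; pairing the 32 and 64 latent lengths with the 20x and 40x ratios is an illustrative assumption, not a configuration reported in the abstract.

```python
def compression_ratio(num_text_tokens, num_latent_tokens):
    """Ratio of text tokens to latent tokens, as defined in the abstract."""
    return num_text_tokens / num_latent_tokens

def text_tokens_covered(num_latent_tokens, ratio):
    """Number of text tokens represented by the latents at a given ratio."""
    return num_latent_tokens * ratio

# Illustrative pairings of latent lengths and ratios (assumed, see lead-in).
covered_64_at_20x = text_tokens_covered(64, 20)
covered_64_at_40x = text_tokens_covered(64, 40)
covered_32_at_20x = text_tokens_covered(32, 20)
```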
OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA's three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
comment: 12 pages, 4 figures, 4 tables
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
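The abstract above does not say which intrinsic dimension estimator is used. As one concrete example of what such an estimator computes, here is a minimal sketch of the TwoNN maximum-likelihood estimate, which uses only the ratio of each point's second to first nearest-neighbour distance. The choice of TwoNN and the synthetic data are assumptions for illustration only.

```python
import math
import random

def two_nn_dimension(points):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    For each point, take the distances r1 <= r2 to its two nearest
    neighbours; the estimate is N / sum(log(r2 / r1)).
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        log_ratios.append(math.log(r2 / r1))
    return len(points) / sum(log_ratios)

# Synthetic check: points on a flat 2D sheet embedded in 3D space.
rng = random.Random(0)
plane = [(rng.random(), rng.random(), 0.0) for _ in range(400)]
estimate = two_nn_dimension(plane)
```

On this sheet the estimate should land near 2 even though each point has three coordinates, which is the sense in which ID measures geometric rather than ambient complexity.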
HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples
With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare their performance with that in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find that performance regresses from English to the Indian languages on 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points in Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs AAAI 2026
Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
comment: AAAI 2026 Workshop
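The TASA abstract above combines a continuous forgetting curve with knowledge tracing but does not give the functional form. A minimal sketch, assuming an Ebbinghaus-style exponential decay and a hypothetical `stability_days` parameter, shows how a mastery estimate could be discounted by elapsed time before a question is selected.

```python
import math

def decayed_mastery(mastery, elapsed_days, stability_days):
    """Discount a mastery estimate in [0, 1] by exponential forgetting.

    Assumed form: mastery * exp(-elapsed / stability). A larger stability
    means the skill is forgotten more slowly.
    """
    return mastery * math.exp(-elapsed_days / stability_days)

# Hypothetical student: mastery 0.8 on fractions, last practised 5 days ago.
fresh = decayed_mastery(0.8, elapsed_days=0, stability_days=5)
after_five_days = decayed_mastery(0.8, elapsed_days=5, stability_days=5)
```

A tutor built this way would treat a skill practised long ago as weaker than its last recorded score suggests, and could lower question difficulty accordingly.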
Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation ML4H 2025
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
comment: Accepted as proceedings paper for ML4H 2025
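The relative gains quoted for Task 2 in the abstract above follow directly from the reported scores. The short check below reproduces the arithmetic; the variable names are illustrative.

```python
def relative_gain_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

# Mean fidelity score: GPT-4o from video alone vs. with IAT conditioning.
fidelity_gain = round(relative_gain_percent(2.17, 2.44), 1)

# Share of admissible generations (score >= 3): 21% vs. 42%.
admissible_factor = 42 / 21
```

The two figures measure different things: the mean score rises by about 12%, while the fraction of generations reaching the admissible threshold doubles.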
CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. The early study of AMR trained the model solely on synthetic datasets. Moreover, its evaluation was based on an annotated dataset of fewer than 100 samples, which made the reported performance less reliable. To ensure performance for applications in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the train, valid, and test splits, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data outperformed a model trained solely on the synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
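The Recall1@0.7 metric in the CASTELLA abstract above is read here, following the usual moment retrieval convention, as the fraction of queries whose top-1 predicted segment overlaps the ground truth with temporal IoU of at least 0.7. That reading, the single ground-truth segment per query, and the example segments are assumptions for illustration.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, threshold=0.7):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)

# Hypothetical top-1 predictions and ground-truth moments for three queries.
preds = [(10.0, 20.0), (5.0, 9.0), (30.0, 45.0)]
gts = [(11.0, 20.0), (0.0, 9.0), (50.0, 60.0)]
score = recall_at_1(preds, gts)

# Dataset size implied by the split counts in the abstract.
total_recordings = 1009 + 213 + 640
```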
Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents
The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents (Scientist, Extractor, and Tester) that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented generation (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.
comment: 19 pages, 4 figures, in review
ProRAC: A Neuro-symbolic Method for Reasoning about Actions with LLM-based Progression
In this paper, we propose ProRAC (Progression-based Reasoning about Actions and Change), a neuro-symbolic framework that leverages LLMs to tackle RAC problems. ProRAC extracts fundamental RAC elements including actions and questions from the problem, progressively executes each action to derive the final state, and then evaluates the query against the progressed state to arrive at an answer. We evaluate ProRAC on several RAC benchmarks, and the results demonstrate that our approach achieves strong performance across different benchmarks, domains, LLM backbones, and types of RAC tasks.
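The progression loop described in the ProRAC abstract above (apply each action in turn, then evaluate the query on the final state) can be sketched as follows. In the paper the progression step is LLM-based; here it is replaced by a hand-written rule for a toy blocks domain, and all names and data are hypothetical.

```python
def progress(state, action):
    """Apply one action to a state and return the successor state.

    Stand-in for the LLM-based progression step: a fixed rule for a toy
    domain where each block maps to the thing it rests on.
    """
    kind, block, target = action
    new_state = dict(state)
    if kind == "move":
        new_state[block] = target
    return new_state

def answer_query(initial_state, actions, query):
    """Progress the state through every action, then evaluate the query."""
    state = initial_state
    for action in actions:
        state = progress(state, action)
    return query(state)

# Toy problem: B starts on A; move B to the table, then put A on B.
initial = {"A": "table", "B": "A"}
actions = [("move", "B", "table"), ("move", "A", "B")]
result = answer_query(initial, actions, lambda s: s["A"] == "B")
```

Because the query is only evaluated against the final progressed state, the intermediate reasoning is reduced to a sequence of single-action updates.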
Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at https://github.com/llm-jp/eval_vertical_ja.
comment: 17 pages, 8 figures
Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation
Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.
comment: 10 pages, theoretical/mathematical LLM research, no figures, intended for peer-reviewed journal
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26× cheaper.
Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language
Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.
What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning
This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.
comment: Received the Best Student Project Award at RuG's Advanced-NLP course
The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems
This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.
Step-Audio-R1 Technical Report
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
comment: 15 pages, 5 figures. Technical Report
Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants have been extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a shortcoming for automated scoring of long essays. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results show a clear improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 tables, 7 figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
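The abstract above reports scoring accuracy as QWK (quadratic weighted kappa). As a point of reference, the following is a minimal sketch of the standard QWK definition for integer ratings; the function name, its arguments, and the toy ratings are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Standard quadratic weighted kappa for two lists of integer ratings.

    Illustrative helper (hypothetical name), not the paper's code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("rating lists must be non-empty and of equal length")
    n_ratings = max_rating - min_rating + 1
    if n_ratings < 2:
        raise ValueError("need at least two distinct rating levels")

    total = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # joint counts O[i, j]
    hist_true = Counter(y_true)              # marginal counts of human scores
    hist_pred = Counter(y_pred)              # marginal counts of model scores

    numerator = 0.0    # sum of w[i, j] * O[i, j]
    denominator = 0.0  # sum of w[i, j] * E[i, j]
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0.0:
        # Both raters used a single identical rating everywhere.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    human = [1, 2, 3, 4, 4, 2]
    model = [1, 2, 3, 4, 3, 2]
    print(round(quadratic_weighted_kappa(human, model, 1, 4), 4))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.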
Fairshare Data Pricing via Data Valuation for Large Language Models
Training data is the backbone of large language models (LLMs), yet today's data markets often operate under exploitative pricing -- sourcing data from marginalized groups with little pay or recognition. This paper introduces a theoretical framework for LLM data markets, modeling the strategic interactions between buyers (LLM builders) and sellers (human annotators). We begin with theoretical and empirical analysis showing how exploitative pricing drives high-quality sellers out of the market, degrading data quality and long-term model performance. Then we introduce fairshare, a pricing mechanism grounded in data valuation that quantifies each data point's contribution. It aligns incentives by sustaining seller participation and optimizing utility for both buyers and sellers. Theoretically, we show that fairshare yields mutually optimal outcomes: maximizing long-term buyer utility and seller profit while sustaining market participation. Empirically, when training open-source LLMs on complex NLP tasks, including math problems, medical diagnosis, and physical reasoning, fairshare boosts seller earnings and ensures a stable supply of high-quality data, while improving buyers' performance-per-dollar and long-term welfare. Our findings offer a concrete path toward fair, transparent, and economically sustainable data markets for LLMs.
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
comment: 29 pages, 9 tables, 6 figures
Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
comment: 17 pages, 5 figures
Newswire Extraction: A pipeline for extracting newswires from newspaper images
I describe a new pipeline for extracting wire services (e.g., Associated Press, United Press International, Newspaper Enterprise Association) from newspaper images.
Privacy Preserving In-Context-Learning Framework for Large Language Models
Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility. Our code is available at https://github.com/bhusalb/privacy-preserving-icl.
comment: Git repo: https://github.com/bhusalb/privacy-preserving-icl
Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement
Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
comment: 12 pages, 8 figures, 5 tables. Dataset and code available at https://doi.org/10.5281/zenodo.16890154 and https://doi.org/10.5281/zenodo.15837871
On the Alignment of Large Language Models with Global Human Opinion
Today's large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly examine the opinions represented by LLMs among demographic groups in the United States or a few other countries; they lack worldwide country samples, studies of human opinions in different historical periods, and discussion of using language to steer LLMs. Moreover, they overlook the potential influence of prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs align appropriately with, or over-align with, the opinions of only a few countries while under-aligning with the opinions of most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire steers LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/ku-nlp/global-opinion-alignment and https://github.com/nlply/global-opinion-alignment.
comment: 28 pages, 26 figures
Investigating Hallucination in Conversations for Low Resource Languages
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3B to 72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), encouraging the models to identify shapes, count and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2%), and the mean MindCube accuracy rose from 31.4% to 38.1% (+6.7%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer broadly transferable spatial skills on vision-language models. Code and the Euclid30K dataset are available at https://zgca-ai4edu.github.io/Euclids_Gift.
Retrieval Augmented Generation based context discovery for ASR EMNLP 2025
This work investigates retrieval augmented generation as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, in order to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. However, identifying the right context automatically remains an open challenge. This work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on TED-LIUMv3, Earnings21, and SPGISpeech demonstrate that the proposed approach reduces WER by up to 17% (percentage difference) relative to using no context, while the oracle context results in a reduction of up to 24.1%.
comment: Accepted at EMNLP 2025
A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads. The code is publicly available at https://github.com/FerranAgulloLopez/GPULLMAdapterOptimization.
comment: Accepted in a computer science workshop
MessIRve: A Large-Scale Spanish Information Retrieval Dataset EMNLP 2025
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
comment: Camera-ready for EMNLP 2025 (main conference)
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents EMNLP 2025
Cross-lingual information retrieval (CLIR) helps users find documents in languages different from their queries. This is especially important in academic search, where key research is often published in non-English languages. We present CLIRudit, a novel English-French academic retrieval dataset built from Érudit, a Canadian publishing platform. Using multilingual metadata, we pair English author-written keywords as queries with non-English abstracts as target documents, a method that can be applied to other languages and repositories. We benchmark various first-stage sparse and dense retrievers, with and without machine translation. We find that dense embeddings without translation perform nearly as well as systems using machine translation, that translating documents is generally more effective than translating queries, and that sparse retrievers with document translation remain competitive while offering greater efficiency. Along with releasing the first English-French academic retrieval dataset, we provide a reproducible benchmarking method to improve access to non-English scholarly content.
comment: Camera-ready for the 5th Multilingual Representation Learning (MRL) Workshop (Co-located with EMNLP 2025)
A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where significant and long-standing challenges (e.g., privacy, anonymization, and data governance) have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues, which are especially sensitive and difficult to collect and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains constrained by two fundamental limitations: (i) the absence of global planning to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
comment: 8 pages, 3 figures, 4 tables
ReFactX: Scalable Reasoning with Reliable Facts via Constrained Generation ISWC
Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when lacking the necessary information to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. Yet, they rely on additional models or services, resulting in complex pipelines, potential error propagation, and often requiring the model to process a large number of tokens. In this paper, we present a scalable method that enables LLMs to access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized in textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts with constrained generation which allows only sequences of tokens that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at https://github.com/rpo19/ReFactX.
comment: 19 pages, 6 figures, accepted at ISWC
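The ReFactX abstract above describes constrained generation over a prefix tree of tokenized facts. The following toy sketch illustrates that general idea only: whitespace-split words stand in for model tokens, and `FactTrie`, `constrained_greedy`, `END`, and the scoring function are hypothetical names chosen for illustration, not the ReFactX API or implementation.

```python
END = "<end>"  # hypothetical sentinel marking the end of a complete fact


class FactTrie:
    """Prefix tree over token sequences; each stored fact ends with END."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})
        node[END] = {}

    def allowed_next(self, prefix):
        """Return the tokens that keep `prefix` on a path to an indexed fact."""
        node = self.root
        for token in prefix:
            node = node.get(token)
            if node is None:
                return set()
        return set(node.keys())


def constrained_greedy(trie, score):
    """Greedy decoding restricted to the tokens the trie allows.

    `score(prefix, token)` stands in for a language model's next-token score.
    """
    prefix = []
    while True:
        allowed = trie.allowed_next(prefix)
        if not allowed:
            break
        # sorted() makes tie-breaking deterministic in this toy example.
        best = max(sorted(allowed), key=lambda t: score(prefix, t))
        if best == END:
            break
        prefix.append(best)
    return prefix


if __name__ == "__main__":
    trie = FactTrie()
    for fact in ("Rome capital_of Italy",
                 "Rome located_in Europe",
                 "Paris capital_of France"):
        trie.insert(fact.split())

    # Toy scorer that prefers two specific tokens; a real system would use
    # model logits here.
    preferred = {"Rome", "located_in"}
    print(constrained_greedy(trie, lambda p, t: 1.0 if t in preferred else 0.0))
```

In a real decoder the same lookup would typically be applied as a mask over the model's vocabulary at each step, so that only continuations of an indexed fact can be generated.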
Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning EMNLP 2025
Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding-based and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g., ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
comment: Accepted to EMNLP 2025 Industry Track
The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improving text generation and cross-lingual transfer for low-resource, morphologically complex languages.
ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions AAAI 2026
Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
comment: Accepted to AAAI 2026
Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation AAAI 2026
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
comment: Accepted as an oral presentation by AAAI 2026
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) on these prompts reveals significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
comment: Code, data and leaderboard: https://github.com/PKU-YuanGroup/WISE
Where does an LLM begin computing an instruction?
Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins: the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset: interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
comment: Extended Abstract accepted at UniReps '25 Workshop
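The layer-wise flip rate described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact onset criterion, so the thresholding rule in `onset_layer` and both function names are hypothetical, chosen only to make the idea concrete.

```python
from typing import Sequence


def flip_rate(clean_preds: Sequence[str], patched_preds: Sequence[str]) -> float:
    """Fraction of prompts whose predicted answer changes after patching."""
    if len(clean_preds) != len(patched_preds):
        raise ValueError("prediction lists must have equal length")
    if not clean_preds:
        raise ValueError("need at least one prompt")
    flips = sum(c != p for c, p in zip(clean_preds, patched_preds))
    return flips / len(clean_preds)


def onset_layer(rates: Sequence[float], threshold: float = 0.5) -> int:
    """Hypothetical onset rule (an assumption, not the paper's definition):
    the first layer from which the flip rate stays below `threshold`.
    Returns len(rates) if patching remains effective at every layer."""
    for layer in range(len(rates)):
        if all(r < threshold for r in rates[layer:]):
            return layer
    return len(rates)
```

Calling `flip_rate` once per patched layer yields the per-layer curve that `onset_layer` scans for the point where interventions stop changing the answer.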
HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks that require understanding implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a modest performance improvement (around 1-3%) in larger models compared to the baseline.
Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations
Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.
comment: Under review; 16 pages, 15 figures
Step-Audio-EditX Technical Report
We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
Confidential Prompting: Privacy-preserving LLM Inference on Cloud
This paper introduces a vision of confidential prompting: securing user prompts from an untrusted, cloud-hosted large language model (LLM) while preserving model confidentiality, output invariance, and compute efficiency. As a first step toward this vision, we present Petridish, a system built on top of confidential computing and its core contribution, a novel technology called Secure Partitioned Decoding (SPD). Petridish runs the LLM service inside a confidential virtual machine (CVM), which protects the secrets, i.e., the LLM parameters and user prompts, from adversaries outside the CVM. Importantly, it splits the LLM service for a user into two processes, using SPD: a per-user process performs prefill with the user prompts and computes attention scores during decoding; a service process, shared by all users, batches the attention scores from per-user processes and generates output tokens for all users. Both the LLM provider and the users trust Petridish's CVM and its operating system, which guarantees isolation between processes and limits their outbound network capabilities to control information flow. The CVM's attestation capability and its open-source software stack enable Petridish to provide auditable protection of both user prompt and LLM confidentiality. Together, Petridish maintains full utility of LLM service and enables practical, privacy-preserving cloud-hosted LLM inference for sensitive applications, such as processing personal data, clinical records, and financial documents.
MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities AAAI 2026
Recent advancements in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across various model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 32B) reveals that acquiring these deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs, including notable declines in helpfulness and harmlessness, alongside substantially increased inference costs. Importantly, we demonstrate that adaptive reasoning -- employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking -- can effectively alleviate these drawbacks. Our empirical insights underline the critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics.
comment: To appear at AAAI 2026
Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles
Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.
Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start SP
Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts a reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: it (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) performs preference-based training to learn, focusing on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupling learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution "stuckness," improving exploration, stabilizing training, and raising the performance ceiling.
comment: Project Page: https://github.com/Kwen-Chen/SPECS-VL
Bias after Prompting: Persistent Discrimination in Large Language Models
A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks -- for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduces bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents
Tool agents -- LLM-based systems that interact with external APIs -- offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We will release the dataset and code publicly.
Better LLM Reasoning via Dual-Play
Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions' quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver's limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.
Eguard: Defending LLM Embeddings Against Inversion Attacks via Text Mutual Information Optimization
Embeddings have become a cornerstone in the functionality of large language models (LLMs) due to their ability to transform text data into rich, dense numerical representations that capture semantic and syntactic properties. These embedding vector databases serve as the long-term memory of LLMs, enabling efficient handling of a wide range of natural language processing tasks. However, the surge in popularity of embedding vector databases in LLMs has been accompanied by significant concerns about privacy leakage. Embedding vector databases are particularly vulnerable to embedding inversion attacks, where adversaries can exploit the embeddings to reverse-engineer and extract sensitive information from the original text data. Existing defense mechanisms have shown limitations, often struggling to balance security with the performance of downstream tasks. To address these challenges, we introduce Eguard, a novel defense mechanism designed to mitigate embedding inversion attacks. Eguard employs a transformer-based projection network and text mutual information optimization to safeguard embeddings while preserving the utility of LLMs. Our approach significantly reduces privacy risks, protecting over 95% of tokens from inversion while maintaining high performance across downstream tasks consistent with original embeddings.
Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts?
Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this work, we study the degree to which current multilingual language models (mLMs) understand based on subword-level semantic concepts. To this end, we form "semantic tokens" by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on five heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that the zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that the shared subword-level semantics may serve as the anchors for cross-lingual transfer.
comment: 8 pages, 9 figures
Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models
Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline, encompassing (i) data collection methodologies, (ii) full-parameter and parameter-efficient fine-tuning strategies, and (iii) evaluation protocols. We categorize data construction into three major paradigms: expert annotation, distillation from larger models, and self-improvement mechanisms, each offering distinct trade-offs between quality, scalability, and resource cost. Fine-tuning techniques range from conventional supervised training to lightweight approaches, such as low-rank adaptation (LoRA) and prefix tuning, with a focus on computational efficiency and model reusability. We further examine the challenges of evaluating faithfulness, utility, and safety across multilingual and multimodal scenarios, highlighting the emergence of domain-specific benchmarks in healthcare, legal, and financial applications. Finally, we discuss promising directions for automated data generation, adaptive optimization, and robust evaluation frameworks, arguing that a closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs. This survey aims to serve as a practical reference for researchers and practitioners seeking to design LLMs that are both effective and reliably aligned with human intentions.
comment: 24 pages, 7 figures, 5 tables
Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks EMNLP 2025
We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired by speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applying these tasks to XLM-R (Conneau et al., 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.
comment: Published in EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1210.pdf
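As a rough illustration of an ABX-style discrimination score, the sketch below counts a triple (A, B, X) as correct when X, which shares A's category, lies closer to A than to B. The use of cosine distance, the triple layout, and the function names are assumptions made for this example; the abstract above does not specify them.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-norm vector")
    return 1.0 - dot / norm


def abx_accuracy(triples: List[Tuple[Vector, Vector, Vector]]) -> float:
    """Each triple is (A, B, X), where X belongs to the same category as A.
    A triple is correct when X is strictly closer to A than to B."""
    if not triples:
        raise ValueError("need at least one triple")
    correct = sum(
        cosine_distance(a, x) < cosine_distance(b, x) for a, b, x in triples
    )
    return correct / len(triples)
```

Because the score needs only representations and a distance, it requires no trained probe, which matches the training-free framing of the tasks.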
Verbalized Algorithms NeurIPS 2025
Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call verbalized algorithms (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that LLMs should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, verbalized sorting uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
comment: Accepted in NeurIPS 2025 Workshop on Efficient Reasoning
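The verbalized sorting idea can be sketched as a bitonic sorting network whose only comparison primitive is a pluggable oracle. In the sketch below, `length_oracle` is a hypothetical deterministic stand-in for the LLM call, and the power-of-two length restriction is a simplification of this example rather than something stated in the abstract.

```python
from typing import Callable, List

# Oracle returns True when the first string should come before (or tie with) the second.
Oracle = Callable[[str, str], bool]


def length_oracle(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM comparison: shorter strings first."""
    return len(a) <= len(b)


def bitonic_sort(items: List[str], comes_first: Oracle) -> List[str]:
    """Sort with a bitonic network; every comparison goes through the oracle."""
    n = len(items)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(items)
    k = 2
    while k <= n:  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:  # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    in_order = comes_first(out[i], out[partner])
                    if in_order != ascending:
                        out[i], out[partner] = out[partner], out[i]
            j //= 2
        k *= 2
    return out
```

The comparison schedule of a sorting network is fixed in advance and does not depend on earlier answers, so the oracle calls within each inner pass are independent of one another and could be batched.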
Efficient Architectures for High Resolution Vision-Language Models COLING 2025
Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high-resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.
comment: Accepted at COLING 2025
HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models NeurIPS 2025
Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII compared to popular open-source VLMs. The code is available at https://github.com/yimuwangcs/wise-hawaii.
comment: NeurIPS 2025
Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes
Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning on curated datasets. Gender stereotype mitigation is a representative task within the broader application of moral alignment. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget only stereotypical knowledge through a carefully designed fairness objective, while preserving their language modeling capability (overall forgetting). In this short paper, we investigate whether the performance trade-off can be achieved through the lens of forgetting and the fairness objective. Our analysis shows that the large datasets needed for satisfactory fairness highlight the limitations of current fairness objectives in achieving an effective trade-off: (1) downstream task performance is strongly correlated with overall forgetting; (2) selective forgetting reduces stereotypes, but overall forgetting increases; and (3) general solutions for alleviating forgetting are ineffective at reducing the overall forgetting and fail to improve downstream task performance.
Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks
Transformer-based models, especially large language models (LLMs), dominate the field of NLP with their mass adoption in tasks such as text generation, summarization and fake news detection. These models offer ease of deployment and reliability for most applications; however, they require significant amounts of computational power for training as well as inference. This poses challenges in their adoption in resource-constrained applications, especially in the open-source community where compute availability is usually scarce. This work proposes a graph-based approach for Environmental Claim Detection, exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives to transformer-based models. Re-framing the task as a graph classification problem, we transform claim sentences into dependency parsing graphs, utilizing a combination of word2vec and learnable part-of-speech (POS) tag embeddings for the node features and encoding syntactic dependencies in the edge relations. Our results show that our graph-based models, particularly HGNNs in the Poincaré space (P-HGNNs), achieve performance superior to the state-of-the-art on environmental claim detection while using up to 30x fewer parameters. We also demonstrate that HGNNs benefit vastly from explicitly modeling data in hierarchical (tree-like) structures, enabling them to significantly improve over their Euclidean counterparts.
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
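The core operation described here, adding a steering vector to activations, can be sketched in a few lines. This is a toy illustration on random arrays, not the authors' implementation: the helper names, the mean-difference construction of the vector, and the `scale` parameter are all assumptions, since the abstract does not specify how the vector is built or at which layer it is applied.

```python
import numpy as np

def mean_difference_vector(acts_target, acts_baseline):
    """Hypothetical construction: difference of mean activations between two
    sets of prompts (e.g. deployment-like vs. evaluation-like contexts)."""
    return np.mean(acts_target, axis=0) - np.mean(acts_baseline, axis=0)

def apply_steering(hidden, vector, scale=1.0):
    """Add the scaled steering vector to every token position.

    hidden: (seq_len, d_model) activations at one layer.
    vector: (d_model,) steering direction.
    """
    return hidden + scale * vector

rng = np.random.default_rng(0)
d_model = 8
acts_deploy = rng.normal(size=(16, d_model))  # stand-in for real activations
acts_eval = rng.normal(size=(16, d_model))    # stand-in for real activations
steer = mean_difference_vector(acts_deploy, acts_eval)

hidden = rng.normal(size=(4, d_model))        # one toy sequence of 4 tokens
steered = apply_steering(hidden, steer, scale=2.0)
```

In a real model the addition would typically be done inside a forward hook on a chosen layer; the arithmetic is the same broadcasted sum shown above.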
Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
comment: 18 pages, 10 figures; to appear in AAAI ICWSM 2026
CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, an evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 22% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson's correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.
comment: Preprint, under review
Human-Computer Interaction
Infrastructuring Pop-Up Cities with "Social Layer": Designing Serendipitous Co-Livings for Temporary Intentional Communities
After the pandemic, a new form of "pop-up city" has emerged -- co-living gatherings of 100-200 people for 4-8 weeks that differ from conferences and hack houses. These temporary intentional communities leverage existing urban infrastructure, blending daily life (housing, meals, care) with self-organized activities like learning, creating, and socializing. They coordinate bottom-up programming through an "unconference" system for identity, calendaring, RSVP, and social discovery that fosters spontaneous, serendipitous, enduring ties. This paper examines the design of "Social Layer," an unconferencing system for pop-up cities. We studied its real-world deployment in ShanHaiWoo (Jilin, China, 2023), muChiangmai (Chiangmai, Thailand, 2023), Edge Esmeralda (Healdsburg, CA, USA, 2024), Aleph (Buenos Aires, Argentina, 2024), and Gathering of Tribe (Lisbon, Portugal, 2024). Our findings distill: (1) the strong concept "scaffolded spontaneity" -- infrastructural affordances that balance structure with openness, amplifying participant agency while maintaining privacy and lightweight governance; (2) design implications for design researchers working on pop-up cities.
comment: Submitted to DRS 2026
DuoZone: A User-Centric, LLM-Guided Mixed-Initiative XR Window Management System
Mixed reality (XR) environments offer vast spatial possibilities, but current window management systems require users to manually place, resize, and organize multiple applications across large 3D spaces. This creates cognitive and interaction burdens that limit productivity. We introduce DuoZone, a mixed-initiative XR window management system that combines user-defined spatial layouts with LLM-guided automation. DuoZone separates window management into two complementary zones. The Recommendation Zone enables fast setup by providing spatial layout templates and automatically recommending relevant applications based on user tasks and high-level goals expressed through voice or text. The Arrangement Zone supports precise refinement through direct manipulation, allowing users to adjust windows using natural spatial actions such as dragging, resizing, and snapping. Through this dual-zone approach, DuoZone promotes efficient organization while reducing user cognitive load. We conducted a user study comparing DuoZone with a baseline manual XR window manager. Results show that DuoZone improves task completion speed, reduces mental effort, and increases sense of control when working with multiple applications in XR. We discuss design implications for future mixed-initiative systems and outline opportunities for integrating adaptive, goal-aware intelligence into spatial computing workflows.
A Decade of Systems for Human Data Interaction
Human-data interaction (HDI) presents fundamentally different challenges from traditional data management. HDI systems must meet latency, correctness, and consistency needs that stem from usability rather than query semantics; failing to meet these expectations breaks the user experience. Moreover, interfaces and systems are tightly coupled; neither can easily be optimized in isolation, and effective solutions demand their co-design. This dependence also presents a research opportunity: rather than adapt systems to interface demands, systems innovations and database theory can also inspire new interaction and visualization designs. We survey a decade of our lab's work that embraces this coupling and argue that HDI systems are the foundation for reliable, interactive, AI-driven applications.
Computer-Use Agents as Judges for Generative User Interface
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA act as judges to assist Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
comment: Project: https://showlab.github.io/AUI Github: https://github.com/showlab/AUI
Game Master LLM: Task-Based Role-Playing for Natural Slang Learning
Natural and idiomatic expressions are essential for fluent, everyday communication, yet many second-language learners struggle to acquire and spontaneously use casual slang despite strong formal proficiency. To address this gap, we designed and evaluated an LLM-powered, task-based role-playing game in which a GPT-4o-based Game Master guides learners through an immersive, three-phase spoken narrative. After selecting five unfamiliar slang phrases to practice, participants engage in open-ended dialogue with non-player characters; the Game Master naturally incorporates the target phrases in rich semantic contexts (implicit input enhancement) while a dedicated Practice Box provides real-time explicit tracking and encouragement. Post-session, learners receive multi-level formative feedback analyzing the entire interaction. We evaluated the system in a between-subjects study with 14 international graduate students, randomly assigned to either the RPG condition or a control condition consisting of a traditional AI-led virtual classroom. Results from an immediate post-test show that the RPG group achieved greater gains in both comprehension of the target phrases and their accurate, contextual use in sentences. Quantitative analysis of in-activity word-usage frequency, combined with qualitative survey responses, further indicates that the game-based approach provided more practice opportunities and higher perceived engagement, resulting in a more natural learning experience. These findings highlight the potential of narrative-driven LLM interactions in vocabulary acquisition.
People readily follow personal advice from AI but it does not improve their well-being
People increasingly seek personal advice from large language models (LLMs), yet whether humans follow their advice, and its consequences for their well-being, remains unknown. In a longitudinal randomised controlled trial with a representative UK sample (N = 2,302), 75% of participants who had a 20-minute discussion with GPT-4o about health, careers or relationships subsequently reported following its advice. Based on autograder evaluations of chat transcripts, LLM advice rarely violated safety best practice. When queried 2-3 weeks later, participants who had interacted with personalised AI (with access to detailed user information) followed its advice more often in the real world and reported higher well-being than those advised by non-personalised AI. However, while receiving personal advice from AI temporarily reduced well-being, no differential long-term effects compared to a control emerged. Our results suggest that humans readily follow LLM advice about personal issues but doing so shows no additional well-being benefit over casual conversations.
Reflexive Evidence-Based Multimodal Learning for Clean Energy Transitions: Causal Insights on Cooking Fuel Access, Urbanization, and Carbon Emissions
Achieving Sustainable Development Goal 7 (Affordable and Clean Energy) requires not only technological innovation but also a deeper understanding of the socioeconomic factors influencing energy access and carbon emissions. While these factors are gaining attention, critical questions remain, particularly regarding how to quantify their impacts on energy systems, model their cross-domain interactions, and capture feedback dynamics in the broader context of energy transitions. To address these gaps, this study introduces ClimateAgents, an AI-based framework that combines large language models with domain-specialized agents to support hypothesis generation and scenario exploration. Leveraging 20 years of socioeconomic and emissions data from 265 economies, countries and regions, and 98 indicators drawn from the World Bank database, the framework applies a machine-learning-based causal inference approach to identify key determinants of carbon emissions in an evidence-based, data-driven manner. The analysis highlights three primary drivers: access to clean cooking fuels in rural areas, access to clean cooking fuels in urban areas, and the percentage of population living in urban areas. These findings underscore the critical role of clean cooking technologies and urbanization patterns in shaping emission outcomes. In line with growing calls for evidence-based AI policy, ClimateAgents offers a modular and reflexive learning system that supports the generation of credible and actionable insights for policy. By integrating heterogeneous data modalities, including structured indicators, policy documents, and semantic reasoning, the framework contributes to adaptive policymaking infrastructures that can evolve with complex socio-technical challenges. This approach aims to support a shift from siloed modeling to reflexive, modular systems designed for dynamic, context-aware climate action.
DesignerlyLoop: Bridging the Cognitive Gap through Visual Node-Based Reasoning in Human-AI Collaborative Design
Large language models (LLMs) offer powerful support for design tasks, yet their goal-oriented, single-turn responses often misalign with the nonlinear, exploratory nature of design processes. This mismatch creates a cognitive gap, limiting designers' ability to articulate evolving intentions, critically evaluate outputs, and maintain creative agency. To address these challenges, we developed DesignerlyLoop, a visual node-based system that embeds LLM reasoning chains into the design workflow. The system enables designers to externalize and curate reasoning structures, iteratively organize intentions, and interact with LLMs as dynamic cognitive engines rather than static answer providers. We conducted a within-subject study with 20 designers, combining qualitative and quantitative methods, and found that DesignerlyLoop enhanced creative reflection, design quality, and interaction experience by supporting systematic engagement with both human and machine reasoning. These findings highlight the potential of structured, interactive visualization to transform human-AI co-creation into a reflective and iterative design process.
Opinion Dynamics Models for Sentiment Evolution in Weibo Blogs
Online social media platforms enable influencers to distribute content and quickly capture audience reactions, significantly shaping their promotional strategies and advertising agreements. Understanding how sentiment dynamics and emotional contagion unfold among followers is vital for influencers and marketers, as these processes shape engagement, brand perception, and purchasing behavior. While sentiment analysis tools effectively track sentiment fluctuations, dynamical models explaining their evolution remain limited, often neglecting network structures and interactions both among blogs and between their topic-focused follower groups. In this study, we tracked influential tech-focused Weibo bloggers over six months, quantifying follower sentiment from text-mined feedback. By treating each blogger's audience as a single "macro-agent", we find that sentiment trajectories follow the principle of iterative averaging -- a foundational mechanism in many dynamical models of opinion formation, a theoretical framework at the intersection of social network analysis and dynamical systems theory. The sentiment evolution aligns closely with opinion-dynamics models, particularly modified versions of the classical French-DeGroot model that incorporate delayed perception and distinguish between expressed and private opinions. The inferred influence structures reveal interdependencies among blogs that may arise from homophily, whereby emotionally similar users subscribe to the same blogs and collectively shape the shared sentiment expressed within these communities.
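The iterative-averaging principle mentioned above can be sketched with the classical French-DeGroot update x(t+1) = W x(t), where W is a row-stochastic influence matrix. The matrix and initial sentiments below are made-up illustrative values, not data from the study, and the sketch covers only the classical model, not the modified variants with delayed perception or separate expressed and private opinions.

```python
import numpy as np

def degroot_run(W, x0, steps):
    """Iterate x(t+1) = W @ x(t) and return the whole trajectory.

    W: row-stochastic influence matrix (non-negative rows summing to 1).
    x0: initial opinion (here: sentiment) of each macro-agent.
    """
    W = np.asarray(W, dtype=float)
    x = np.asarray(x0, dtype=float)
    if (W < 0).any() or not np.allclose(W.sum(axis=1), 1.0):
        raise ValueError("W must be row-stochastic")
    trajectory = [x]
    for _ in range(steps):
        x = W @ x  # each agent moves to a weighted average of all agents
        trajectory.append(x)
    return np.array(trajectory)

# Hypothetical influence weights among three blogs' audiences.
W = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
x0 = np.array([0.9, -0.4, 0.1])  # hypothetical initial sentiment scores
trajectory = degroot_run(W, x0, steps=50)
```

With a strictly positive W, as here, the sentiments converge to a common value that lies between the smallest and largest initial sentiment; fitting W to observed trajectories is the inverse problem the abstract refers to as inferring influence structures.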
PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback
Effective presentation skills are essential in education, professional communication, and public speaking, yet learners often lack access to high-quality exemplars or personalized coaching. Existing AI tools typically provide isolated functionalities such as speech scoring or script generation without integrating reference modeling and interactive feedback into a cohesive learning experience. We introduce a dual-agent system that supports presentation practice through two complementary roles: the Ideal Presentation Agent and the Coach Agent. The Ideal Presentation Agent converts user-provided slides into model presentation videos by combining slide processing, visual-language analysis, narration script generation, personalized voice synthesis, and synchronized video assembly. The Coach Agent then evaluates user-recorded presentations against these exemplars, conducting multimodal speech analysis and delivering structured feedback in an Observation-Impact-Suggestion (OIS) format. To enhance the authenticity of the learning experience, the Coach Agent incorporates an Audience Agent, which simulates the perspective of a human listener and provides humanized feedback reflecting audience reactions and engagement. Together, these agents form a closed loop of observation, practice, and feedback. Implemented on a robust backend with multi-model integration, voice cloning, and error handling mechanisms, the system demonstrates how AI-driven agents can provide engaging, human-centered, and scalable support for presentation skill development in both educational and professional contexts.
comment: 13 pages, 6 figures
Efficient Transformer-Integrated Deep Neural Architectures for Robust EEG Decoding of Complex Visual Imagery
This study introduces a pioneering approach in brain-computer interface (BCI) technology, featuring our novel concept of complex visual imagery for non-invasive electroencephalography (EEG)-based communication. Complex visual imagery, as proposed in our work, involves the user engaging in the mental visualization of complex upper limb movements. This innovative approach significantly enhances the BCI system, facilitating the extension of its applications to more sophisticated tasks such as EEG-based robotic arm control. By leveraging this advanced form of visual imagery, our study opens new horizons for intricate and intuitive mind-controlled interfaces. We developed an advanced deep learning architecture that integrates functional connectivity metrics with a convolutional neural network-image transformer. This framework is adept at decoding subtle user intentions, addressing the spatial variability in complex visual tasks, and effectively translating these into precise commands for robotic arm control. Our comprehensive offline and pseudo-online evaluations demonstrate the framework's efficacy in real-time applications, including the nuanced control of robotic arms. The robustness of our approach is further validated through leave-one-subject-out cross-validation, marking a significant step towards versatile, subject-independent BCI applications. This research highlights the transformative impact of advanced visual imagery and deep learning in enhancing the usability and adaptability of BCI systems, particularly in robotic arm manipulation.
comment: Doctoral dissertation, Korea University, 2025
SWR-Viz: AI-assisted Interactive Visual Analytics Framework for Ship Weather Routing
Efficient and sustainable maritime transport increasingly depends on reliable forecasting and adaptive routing, yet operational adoption remains difficult due to forecast latencies and the need for human judgment in rapid decision-making under changing ocean conditions. We introduce SWR-Viz, an AI-assisted visual analytics framework that combines a physics-informed Fourier Neural Operator wave forecast model with SIMROUTE-based routing and interactive emissions analytics. The framework generates near-term forecasts directly from current conditions, supports data assimilation with sparse observations, and enables rapid exploration of what-if routing scenarios. We evaluate the forecast models and SWR-Viz framework along key shipping corridors in the Japan Coast and Gulf of Mexico, showing both improved forecast stability and realistic routing outcomes comparable to ground-truth reanalysis wave products. Expert feedback highlights the usability of SWR-Viz, its ability to isolate voyage segments with high emission reduction potential, and its value as a practical decision-support system. More broadly, this work illustrates how lightweight AI forecasting can be integrated with interactive visual analytics to support human-centered decision-making in complex geospatial and environmental domains.
Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs AAAI 2026
Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
comment: AAAI 2026 Workshop
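To make the idea of a continuous forgetting curve concrete, the sketch below decays a mastery estimate exponentially with elapsed time, in the style of an Ebbinghaus curve. The functional form, the function name, and the `stability_days` parameter are assumptions for illustration only; the abstract does not state which curve TASA uses or how it is combined with knowledge tracing.

```python
import math

def decayed_mastery(mastery, elapsed_days, stability_days):
    """Hypothetical forgetting model: mastery * exp(-elapsed / stability).

    mastery: estimated mastery in [0, 1] right after the last practice.
    elapsed_days: time since that practice.
    stability_days: larger values mean slower forgetting.
    """
    if stability_days <= 0:
        raise ValueError("stability_days must be positive")
    if elapsed_days < 0:
        raise ValueError("elapsed_days must be non-negative")
    return mastery * math.exp(-elapsed_days / stability_days)

# A tutor could use the decayed estimate to pick question difficulty,
# e.g. revisiting a skill once the estimate drops below some threshold.
fresh = decayed_mastery(0.8, elapsed_days=0, stability_days=5)
after_week = decayed_mastery(0.8, elapsed_days=7, stability_days=5)
```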
Cross-Modal Consistency-Guided Active Learning for Affective BCI Systems
Deep learning models perform best with abundant, high-quality labels, yet such conditions are rarely achievable in EEG-based emotion recognition. Electroencephalogram (EEG) signals are easily corrupted by artifacts and individual variability, while emotional labels often stem from subjective and inconsistent reports, making robust affective decoding particularly difficult. We propose an uncertainty-aware active learning framework that enhances robustness to label noise by jointly leveraging model uncertainty and cross-modal consistency. Instead of relying solely on EEG-based uncertainty estimates, the method evaluates cross-modal alignment to determine whether uncertainty originates from cognitive ambiguity or sensor noise. A representation alignment module embeds EEG and face features into a shared latent space, enforcing semantic coherence between modalities. Residual discrepancies are treated as noise-induced inconsistencies, and these samples are selectively queried for oracle feedback during active learning. This feedback-driven process guides the network toward reliable, informative samples and reduces the impact of noisy labels. Experiments on the ASCERTAIN dataset examine the efficiency and robustness of our framework, highlighting its potential as a data-efficient and noise-tolerant approach for EEG-based affective decoding in brain-computer interface systems.
Eye Care You: Voice Guidance Application Using Social Robot for Visually Impaired People
In this study, a social robot device was designed for visually impaired users, along with a mobile application that provides functions to assist them in daily life. Both the physical and mental conditions of visually impaired users are considered, and the mobile application provides four functions: photo record, mood lift, greeting guest, and today highlight. The application was designed for visually impaired users and uses voice control to provide a friendly interface. The photo record function allows visually impaired users to capture images immediately when they encounter dangerous situations. The mood lift function keeps visually impaired users company by asking questions, playing music, and reading articles. The greeting guest function answers visitors on behalf of visually impaired users, for whom doing so is physically inconvenient. In addition, the today highlight function reads news, including weather forecasts, daily horoscopes, and daily reminders, to visually impaired users. Multiple tools were adopted to develop the mobile application, and a website was developed for caregivers to check the status of visually impaired users and for marketing the application.
comment: Accepted in the 35th IPPR Conference on Computer Vision, Graphics, and Image Processing (CVGIP2022)
Simulated Human Learning in a Dynamic, Partially-Observed, Time-Series Environment
While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions - including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. Then, we develop reinforcement learning ITSs that combine learning the individual state of students while pulling from population information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but also introduce a cost-benefit decision to find a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches to find that they provide different solutions, but with similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost that we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regards to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and we find that our policies are able to boost the performance of quiz and midterm structures more than we can in a finals-only structure, highlighting the benefit of having additional information.
comment: Manuscript in preparation for IEEE Transactions on Education, 20 pages, 6 figures, 5 tables
Personalized targeted memory reactivation enhances consolidation of challenging memories via slow wave and spindle dynamics
Sleep is crucial for memory consolidation, underpinning effective learning. Targeted memory reactivation (TMR) can strengthen neural representations by re-engaging learning circuits during sleep. However, TMR protocols overlook individual differences in learning capacity and memory trace strength, limiting efficacy for difficult-to-recall memories. Here, we present a personalized TMR protocol that adjusts stimulation frequency based on individual retrieval performance and task difficulty during a word-pair memory task. In an experiment comparing personalized TMR, TMR, and control groups, the personalized protocol significantly reduced memory decay and improved error correction under challenging recall. Electroencephalogram (EEG) analyses revealed enhanced synchronization of slow waves and spindles, with a significant positive correlation between behavioral and EEG features for challenging memories. Multivariate classification identified distinct neural signatures linked to the personalized approach, highlighting its ability to target memory-specific circuits. These findings provide novel insights into sleep-dependent memory consolidation and support personalized TMR interventions to optimize learning outcomes.
A Quantitative Framework for Assessing Sleep Quality from EEG Time Series in Complex Dynamic Systems
Modern lifestyles contribute to insufficient sleep, impairing cognitive function and weakening the immune system. Sleep quality (SQ) is vital for physiological and mental health, making its understanding and accurate assessment critical. However, its multifaceted nature, shaped by neurological and environmental factors, makes precise quantification challenging. Here, we address this challenge by utilizing electroencephalography (EEG) for phase-amplitude coupling (PAC) analysis to elucidate the neurological basis of SQ, examining both states of sleep and wakefulness, including resting state (RS) and working memory. Our results revealed distinct patterns in beta power and delta connectivity in sleep and RS, together with the reaction time of working memory. A notable finding was the pronounced delta-beta PAC, a feature markedly stronger in individuals with good SQ. We further observed that SQ was positively correlated with increased delta-beta PAC. Leveraging these insights, we applied machine learning models to classify SQ at an individual level, demonstrating that the delta-beta PAC outperformed other EEG characteristics. These findings establish delta-beta PAC as a robust electrophysiological marker to quantify SQ and elucidate its neurological determinants.
comment: Doctoral dissertation, Korea University, 2025
A Crowdsourced Study of ChatBot Influence in Value-Driven Decision Making Scenarios
Similar to social media bots that shape public opinion, healthcare and financial decisions, LLM-based ChatBots like ChatGPT can persuade users to alter their behavior. Unlike prior work that persuades via overt partisan bias or misinformation, we test whether framing alone suffices. We conducted a crowdsourced study in which 336 participants interacted with a neutral ChatBot or one of two value-framed ChatBots while deciding whether to alter US defense spending. In this single policy domain with controlled content, participants exposed to value-framed ChatBots significantly changed their budget choices relative to the neutral control. When the frame misaligned with their values, some participants reinforced their original preference, revealing a potentially replicable backfire effect, originally considered rare in the literature. These findings suggest that value-framing alone lowers the barrier for manipulative uses of LLMs, revealing risks distinct from overt bias or misinformation, and clarifying risks to countering misinformation.
Writing With Machines and Peers: Designing for Critical Engagement with Generative AI
The growing integration of generative AI in higher education is transforming how students write, learn, and engage with knowledge. As AI tools become more integrated into classrooms, there is an urgent need for pedagogical approaches that help students use them critically and reflectively. This study proposes a pedagogical design that integrates AI and peer feedback in a graduate-level academic writing activity. Over eight weeks, students developed literature review projects through multiple writing and revision stages, receiving feedback from both a custom-built AI reviewer and human peers. We examine two questions: (1) How did students interact with and incorporate AI and peer feedback during the writing process? and (2) How did they reflect on and build relationships with both human and AI reviewers? Data sources include student writing artifacts, AI and peer feedback, AI chat logs, and student reflections. Findings show that students engaged differently with each feedback source, relying on AI for rubric alignment and surface-level edits, and on peer feedback for conceptual development and disciplinary relevance. Reflections revealed evolving relationships with AI, characterized by increasing confidence, strategic use, and critical awareness of its limitations. The pedagogical design supported writing development, AI literacy, and disciplinary understanding. This study offers a scalable pedagogical model for integrating AI into writing instruction and contributes insights for system-level approaches to fostering meaningful human-AI collaboration in higher education.
Can we use LLMs to bootstrap reinforcement learning? -- A case study in digital health behavior change
Personalizing digital applications for health behavior change is a promising route to making them more engaging and effective. This especially holds for approaches that adapt to users and their specific states (e.g., motivation, knowledge, wants) over time. However, developing such approaches requires making many design choices, whose effectiveness is difficult to predict from literature and costly to evaluate in practice. In this work, we explore whether large language models (LLMs) can be used out-of-the-box to generate samples of user interactions that provide useful information for training reinforcement learning models for digital behavior change settings. Using real user data from four large behavior change studies as comparison, we show that LLM-generated samples can be useful in the absence of real data. Comparisons to the samples provided by human raters further show that LLM-generated samples reach the performance of human raters. Additional analyses of different prompting strategies, including shorter and longer prompt variants, chain-of-thought prompting, and few-shot prompting, show that the relative effectiveness of different strategies depends on both the study and the LLM, with relatively large differences even between prompt paraphrases alone. We provide recommendations for how LLM-generated samples can be useful in practice.
The Trust Calibration Maturity Model for Characterizing and Communicating Trustworthiness of AI Systems
Recent proliferation of powerful AI systems has created a strong need for capabilities that help users to calibrate trust in those systems. As AI systems grow in scale, information required to evaluate their trustworthiness becomes less accessible, presenting a growing risk of using these systems inappropriately. We propose the Trust Calibration Maturity Model (TCMM) to characterize and communicate information about AI system trustworthiness. The TCMM incorporates five dimensions of analytic maturity: Performance Characterization, Bias & Robustness Quantification, Transparency, Safety & Security, and Usability. The TCMM can be presented along with system performance information to (1) help a user to appropriately calibrate trust, (2) establish requirements and track progress, and (3) identify research needs. Here, we discuss the TCMM and demonstrate it on two target tasks: using ChatGPT for high consequence nuclear science determinations, and using PhaseNet (an ensemble of seismic models) for categorizing sources of seismic events.
comment: 19 pages, 4 figures, 3 tables
What Makes a Visualization Image Complex? IEEE VIS 2025
We investigate the perceived visual complexity (VC) in data visualizations using objective image-based metrics. We collected VC scores through a large-scale crowdsourcing experiment involving 349 participants and 1,800 visualization images. We then examined how these scores align with 12 image-based metrics spanning information-theoretic, clutter, color, and our two object-based metrics. Our results show that both low-level image properties and high-level elements affect perceived VC in visualization images. First, the number of corners and distinct colors are robust metrics across visualizations. Second, feature congestion, an information-theoretic metric capturing statistical patterns in color and texture, is the strongest predictor of perceived complexity in visualizations rich in the same stimuli; edge density effectively explains VC in node-link diagrams. Additionally, we observe a bell-curve effect for text annotations: increasing text-to-ink ratio (TiR) initially reduces complexity, reaching an optimal point, beyond which further text increases perceived complexity. Our quantification pipeline is also interpretable, enabling metric-based explanations, grounded in the VisComplexity2K dataset, bridging computational metrics with human perceptual responses. osf.io/5xe8a has the preregistration and osf.io/bdet6 has the VisComplexity2K dataset, source code, and all appendices and figures.
comment: 9+20 pages, 9+18 figures. Accepted at IEEE VIS 2025
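Illustrative sketch (not the authors' pipeline): two of the image-based metrics named in the abstract above, distinct colors and edge density, can be approximated on a tiny RGB image as follows. The image representation (rows of `(r, g, b)` tuples), the neighbour-difference edge test, and the `threshold` value are assumptions made for this example; the paper's actual implementation, which may rely on a proper edge detector, is not described in the abstract.

```python
def distinct_colors(image):
    """Count the unique (r, g, b) values in an image given as rows of tuples."""
    return len({pixel for row in image for pixel in row})


def edge_density(image, threshold=32):
    """Fraction of pixels whose gray level differs strongly from the pixel
    to the right or below (a crude stand-in for an edge detector)."""
    def gray(pixel):
        r, g, b = pixel
        return (r + g + b) / 3

    height, width = len(image), len(image[0])
    edges = 0
    for y in range(height):
        for x in range(width):
            here = gray(image[y][x])
            # At the image border, compare the pixel with itself (no edge).
            right = gray(image[y][x + 1]) if x + 1 < width else here
            below = gray(image[y + 1][x]) if y + 1 < height else here
            if abs(here - right) > threshold or abs(here - below) > threshold:
                edges += 1
    return edges / (height * width)


# 4x4 test image: left half black, right half white.
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
split_image = [[BLACK, BLACK, WHITE, WHITE] for _ in range(4)]
flat_image = [[BLACK] * 4 for _ in range(4)]
```

In the split image only the column of pixels directly left of the black-to-white boundary counts as edge pixels, so one quarter of the image is marked; the flat image has no edges at all.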
Challenges and Opportunities for Visual Analytics in Jurisprudence
Legal exploration, analysis, and interpretation remain complex and demanding tasks, even for experienced legal scholars, due to the domain-specific language, tacit legal concepts, and intentional ambiguities embedded in legal texts. In related, text-based domains, Visual Analytics (VA) has become an indispensable tool for navigating documents, representing knowledge, and supporting analytical reasoning. However, legal scholarship presents distinct challenges: it requires managing formal legal structure, drawing on tacit domain knowledge, and documenting intricate and accurate reasoning processes - needs that current VA system designs for law fail to address adequately. We identify and describe key challenges and underexplored opportunities in applying VA to law, exploring how these technologies might better serve the legal domain. Interviews with nine legal experts reveal that current legal information retrieval interfaces do not adequately support the navigational complexity of law, often forcing users to rely on internalized legal expertise instead. To address this gap, we identify a three-phase workflow for legal experts, which highlights opportunities for VA to support legal reasoning through knowledge externalization and provenance tracking, leveraging tree-, graph-, and hierarchy-based visualizations. Through this contribution, our work establishes a user-centered VA workflow for the legal domain, recognizing tacit legal knowledge as a critical element of sense-making and insight generation, and situates these contributions within a broader research agenda for VA in law and other text-based disciplines.
comment: 34 pages, 3 figures, 1 table
Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion AAAI
Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce a novel framework for streaming gesture generation that extends Rolling Diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. Our framework is universally compatible with existing diffusion-based gesture generation models, transforming them into streaming methods capable of continuous generation without requiring post-processing. We evaluate our framework on ZEGGS and BEAT, strong benchmarks for real-world applicability. Applied to state-of-the-art baselines on both datasets, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time co-speech gesture synthesis. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that employs a ladder-based noise scheduling strategy to simultaneously denoise multiple frames. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 4x speedup with high visual fidelity and temporal coherence in our experiments. Comprehensive user studies further validate our framework's ability to generate realistic, diverse gestures closely synchronized with the audio input.
comment: Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI-26) Main Track
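Illustrative sketch (a toy model, not the authors' implementation): the progressive noise schedule and the ladder idea from the abstract above can be pictured as a sliding buffer in which every frame carries a count of remaining denoising steps. With `rung=1`, each denoiser call finishes one frame; with a larger `rung`, groups of frames share a noise level and are finished together. The function `simulate_stream`, its parameters, and the bookkeeping are hypothetical, and the sketch counts denoiser calls only, ignoring any effect on motion quality.

```python
from collections import deque


def simulate_stream(total_frames, window, rung=1):
    """Simulate a rolling window; return (emitted frame ids, denoiser calls)."""
    if rung < 1 or window % rung != 0:
        raise ValueError("window must be a positive multiple of rung")
    steps = window // rung  # number of distinct noise levels in the window

    # Warm start: position i in the window needs (i // rung + 1) more steps,
    # so noise increases progressively towards the end of the window.
    buffer = deque([frame_id, frame_id // rung + 1] for frame_id in range(window))
    next_id = window
    emitted, calls = [], 0

    while len(emitted) < total_frames:
        calls += 1
        for entry in buffer:          # one call lowers every frame by one level
            entry[1] -= 1
        while buffer and buffer[0][1] == 0:
            emitted.append(buffer.popleft()[0])   # fully denoised frames leave
        while len(buffer) < window:               # refill with pure-noise frames
            buffer.append([next_id, steps])
            next_id += 1
    return emitted[:total_frames], calls


frames_plain, calls_plain = simulate_stream(16, window=8, rung=1)
frames_ladder, calls_ladder = simulate_stream(16, window=8, rung=2)
```

In this toy setting both runs emit the same frames in order, while the ladder variant needs half as many denoiser calls because each call completes two frames; the speedup reported in the abstract comes from the authors' experiments, not from this sketch.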
MarkupLens: Balancing Computer Vision Assistance and Control in Professional Video Annotation for Video-Based Design Tasks
Video-Based Design (VBD) uses video as a primary medium for analyzing user interactions, prototyping, and generating design insights. However, current VBD workflows are constrained by labor-intensive, inconsistent manual annotations that fragment attention and delay insights. Computer Vision (CV)-powered automatic annotation offers opportunities to reduce manual effort while supporting higher-level interpretation. This paper investigates human-AI collaboration in video analysis by examining how different levels of automated support shape user experience in VBD. We developed MarkupLens, a CV-assisted annotation platform, and conducted a between-subjects eye-tracking study with 36 designers in an urban VBD case. We compared three levels of automation: no support, partial support, and full support, and found that higher levels improved annotation quality, reduced cognitive load, and interestingly, enriched reflection. Our insights on automation levels inform adjustable autonomy and mixed-initiative system design beyond VBD tasks.
IMUFace: Towards Always-On 3D Facial Reconstruction via Earphone Inertial Sensing
The potential of facial expression reconstruction technology is significant, with applications in various fields such as human-computer interaction, affective computing, and virtual reality. Recent studies have proposed using ear-worn devices for facial expression reconstruction to address the environmental limitations and privacy concerns associated with traditional camera-based methods. However, these approaches still require improvements in terms of aesthetics and power consumption. This paper introduces a system called IMUFace. It uses inertial measurement units (IMUs) embedded in wireless earphones to detect subtle ear movements caused by facial muscle activities, allowing for covert and low-power facial reconstruction. A user study involving 12 participants was conducted, and a deep learning model named IMUTwinTrans was proposed. The results show that IMUFace can accurately predict users' facial landmarks with a precision of 2.21 mm, using only five minutes of training data. The predicted landmarks can be utilized to reconstruct a three-dimensional facial model. IMUFace operates at a sampling rate of 30 Hz with a relatively low power consumption of 58 mW. The findings presented in this study demonstrate the real-world applicability of IMUFace and highlight potential directions for further research to facilitate its practical adoption.
Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations
Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.
comment: Under review; 16 pages, 15 figures
Step-Audio-EditX Technical Report
We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
Between Filters and Feeds: Investigating Douyin and WeChat's Influence on Chinese Adolescent Body Image
In the digital era, social media platforms play a pivotal role in shaping adolescents' body image perceptions. This study examines how Douyin and WeChat, two contrasting Chinese social media platforms, influence body image among Chinese male adolescents. Employing a platformization perspective, we surveyed 395 male adolescents aged 10 to 24 using the Multidimensional Body-Self Relations Questionnaire-Appearance Scales (MBSRQ-AS) to assess self-evaluation and body satisfaction. Our findings reveal that Douyin usage is significantly correlated with appearance evaluation and body area satisfaction, while WeChat usage shows no significant correlation with any body image dimensions. These results suggest that Douyin's algorithm-driven, video-centric environment intensifies exposure to idealized body standards, impacting users at a cognitive level. This study underscores the importance of considering platform-specific characteristics in understanding social media's impact on body image. It contributes to the broader discourse on how technological design and content modalities mediate psychological outcomes, offering insights for addressing body image concerns among male adolescents in China.
comment: We are withdrawing this preprint to perform a major revision that upgrades the study from cross-sectional to a two-wave longitudinal design with new data collected in 2025. To avoid compromising the peer-review process of the substantially revised manuscript, we wish to keep it confidential until submission
SweeperBot: Making 3D Browsing Accessible through View Analysis and Visual Question Answering
Accessing 3D models remains challenging for Screen Reader (SR) users. While some existing 3D viewers allow creators to provide alternative text, they often lack sufficient detail about the 3D models. Grounded on a formative study, this paper introduces SweeperBot, a system that enables SR users to leverage visual question answering to explore and compare 3D models. SweeperBot answers SR users' visual questions by combining an optimal view selection technique with the strength of generative- and recognition-based foundation models. An expert review with 10 Blind and Low-Vision (BLV) users with SR experience demonstrated the feasibility of using SweeperBot to assist BLV users in exploring and comparing 3D models. The quality of the descriptions generated by SweeperBot was validated by a second survey study with 30 sighted participants.
comment: 28 pages, 16 figures, this article has been accepted for publication in the International Journal of Human-Computer Interaction (IJHCI), published by Taylor and Francis
MedBuild AI: An Agent-Based Hybrid Intelligence Framework for Reshaping Agency in Healthcare Infrastructure Planning through Generative Design for Medical Architecture
Globally, disparities in healthcare infrastructure remain stark, leaving countless communities without access to even basic services. Traditional infrastructure planning is often slow and inaccessible, and although many architects are actively delivering humanitarian and aid-driven hospital projects worldwide, these vital efforts still fall far short of the sheer scale and urgency of demand. This paper introduces MedBuild AI, a hybrid-intelligence framework that integrates large language models (LLMs) with deterministic expert systems to rebalance the early design and conceptual planning stages. As a web-based platform, it enables any region with satellite internet access to obtain guidance on modular, low-tech, low-cost medical building designs. The system operates through three agents: the first gathers local health intelligence via conversational interaction; the second translates this input into an architectural functional program through rule-based computation; and the third generates layouts and 3D models. By embedding computational negotiation into the design process, MedBuild AI fosters a reciprocal, inclusive, and equitable approach to healthcare planning, empowering communities and redefining agency in global healthcare architecture.
comment: 24 pages, 16 figures. Submitted to the IJAC Special Issue "Rebalance and Reciprocity"
Spot The Ball: A Benchmark for Visual Social Inference
Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models ($\leq$ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
Inference of Human-derived Specifications of Object Placement via Demonstration IJCAI'25
As robots' manipulation capabilities improve for pick-and-place tasks (e.g., object packing, sorting, and kitting), methods focused on understanding human-acceptable object configurations remain limited in their expressiveness for capturing spatial relationships important to humans. To advance robotic understanding of human rules for object arrangement, we introduce positionally-augmented RCC (PARCC), a formal logic framework based on region connection calculus (RCC) for describing the relative position of objects in space. Additionally, we introduce an inference algorithm for learning PARCC specifications via demonstrations. Finally, we present the results from a human study, which demonstrate our framework's ability to capture a human's intended specification and the benefits of learning from demonstration approaches over human-provided specifications.
comment: IJCAI'25
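Illustrative sketch (base RCC only, not PARCC): the region connection calculus that the abstract above builds on distinguishes eight qualitative relations between two regions (RCC8). The snippet below classifies them for axis-aligned rectangles given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`. The rectangle representation and the function name `rcc8` are assumptions for this example; the positional augmentation introduced by the paper is not reproduced here.

```python
def rcc8(a, b):
    """Return the RCC8 relation of rectangle a with respect to rectangle b."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if a == b:
        return "EQ"                      # identical regions

    # Signed overlap extents along each axis (negative means a gap).
    overlap_x = min(ax2, bx2) - max(ax1, bx1)
    overlap_y = min(ay2, by2) - max(ay1, by1)
    if overlap_x < 0 or overlap_y < 0:
        return "DC"                      # disconnected
    if overlap_x == 0 or overlap_y == 0:
        return "EC"                      # externally connected (touching only)

    a_in_b = bx1 <= ax1 and ax2 <= bx2 and by1 <= ay1 and ay2 <= by2
    b_in_a = ax1 <= bx1 and bx2 <= ax2 and ay1 <= by1 and by2 <= ay2
    shares_border = ax1 == bx1 or ax2 == bx2 or ay1 == by1 or ay2 == by2
    if a_in_b:
        return "TPP" if shares_border else "NTPP"    # (non-)tangential proper part
    if b_in_a:
        return "TPPi" if shares_border else "NTPPi"  # inverse relations
    return "PO"                          # partial overlap


relation = rcc8((1, 1, 2, 2), (0, 0, 4, 4))  # small box strictly inside a tray
```

Integer coordinates are used so that the equality tests for touching borders are exact; with floating-point coordinates a tolerance would be needed.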
Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both AIO and FS, FS also often links to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
comment: 18 pages, 10 figures; to appear in AAAI ICWSM 2026
Computer Vision and Pattern Recognition
ARC Is a Vision Problem!
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
comment: Technical Report. Project webpage: https://github.com/lillian039/VARC
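Illustrative sketch (not the VARC code): the "canvas" idea in the abstract above can be pictured as embedding a small ARC grid into a fixed-size image-like array, optionally scaled and shifted, so that a standard vision model sees inputs of one size. The canvas size, the integer `scale`, the `offset`, and the extra background value `10` (distinct from ARC's color indices 0 to 9) are all assumptions chosen for this example.

```python
def place_on_canvas(grid, canvas_size=32, scale=1, offset=(0, 0), background=10):
    """Return a canvas_size x canvas_size array with the grid drawn onto it.

    grid: list of rows of ARC color indices.
    scale: each grid cell becomes a scale x scale block of canvas pixels.
    offset: (top, left) position of the grid's upper-left corner.
    """
    rows, cols = len(grid), len(grid[0])
    top, left = offset
    if (top < 0 or left < 0
            or top + rows * scale > canvas_size
            or left + cols * scale > canvas_size):
        raise ValueError("scaled grid does not fit on the canvas at this offset")

    canvas = [[background] * canvas_size for _ in range(canvas_size)]
    for r in range(rows):
        for c in range(cols):
            for dr in range(scale):
                for dc in range(scale):
                    canvas[top + r * scale + dr][left + c * scale + dc] = grid[r][c]
    return canvas


canvas = place_on_canvas([[1, 2], [3, 4]], canvas_size=8, scale=2, offset=(1, 1))
```

Varying `scale` and `offset` across calls is one simple way to expose a model to translated and resized versions of the same puzzle, which is the kind of visual prior the abstract alludes to.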
UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning
We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves overall scores of 0.89 and 4.31 on GenEval and ImgEdit, respectively, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.
Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
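Illustrative sketch (not the Co-Me implementation): the ranking-and-merging step described in the abstract above can be pictured on a 1D token sequence. Tokens are ranked by a confidence score, the most confident ones are kept unchanged, and runs of adjacent low-confidence tokens are averaged in small groups so that every region of the sequence stays represented. The function `merge_low_confidence`, the `keep_ratio` and `group` parameters, and the adjacency-based grouping are assumptions for this example; the confidence scores would come from the distilled predictor, which is not modeled here.

```python
def merge_low_confidence(tokens, confidence, keep_ratio=0.5, group=2):
    """Return (merged tokens, groups), where groups lists the source indices
    that were averaged into each output token."""
    n = len(tokens)
    n_keep = max(1, round(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: confidence[i], reverse=True)
    keep = set(ranked[:n_keep])          # indices of high-confidence tokens

    merged, groups, pending = [], [], []

    def flush():
        # Average the pending low-confidence tokens into a single token.
        if pending:
            dim = len(tokens[0])
            merged.append([sum(tokens[i][d] for i in pending) / len(pending)
                           for d in range(dim)])
            groups.append(list(pending))
            pending.clear()

    for i in range(n):
        if i in keep:
            flush()                       # never merge across a kept token
            merged.append(list(tokens[i]))
            groups.append([i])
        else:
            pending.append(i)
            if len(pending) == group:
                flush()
    flush()
    return merged, groups


tokens = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]]
confidence = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7]
merged, groups = merge_low_confidence(tokens, confidence)
```

The returned `groups` list records which original positions each merged token stands for, which is what a later un-merging step would need to copy results back to full resolution.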
Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.
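Illustrative sketch (not the authors' training code): the combination of curriculum learning and soft label refinement described in the abstract above could look roughly like this. A one-hot annotator label is blended with a VLM-derived class distribution, and low-reliability samples are admitted gradually over the epochs. The linear blending rule, the linear ramp, the `trust` parameter, and all function names are assumptions for this example; the sketch also assumes the low-reliability list is ordered from least to most ambiguous.

```python
def refine_label(one_hot, vlm_probs, trust):
    """Blend an annotator's one-hot label with a VLM-derived distribution.

    trust in [0, 1]: 0 keeps the original label, 1 uses the VLM estimate.
    """
    return [(1 - trust) * h + trust * p for h, p in zip(one_hot, vlm_probs)]


def curriculum_fraction(epoch, total_epochs):
    """Fraction of low-reliability samples admitted at a given epoch
    (linear ramp from 0 at the first epoch to 1 at the last)."""
    if total_epochs <= 1:
        return 1.0
    return min(1.0, epoch / (total_epochs - 1))


def epoch_samples(high, low, epoch, total_epochs):
    """Training set for one epoch: all high-reliability samples plus the
    least ambiguous share of the low-reliability ones."""
    count = int(len(low) * curriculum_fraction(epoch, total_epochs))
    return high + low[:count]


soft = refine_label([0, 1, 0], [0.2, 0.6, 0.2], trust=0.5)
mid_epoch = epoch_samples(["a", "b"], ["c", "d", "e", "f"], epoch=2, total_epochs=5)
```

Early epochs therefore train only on the high-reliability subset, and ambiguous clips enter later with softened targets that reflect the remaining uncertainty.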
A Neural Field-Based Approach for View Computation & Data Exploration in 3D Urban Environments
Despite the growing availability of 3D urban datasets, extracting insights remains challenging due to computational bottlenecks and the complexity of interacting with data. In fact, the intricate geometry of 3D urban environments results in high degrees of occlusion and requires extensive manual viewpoint adjustments that make large-scale exploration inefficient. To address this, we propose a view-based approach for 3D data exploration, where a vector field encodes views from the environment. To support this approach, we introduce a neural field-based method that constructs an efficient implicit representation of 3D environments. This representation enables both faster direct queries, which consist of the computation of view assessment indices, and inverse queries, which help avoid occlusion and facilitate the search for views that match desired data patterns. Our approach supports key urban analysis tasks such as visibility assessments, solar exposure evaluation, and assessing the visual impact of new developments. We validate our method through quantitative experiments, case studies informed by real-world urban challenges, and feedback from domain experts. Results show its effectiveness in finding desirable viewpoints, analyzing building facade visibility, and evaluating views from outdoor spaces. Code and data are publicly available at https://urbantk.org/neural-3d.
comment: Accepted at IEEE Transactions on Visualization and Computer Graphics. Code and data are publicly available at https://urbantk.org/neural-3d
Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising
We propose an approach to enhancing synthetic video realism, which can re-render synthetic videos from a simulator in photorealistic fashion. Our realism enhancement approach is a zero-shot framework that focuses on preserving the multi-level structures of the synthetic video in the enhanced one, in both spatial and temporal domains, and is built upon a diffusion video foundational model without further fine-tuning. Specifically, we incorporate an effective modification to have the generation/denoising process conditioned on structure-aware information, such as depth maps, semantic maps, and edge maps, estimated from the synthetic video by an auxiliary model rather than extracted from a simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and powerful approach to enhancing synthetic video realism: we show that our approach outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality in our experiments.
comment: Project Page: https://wyf0824.github.io/Video_Realism_Enhancement/
Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.
comment: Tech Report. 10 pages
FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation
The quadratic time and memory complexity of the attention mechanism in modern Transformer-based video generators makes end-to-end training for ultra-high-resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher-resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding window attention mechanism, which originates from a key observation: maintaining each query token's training-scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention often leads to repetitive content and a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field and, therefore, ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid the frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Code is available at: https://github.com/WillWu111/FreeSwim
comment: 13 pages, 8 figures
Seeing Beyond the Image: ECG and Anatomical Knowledge-Guided Myocardial Scar Segmentation from Late Gadolinium-Enhanced Images
Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to "see beyond the image", setting a new direction for robust and physiologically grounded cardiac scar segmentation.
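For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of how it is commonly computed for binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary masks (0/1 values)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 indicates perfect overlap between prediction and reference, and 0.0 indicates no overlap.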
HyMAD: A Hybrid Multi-Activity Detection Approach for Border Surveillance and Monitoring
Seismic sensing has emerged as a promising solution for border surveillance and monitoring; the seismic sensors that are often buried underground are small and cannot be noticed easily, making them difficult for intruders to detect, avoid, or vandalize. This significantly enhances their effectiveness compared to highly visible cameras or fences. However, accurately detecting and distinguishing between overlapping activities that are happening simultaneously, such as human intrusions, animal movements, and vehicle rumbling, remains a major challenge due to the complex and noisy nature of seismic signals. Correctly identifying simultaneous activities is critical because failing to separate them can lead to misclassification, missed detections, and an incomplete understanding of the situation, thereby reducing the reliability of surveillance systems. To tackle this problem, we propose HyMAD (Hybrid Multi-Activity Detection), a deep neural architecture based on spatio-temporal feature fusion. The framework integrates spectral features extracted with SincNet and temporal dependencies modeled by a recurrent neural network (RNN). In addition, HyMAD employs self-attention layers to strengthen intra-modal representations and a cross-modal fusion module to achieve robust multi-label classification of seismic events. We evaluate our approach on a dataset constructed from real-world field recordings collected in the context of border surveillance and monitoring, demonstrating its ability to generalize to complex, simultaneous activity scenarios involving humans, animals, and vehicles. Our method achieves competitive performance and offers a modular framework for extending seismic-based activity recognition in real-world security applications.
comment: Multi-label seismic signal classification using novel attention-based feature fusion. Submitting to cs.CV due to relevance to general pattern recognition and time-frequency (spectrogram) analysis
Attention via Synaptic Plasticity is All You Need: A Biologically Inspired Spiking Neuromorphic Transformer
Attention is the brain's ability to selectively focus on a few specific aspects while ignoring irrelevant ones. This biological principle inspired the attention mechanism in modern Transformers. Transformers now underpin large language models (LLMs) such as GPT, but at the cost of massive training and inference energy, leading to a large carbon footprint. While brain attention emerges from neural circuits, Transformer attention relies on dot-product similarity to weight elements in the input sequence. Neuromorphic computing, especially spiking neural networks (SNNs), offers a brain-inspired path to energy-efficient intelligence. Despite recent work on attention-based spiking Transformers, the core attention layer remains non-neuromorphic. Current spiking attention (i) relies on dot-product or element-wise similarity suited to floating-point operations, not event-driven spikes; (ii) keeps attention matrices that suffer from the von Neumann bottleneck, limiting in-memory computing; and (iii) still diverges from brain-like computation. To address these issues, we propose the Spiking STDP Transformer (S$^{2}$TDPT), a neuromorphic Transformer that implements self-attention through spike-timing-dependent plasticity (STDP), embedding query--key correlations in synaptic weights. STDP, a core mechanism of memory and learning in the brain and widely studied in neuromorphic devices, naturally enables in-memory computing and supports non-von Neumann hardware. On CIFAR-10 and CIFAR-100, our model achieves 94.35\% and 78.08\% accuracy with only four timesteps and 0.49 mJ on CIFAR-100, an 88.47\% energy reduction compared to a standard ANN Transformer. Grad-CAM shows that the model attends to semantically relevant regions, enhancing interpretability. Overall, S$^{2}$TDPT illustrates how biologically inspired attention can yield energy-efficient, hardware-friendly, and explainable neuromorphic models.
comment: 21 Pages, 5 Figures, 3 Tables
Impact of Image Resolution on Age Estimation with DeepFace and InsightFace
Automatic age estimation is widely used for age verification, where input images often vary considerably in resolution. This study evaluates the effect of image resolution on age estimation accuracy using DeepFace and InsightFace. A total of 1000 images from the IMDB-Clean dataset were processed in seven resolutions, resulting in 7000 test samples. Performance was evaluated using Mean Absolute Error (MAE), Standard Deviation (SD), and Median Absolute Error (MedAE). Based on this study, we conclude that input image resolution has a clear and consistent impact on the accuracy of age estimation in both DeepFace and InsightFace. Both frameworks achieve optimal performance at 224x224 pixels, with an MAE of 10.83 years (DeepFace) and 7.46 years (InsightFace). At low resolutions, MAE increases substantially, while very high resolutions also degrade accuracy. InsightFace is consistently faster than DeepFace across all resolutions.
comment: 6 pages, 7 figures, 7 tables. Evaluation of DeepFace and InsightFace age estimation across seven image resolutions (64 to 1080 px)
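The three error metrics named in the abstract above can be sketched as follows. This is an illustrative computation only: the function name `age_error_metrics` is hypothetical, and treating SD as the population standard deviation of the absolute errors is an assumption, since the abstract does not say which quantity the SD is taken over.

```python
import statistics

def age_error_metrics(true_ages, predicted_ages):
    """Return MAE, SD, and MedAE of the absolute age errors, in years."""
    abs_errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return {
        "mae": statistics.fmean(abs_errors),     # Mean Absolute Error
        "sd": statistics.pstdev(abs_errors),     # assumed: SD of absolute errors
        "medae": statistics.median(abs_errors),  # Median Absolute Error
    }

metrics = age_error_metrics([30, 40, 50, 60], [32, 37, 50, 65])
```

MedAE is less sensitive than MAE to a few badly misestimated faces, which is one reason to report both.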
Improving segmentation of retinal arteries and veins using cardiac signal in doppler holograms
Doppler holography is an emerging retinal imaging technique that captures the dynamic behavior of blood flow with high temporal resolution, enabling quantitative assessment of retinal hemodynamics. This requires accurate segmentation of retinal arteries and veins, but traditional segmentation methods focus solely on spatial information and overlook the temporal richness of holographic data. In this work, we propose a simple yet effective approach for artery-vein segmentation in temporal Doppler holograms using standard segmentation architectures. By incorporating features derived from a dedicated pulse analysis pipeline, our method allows conventional U-Nets to exploit temporal dynamics and achieve performance comparable to more complex attention- or iteration-based models. These findings demonstrate that time-resolved preprocessing can unlock the full potential of deep learning for Doppler holography, opening new perspectives for quantitative exploration of retinal hemodynamics. The dataset is publicly available at https://huggingface.co/datasets/DigitalHolography/
comment: 5 pages, 3 figures, 1 table. Submitted to ISBI2026
RepAir: A Framework for Airway Segmentation and Discontinuity Correction in CT
Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM'22, comprising annotated CT scans from predominantly healthy subjects and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.
comment: 4 pages, 3 figures, 1 table. Preprint submitted to SSIAI 2026 Conference on November 17, 2025
SLAM-AGS: Slide-Label Aware Multi-Task Pretraining Using Adaptive Gradient Surgery in Computational Cytology
Computational cytology faces two major challenges: i) instance-level labels are unreliable and prohibitively costly to obtain, ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, yielding stronger performance on downstream tasks. To stabilize learning, we apply Adaptive Gradient Surgery to tackle conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-Score and Top 400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates, showing that resolving gradient interference enables stable pretraining and better performance on downstream tasks. To facilitate reproducibility, we share our complete implementation and evaluation framework as open source: https://github.com/Ace95/SLAM-AGS.
comment: 5 pages, 2 figures, Submitted to ISBI2026
SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction AAAI 2026
Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose SparseSurf, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves state-of-the-art performance.
comment: Accepted at AAAI 2026. Project page: https://miya-oi.github.io/SparseSurf-project
Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities
We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
Fusing Biomechanical and Spatio-Temporal Features for Fall Prediction: Characterizing and Mitigating the Simulation-to-Reality Gap
Falls are a leading cause of injury and loss of independence among older adults. Vision-based fall prediction systems offer a non-invasive solution to anticipate falls seconds before impact, but their development is hindered by the scarcity of available fall data. Contributing to these efforts, this study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines both pose and biomechanical information using a cross-attention fusion mechanism. Our model outperforms the vanilla ST-GCN baseline by 5.32% and 2.91% F1-score on the simulated MCF-UA stunt-actor and MUVIM datasets, respectively. The spatio-temporal attention mechanisms in the ST-GCN stream also provide interpretability by identifying critical joints and temporal phases. However, a critical simulation-reality gap persists. While our model achieves an 89.0% F1-score with full supervision on simulated data, zero-shot generalization to unseen subjects drops to 35.9%. This performance decline is likely due to biases in simulated data, such as `intent-to-fall' cues. For older adults, particularly those with diabetes or frailty, this gap is exacerbated by their unique kinematic profiles. To address this, we propose personalization strategies and advocate for privacy-preserving data pipelines to enable real-world validation. Our findings underscore the urgent need to bridge the gap between simulated and real-world data to develop effective fall prediction systems for vulnerable elderly populations.
3D-Guided Scalable Flow Matching for Generating Volumetric Tissue Spatial Transcriptomics from Serial Histology
A scalable and robust 3D tissue transcriptomics profile can enable a holistic understanding of tissue organization and provide deeper insights into human biology and disease. Most predictive algorithms that infer ST directly from histology treat each section independently and ignore 3D structure, while existing 3D-aware approaches are not generative and do not scale well. We present Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E while explicitly using information from adjacent sections. Our key idea is to retrieve morphologically corresponding spots on neighboring slides in a shared feature space and fuse this cross section context into a lightweight ControlNet, allowing conditioning to follow anatomical continuity. To better capture the count nature of the data, we introduce a 3D-consistent prior for flow matching that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior constructed from neighboring sections. A global attention block introduces 3D H&E scaling linearly with the number of spots in the slide, enabling training and inference on large 3D ST datasets. Across three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea consistently improves 3D expression accuracy and generalization compared to 2D and 3D baselines. We envision HoloTea advancing the creation of accurate 3D virtual tissues, ultimately accelerating biomarker discovery and deepening our understanding of disease.
comment: 11 pages
XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation
Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the R2 score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model's potential in real-world scenarios.
comment: 11 figures, 10 tables, 38 pages. Submitted to Artificial Intelligence in Medicine (currently with editor)
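As a rough illustration of what a weighted Smooth L1 loss looks like, the sketch below applies the standard Smooth L1 (Huber-style) penalty per sample and averages it with per-sample weights. The weighting scheme, the `beta` threshold, and the function names are all assumptions for illustration; the abstract does not specify how the paper derives its weights.

```python
def smooth_l1(diff, beta=1.0):
    """Standard Smooth L1: quadratic near zero, linear for large errors."""
    a = abs(diff)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta

def weighted_smooth_l1(preds, targets, weights, beta=1.0):
    """Weight-normalized mean of per-sample Smooth L1 terms.

    `weights` is hypothetical here: larger values would be assigned to
    samples the loss should prioritize (e.g. rarer BMD ranges).
    """
    total = sum(w * smooth_l1(p - t, beta)
                for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)

loss = weighted_smooth_l1([0.5, 2.0], [0.0, 0.0], [1.0, 3.0])
```

Normalizing by the sum of the weights keeps the loss scale comparable across batches with different weight totals.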
MRI Embeddings Complement Clinical Predictors for Cognitive Decline Modeling in Alzheimer's Disease Cohorts SP
Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
comment: Accepted at SPIE - Medical Imaging Conference 2026
CCSD: Cross-Modal Compositional Self-Distillation for Robust Brain Tumor Segmentation with Missing Modalities
The accurate segmentation of brain tumors from multi-modal MRI is critical for clinical diagnosis and treatment planning. While integrating complementary information from various MRI sequences is a common practice, the frequent absence of one or more modalities in real-world clinical settings poses a significant challenge, severely compromising the performance and generalizability of deep learning-based segmentation models. To address this challenge, we propose a novel Cross-Modal Compositional Self-Distillation (CCSD) framework that can flexibly handle arbitrary combinations of input modalities. CCSD adopts a shared-specific encoder-decoder architecture and incorporates two self-distillation strategies: (i) a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and (ii) a progressive modality combination distillation approach that enhances robustness to missing modalities by simulating gradual modality dropout during training. Extensive experiments on public brain tumor segmentation benchmarks demonstrate that CCSD achieves state-of-the-art performance across various missing-modality scenarios, with strong generalization and stability.
comment: 9 pages, 5 figures
Deep Learning-Based Regional White Matter Hyperintensity Mapping as a Robust Biomarker for Alzheimer's Disease SP
White matter hyperintensities (WMH) are key imaging markers in cognitive aging, Alzheimer's disease (AD), and related dementias. Although automated methods for WMH segmentation have advanced, most provide only global lesion load and overlook their spatial distribution across distinct white matter regions. We propose a deep learning framework for robust WMH segmentation and localization, evaluated across public datasets and an independent Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Our results show that the predicted lesion loads are in line with the reference WMH estimates, confirming the robustness to variations in lesion load, acquisition, and demographics. Beyond accurate segmentation, we quantify WMH load within anatomically defined regions and combine these measures with brain structure volumes to assess diagnostic value. Regional WMH volumes consistently outperform global lesion burden for disease classification, and integration with brain atrophy metrics further improves performance, reaching area under the curve (AUC) values up to 0.97. Several spatially distinct regions, particularly within anterior white matter tracts, are reproducibly associated with diagnostic status, indicating localized vulnerability in AD. These results highlight the added value of regional WMH quantification. Incorporating localized lesion metrics alongside atrophy markers may enhance early diagnosis and stratification in neurodegenerative disorders.
comment: Accepted at SPIE - Medical Imaging Conference 2026
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves 3.42X inference speedup and 1.4X memory reduction over other top-performing counterparts, while maintaining performance with no training.
comment: Code Link: https://github.com/KD-TAO/OmniZip
Explaining Digital Pathology Models via Clustering Activations
We present a clustering-based explainability technique for digital pathology models based on convolutional neural networks. Unlike commonly used methods based on saliency maps, such as occlusion, GradCAM, or relevance propagation, which highlight regions that contribute the most to the prediction for a single slide, our method shows the global behaviour of the model under consideration, while also providing more fine-grained information. The resulting clusters can be visualised not only to understand the model, but also to increase confidence in its operation, leading to faster adoption in clinical practice. We also evaluate the performance of our technique on an existing model for detecting prostate cancer, demonstrating its usefulness.
ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection
Deepfakes generated by advanced GANs and autoencoders severely threaten information integrity and societal stability. Single-stream CNNs fail to capture multi-scale forgery artifacts across spatial, texture, and frequency domains, limiting robustness and generalization. We introduce ForensicFlow, a tri-modal forensic framework that synergistically fuses RGB, texture, and frequency evidence for video Deepfake detection. The RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; the frequency branch (CNN + SE) identifies periodic spectral noise. Attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive attention fusion balances branch contributions. Trained on Celeb-DF (v2) with Focal Loss, ForensicFlow achieves AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines. Ablation validates branch synergy; Grad-CAM confirms forensic focus. This comprehensive feature fusion provides superior resilience against subtle forgeries.
comment: 11 pages, 4 figures, 2 tables. Preprint. Submitted on November 18, 2025
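The idea of attention-based temporal pooling mentioned in the abstract above can be sketched generically: per-frame scores are turned into softmax weights, and the video-level feature is the weighted sum of frame features. This is a minimal sketch under stated assumptions, not the paper's implementation; in a trained model the scores would come from a learned layer, whereas here `frame_scores` is simply passed in, and both function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(frame_features, frame_scores):
    """Pool per-frame feature vectors into one vector using softmax weights.

    frame_features: list of equal-length feature lists, one per frame.
    frame_scores:   one scalar per frame (assumed to be produced elsewhere).
    """
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_features))
            for d in range(dim)]

pooled = attention_pool([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
```

With equal scores the pooling reduces to a plain average over frames; a higher score for one frame shifts the pooled feature toward that frame.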
Interaction-Aware 4D Gaussian Splatting for Dynamic Hand-Object Interaction Reconstruction
This paper focuses on a challenging setting of simultaneously modeling geometry and appearance of hand-object interaction scenes without any object priors. We follow the trend of dynamic 3D Gaussian Splatting based methods, and address several significant challenges. To model complex hand-object interaction with mutual occlusion and edge blur, we present interaction-aware hand-object Gaussians with newly introduced optimizable parameters aiming to adopt piecewise linear hypothesis for clearer structural representation. Moreover, considering the complementarity and tightness of hand shape and object shape during interaction dynamics, we incorporate hand information into object deformation field, constructing interaction-aware dynamic fields to model flexible motions. To further address difficulties in the optimization process, we propose a progressive strategy that handles dynamic regions and static background step by step. Correspondingly, explicit regularizations are designed to stabilize the hand-object representations for smooth motion transition, physical interaction reality, and coherent lighting. Experiments show that our approach surpasses existing dynamic 3D-GS-based methods and achieves state-of-the-art performance in reconstructing dynamic hand-object interaction.
comment: 11 pages, 6 figures
Learning Compact Latent Space for Representing Neural Signed Distance Functions with High-fidelity Geometry Details AAAI
Neural signed distance functions (SDFs) have been a vital representation to represent 3D shapes or scenes with neural networks. An SDF is an implicit function that can query signed distances at specific coordinates for recovering a 3D surface. Although implicit functions work well on a single shape or scene, they pose obstacles when analyzing multiple SDFs with high-fidelity geometry details, due to the limited information encoded in the latent space for SDFs and the loss of geometry details. To overcome these obstacles, we introduce a method to represent multiple SDFs in a common space, aiming to recover more high-fidelity geometry details with more compact latent representations. Our key idea is to take full advantage of the benefits of generalization-based and overfitting-based learning strategies, which manage to preserve high-fidelity geometry details with compact latent codes. Based on this framework, we also introduce a novel sampling strategy to sample training queries. The sampling can improve the training efficiency and eliminate artifacts caused by the influence of other SDFs. We report numerical and visual evaluations on widely used benchmarks to validate our designs and show advantages over the latest methods in terms of the representative ability and compactness.
comment: Accepted as a Poster paper at the AAAI Conference on Artificial Intelligence (AAAI-26)
DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation
Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and Enhancement
Underwater image restoration and enhancement are crucial for correcting color distortion and restoring image details, thereby establishing a fundamental basis for subsequent underwater visual tasks. However, current deep learning methodologies in this area are frequently constrained by the scarcity of high-quality paired datasets. Since it is difficult to obtain pristine reference labels in underwater scenes, existing benchmarks often rely on manually selected results from enhancement algorithms, providing debatable reference images that lack globally consistent color and authentic supervision. This limits the model's capabilities in color restoration, image enhancement, and generalization. To overcome this limitation, we propose using in-air natural images as unambiguous reference targets and translating them into underwater-degraded versions, thereby constructing synthetic datasets that provide authentic supervision signals for model learning. Specifically, we establish a generative data framework based on unpaired image-to-image translation, producing a large-scale dataset that covers 6 representative underwater degradation types. The framework constructs synthetic datasets with precise ground-truth labels, which facilitate the learning of an accurate mapping from degraded underwater images to their pristine scene appearances. Extensive quantitative and qualitative experiments across 6 representative network architectures and 3 independent test sets show that models trained on our synthetic data achieve comparable or superior color restoration and generalization performance to those trained on existing benchmarks. This research provides a reliable and scalable data-driven solution for underwater image restoration and enhancement. The generated dataset is publicly available at: https://github.com/yftian2025/SynUIEDatasets.git.
comment: This work has been submitted to the IEEE for possible publication
D-PerceptCT: Deep Perceptual Enhancement for Low-Dose CT Images
Low Dose Computed Tomography (LDCT) is widely used as an imaging solution to aid diagnosis and other clinical tasks. However, this comes at the price of a deterioration in image quality due to the low dose of radiation used to reduce the risk of secondary cancer development. While some efficient methods have been proposed to enhance LDCT quality, many overestimate noise and perform excessive smoothing, leading to a loss of critical details. In this paper, we introduce D-PerceptCT, a novel architecture inspired by key principles of the Human Visual System (HVS) to enhance LDCT images. The objective is to guide the model to enhance or preserve perceptually relevant features, thereby providing radiologists with CT images where critical anatomical structures and fine pathological details are perceptually visible. D-PerceptCT consists of two main blocks: (1) a Visual Dual-path Extractor (ViDex), which integrates semantic priors from a pretrained DINOv2 model with local spatial features, allowing the network to incorporate semantic-awareness during enhancement; (2) a Global-Local State-Space block that captures long-range information and multiscale features to preserve the important structures and fine details for diagnosis. In addition, we propose a novel deep perceptual loss, designated as the Deep Perceptual Relevancy Loss Function (DPRLF), which is inspired by human contrast sensitivity, to further emphasize perceptually important features. Extensive experiments on the Mayo2016 dataset demonstrate the effectiveness of the D-PerceptCT method for LDCT enhancement, showing better preservation of structural and textural information within LDCT images compared to SOTA methods.
IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
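The parameter reduction quoted in the abstract above can be checked with a few lines of arithmetic. This is only a sanity check on the two parameter counts given in the text; the helper name `relative_reduction` is ours and not part of IMSE.

```python
def relative_reduction(baseline: float, reduced: float) -> float:
    """Relative reduction, in percent, when going from `baseline` to `reduced`."""
    return 100.0 * (baseline - reduced) / baseline


# Parameter counts in millions, as quoted in the abstract (MUSE vs. IMSE).
reduction = relative_reduction(0.513, 0.427)
print(f"{reduction:.1f}%")  # prints "16.8%"
```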
Parameter Aware Mamba Model for Multi-task Dense Prediction
Understanding the inter-relations and interactions between tasks is crucial for multi-task dense prediction. Existing methods predominantly utilize convolutional layers and attention mechanisms to explore task-level interactions. In this work, we introduce a novel decoder-based framework, Parameter Aware Mamba Model (PAMM), specifically designed for dense prediction in the multi-task learning setting. Distinct from approaches that employ Transformers to model holistic task relationships, PAMM leverages the rich, scalable parameters of state space models to enhance task interconnectivity. It features dual state space parameter experts that integrate and set task-specific parameter priors, capturing the intrinsic properties of each task. This approach not only facilitates precise multi-task interactions but also allows for the global integration of task priors through the structured state space sequence model (S4). Furthermore, we employ the Multi-Directional Hilbert Scanning method to construct multi-angle feature sequences, thereby enhancing the sequence model's perceptual capabilities for 2D data. Extensive experiments on the NYUD-v2 and PASCAL-Context benchmarks demonstrate the effectiveness of our proposed method. Our code is available at https://github.com/CQC-gogopro/PAMM.
comment: Accepted to IEEE Transactions on Cybernetics
Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM
Autonomous driving (AD) systems have exhibited remarkable performance in complex driving scenarios. However, generalization, i.e., the ability to handle unseen scenarios or unfamiliar sensor configurations, remains a key limitation of current systems. Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions from the VLM directly. However, these end-to-end solutions have prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential for navigating complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.
Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation
Delineating farm boundaries through segmentation of satellite images is a fundamental step in many agricultural applications. The task is particularly challenging for smallholder farms, where accurate delineation requires the use of high resolution (HR) imagery, which is available only at low revisit frequencies (e.g., annually). To support more frequent (sub-) seasonal monitoring, HR images could be combined as references (ref) with low resolution (LR) images -- having higher revisit frequency (e.g., weekly) -- using reference-based super-resolution (Ref-SR) methods. However, current Ref-SR methods optimize perceptual quality and smooth over crucial features needed for downstream tasks, and are unable to meet the large scale-factor requirements for this task. Further, previous two-step approaches of SR followed by segmentation do not effectively utilize diverse satellite sources as inputs. We address these problems through a new approach, $\textbf{SEED-SR}$, which uses a combination of conditional latent diffusion models and large-scale multi-spectral, multi-source geo-spatial foundation models. Our key innovation is to bypass the explicit SR task in the pixel space and instead perform SR in a segmentation-aware latent space. This unique approach enables us to generate segmentation maps at an unprecedented 20$\times$ scale factor, and rigorous experiments on two large, real datasets demonstrate up to $\textbf{25.5}$ and $\textbf{12.9}$ relative improvement in instance and semantic segmentation metrics respectively over approaches based on state-of-the-art Ref-SR methods.
2D Gaussians Spatial Transport for Point-supervised Density Regression AAAI
This paper introduces Gaussian Spatial Transport (GST), a novel framework that leverages Gaussian splatting to facilitate transport from the probability measure in the image coordinate space to the annotation map. We propose a Gaussian splatting-based method to estimate pixel-annotation correspondence, which is then used to compute a transport plan derived from Bayesian probability. To integrate the resulting transport plan into standard network optimization in typical computer vision tasks, we derive a loss function that measures discrepancy after transport. Extensive experiments on representative computer vision tasks, including crowd counting and landmark detection, validate the effectiveness of our approach. Compared to conventional optimal transport schemes, GST eliminates iterative transport plan computation during training, significantly improving efficiency. Code is available at https://github.com/infinite0522/GST.
comment: 9 pages, 5 figures, accepted by AAAI, 2026
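One way to picture a non-iterative, Bayesian pixel-to-annotation plan of the kind the GST abstract above describes is a posterior over annotation points for each pixel. The sketch below is an illustration only, not the paper's formulation: the isotropic Gaussians, the equal priors, the shared `sigma`, and the function name `gaussian_posterior_plan` are all assumptions made for the example.

```python
import math


def gaussian_posterior_plan(pixels, centers, sigma=1.0):
    """Return plan[i][j] = P(annotation j | pixel i) under assumed isotropic
    2D Gaussians centred on the annotation points with equal priors."""
    plan = []
    for px, py in pixels:
        likelihoods = [
            math.exp(-((px - cx) ** 2 + (py - cy) ** 2) / (2.0 * sigma ** 2))
            for cx, cy in centers
        ]
        total = sum(likelihoods)
        if total == 0.0:
            # Pixel is numerically far from every annotation: fall back to uniform.
            plan.append([1.0 / len(centers)] * len(centers))
        else:
            plan.append([value / total for value in likelihoods])
    return plan


# Two pixels and two annotation points; each row of the plan sums to 1.
plan = gaussian_posterior_plan([(0.0, 0.0), (4.0, 0.0)], [(0.0, 0.0), (4.0, 0.0)])
```

Because each row is obtained in closed form, no iterative solver is needed, which is the property the abstract contrasts with conventional optimal transport schemes.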
Learning Subglacial Bed Topography from Sparse Radar with Physics-Guided Residuals
Accurate subglacial bed topography is essential for ice sheet modeling, yet radar observations are sparse and uneven. We propose a physics-guided residual learning framework that predicts bed thickness residuals over a BedMachine prior and reconstructs bed from the observed surface. A DeepLabV3+ decoder over a standard encoder (e.g., ResNet-50) is trained with lightweight physics and data terms: multi-scale mass conservation, flow-aligned total variation, Laplacian damping, non-negativity of thickness, a ramped prior-consistency term, and a masked Huber fit to radar picks modulated by a confidence map. To measure real-world generalization, we adopt leakage-safe blockwise hold-outs (vertical/horizontal) with safety buffers and report metrics only on held-out cores. Across two Greenland sub-regions, our approach achieves strong test-core accuracy and high structural fidelity, outperforming U-Net, Attention U-Net, FPN, and a plain CNN. The residual-over-prior design, combined with physics, yields spatially coherent, physically plausible beds suitable for operational mapping under domain shift.
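Among the loss terms listed in the abstract above, the masked Huber fit modulated by a confidence map is the easiest to sketch. The version below is a minimal illustration on flat lists; the normalisation by the sum of weights, the default `delta`, and the function name `masked_huber` are assumptions, not details taken from the paper.

```python
def masked_huber(pred, target, mask, conf, delta=1.0):
    """Confidence-weighted Huber loss over the positions where mask == 1.

    pred, target: predicted and observed thickness values
    mask:         1 where a radar pick exists, 0 elsewhere
    conf:         per-position confidence in [0, 1]
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, t, m, c in zip(pred, target, mask, conf):
        weight = m * c
        residual = abs(p - t)
        if residual <= delta:
            loss = 0.5 * residual * residual            # quadratic near zero
        else:
            loss = delta * (residual - 0.5 * delta)     # linear in the tails
        weighted_sum += weight * loss
        weight_total += weight
    # Assumed convention: average over the total weight; zero if nothing is observed.
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

Positions without a radar pick contribute nothing, and low-confidence picks pull on the fit less than high-confidence ones.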
CompEvent: Complex-valued Event-RGB Fusion for Low-light Video Enhancement and Deblurring
Low-light video deblurring poses significant challenges in applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. While event cameras offer potential solutions with superior low-light sensitivity and high temporal resolution, existing fusion methods typically employ staged strategies, limiting their effectiveness against combined low-light and motion blur degradations. To overcome this, we propose CompEvent, a complex neural network framework enabling holistic full-process fusion of event data and RGB frames for enhanced joint restoration. CompEvent features two core components: 1) Complex Temporal Alignment GRU, which utilizes complex-valued convolutions and processes video and event streams iteratively via GRU to achieve temporal alignment and continuous fusion; and 2) Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in both spatial and frequency domains, facilitating deep fusion through spatial structures and system-level characteristics. By leveraging the holistic representation capability of complex-valued neural networks, CompEvent achieves full-process spatiotemporal fusion, maximizes complementary learning between modalities, and significantly strengthens low-light video deblurring capability. Extensive experiments demonstrate that CompEvent outperforms SOTA methods in addressing this challenging task. The code is available at https://github.com/YuXie1/CompEvent.
DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval
This paper addresses the task of interactive, conversational text-to-image retrieval. Our DIR-TIR framework progressively refines the target image search through two specialized modules: the Dialog Refiner Module and the Image Refiner Module. The Dialog Refiner actively queries users to extract essential information and generate increasingly precise descriptions of the target image. Complementarily, the Image Refiner identifies perceptual gaps between generated images and user intentions, strategically reducing the visual-semantic discrepancy. By leveraging multi-turn dialogues, DIR-TIR provides superior controllability and fault tolerance compared to conventional single-query methods, significantly improving target image hit accuracy. Comprehensive experiments across diverse image datasets demonstrate our dialogue-based approach substantially outperforms initial-description-only baselines, while the synergistic module integration achieves both higher retrieval precision and enhanced interactive experience.
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
Cranio-ID: Graph-Based Craniofacial Identification via Automatic Landmark Annotation in 2D Multi-View X-rays
In forensic craniofacial identification and in many biomedical applications, craniometric landmarks are important. Traditional methods for locating landmarks are time-consuming and require specialized knowledge and expertise. Current approaches rely on superimposition and on deep learning-based methods that annotate landmarks automatically. However, these methods are not reliable due to insufficient large-scale validation studies. In this paper, we propose Cranio-ID, a novel two-stage framework. First, landmarks are automatically annotated on 2D skulls (X-ray scans of faces) and on their respective optical images using our trained YOLO-pose models. Second, cross-modal matching is performed by formulating these landmarks as graph representations and then finding semantic correspondence between the graphs of the two modalities using cross-attention and an optimal transport framework. Our proposed framework is validated on the S2F and CUHK datasets (the CUHK dataset resembles the S2F dataset). Extensive experiments demonstrate significant improvements in both reliability and accuracy, as well as effectiveness in cross-domain skull-to-face and sketch-to-face matching in forensic science.
comment: 11 pages, 6 figures
Language as an Anchor: Preserving Relative Visual Geometry for Domain Incremental Learning
A key challenge in Domain Incremental Learning (DIL) is to continually learn under shifting distributions while preserving knowledge from previous domains. Existing methods face a fundamental dilemma. On one hand, projecting all domains into a single unified visual space leads to inter-domain interference and semantic distortion, as large shifts may vary with not only visual appearance but also underlying semantics. On the other hand, isolating domain-specific parameters causes knowledge fragmentation, creating "knowledge islands" that hamper knowledge reuse and exacerbate forgetting. To address this issue, we propose LAVA (Language-Anchored Visual Alignment), a novel DIL framework that replaces direct feature alignment with relative alignment driven by a text-based reference anchor. LAVA guides the visual representations of each incoming domain to preserve a consistent relative geometry, which is defined by mirroring the pairwise semantic similarities between the class names. This anchored geometric structure acts as a bridge across domains, enabling the retrieval of class-aware prior knowledge and facilitating robust feature aggregation. Extensive experiments on standard DIL benchmarks demonstrate that LAVA achieves significant performance improvements over state-of-the-art methods. Code is available at https://github.com/ShuyiGeng/LAVA.
Stage Aware Diagnosis of Diabetic Retinopathy via Ordinal Regression
Diabetic Retinopathy (DR) has emerged as a major cause of preventable blindness in recent times. With timely screening and intervention, the condition can be prevented from causing irreversible damage. This work introduces a state-of-the-art Ordinal Regression-based DR Detection framework that uses the APTOS-2019 fundus image dataset. A widely accepted combination of preprocessing methods (Green Channel (GC) Extraction, Noise Masking, and CLAHE) was used to isolate the most relevant features for DR classification. Model performance was evaluated using the Quadratic Weighted Kappa (QWK), with a focus on agreement between results and clinical grading. Our Ordinal Regression approach attained a QWK score of 0.8992, setting a new benchmark on the APTOS dataset.
comment: Submitted to Confluence 2026, Amity University
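The Quadratic Weighted Kappa used as the evaluation metric in the abstract above follows a standard definition, sketched here in plain Python. The function name and the toy labels are illustrative; five grades (0 to 4) are assumed in the usage lines, and nothing here reproduces the paper's evaluation code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Quadratic Weighted Kappa between two integer label sequences in [0, n_classes)."""
    n = len(rater_a)
    # Observed confusion matrix: observed[i][j] counts (a == i, b == j).
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2   # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / n           # count expected under independence
            numerator += weight * observed[i][j]
            denominator += weight * expected
    # If no disagreement is expected at all, treat the raters as in full agreement.
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


# Toy usage with five severity grades (0-4).
clinical = [0, 1, 2, 3, 4, 2, 0]
predicted = [0, 1, 2, 4, 4, 1, 0]
score = quadratic_weighted_kappa(clinical, predicted, n_classes=5)
```

Because the penalty grows with the squared distance between grades, confusing grade 0 with grade 4 costs far more than confusing adjacent grades, which is why the metric suits ordinal severity labels.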
Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning AAAI 2026
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
comment: Accepted at AAAI 2026, the Project website is available at https://qhemu.github.io/CCoL/
BEDLAM2.0: Synthetic Humans and Cameras in Motion NeurIPS 2025
Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the-art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.
comment: NeurIPS 2025 (Datasets and Benchmarks track, oral). Project website: https://bedlam2.is.tue.mpg.de
Enhancing LLM-based Autonomous Driving with Modular Traffic Light and Sign Recognition
Large Language Models (LLMs) are increasingly used for decision-making and planning in autonomous driving, showing promising reasoning capabilities and potential to generalize across diverse traffic situations. However, current LLM-based driving agents lack explicit mechanisms to enforce traffic rules and often struggle to reliably detect small, safety-critical objects such as traffic lights and signs. To address this limitation, we introduce TLS-Assist, a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition. TLS-Assist converts detections into structured natural language messages that are injected into the LLM input, enforcing explicit attention to safety-critical cues. The framework is plug-and-play, model-agnostic, and supports both single-view and multi-view camera setups. We evaluate TLS-Assist in a closed-loop setup on the LangAuto benchmark in CARLA. The results demonstrate relative driving performance improvements of up to 14% over LMDrive and 7% over BEVDriver, while consistently reducing traffic light and sign infractions. We publicly release the code and models on https://github.com/iis-esslingen/TLS-Assist.
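The step described in the abstract above, turning detections into structured natural language that is injected into the LLM input, might look roughly like the sketch below. Every name here (`Detection`, `detections_to_message`, `build_prompt`, the `[Safety cues]` tag, and the message wording) is hypothetical and chosen for illustration; the actual TLS-Assist message format is not given in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical container for one traffic light or sign detection."""
    kind: str          # "traffic_light" or "sign"
    state: str         # e.g. "red", "green", "stop", "speed_limit_30"
    distance_m: float  # estimated distance to the object in metres
    camera: str = "front"


def detections_to_message(detections: List[Detection]) -> str:
    """Render detections as short sentences, nearest object first."""
    if not detections:
        return "No traffic lights or signs detected."
    parts = []
    for det in sorted(detections, key=lambda d: d.distance_m):
        label = "Traffic light" if det.kind == "traffic_light" else "Traffic sign"
        parts.append(
            f"{label} ({det.state}) about {det.distance_m:.0f} m ahead, {det.camera} camera."
        )
    return " ".join(parts)


def build_prompt(instruction: str, detections: List[Detection]) -> str:
    """Append the safety message to the navigation instruction given to the LLM."""
    return f"{instruction}\n[Safety cues] {detections_to_message(detections)}"


prompt = build_prompt(
    "Continue straight and turn left at the next intersection.",
    [Detection("sign", "speed_limit_30", 40.0), Detection("traffic_light", "red", 18.0)],
)
```

Keeping the detector and the message formatting outside the LLM is what would make such a layer plug-and-play: the driving agent only sees extra text in its input.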
Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving
Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It offers significantly greater stealth and lethality than existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.
A Quantitative Method for Shoulder Presentation Evaluation in Biometric Identity Documents
International standards for biometric identity documents mandate strict compliance with pose requirements, including the square presentation of a subject's shoulders. However, the literature on automated quality assessment offers few quantitative methods for evaluating this specific attribute. This paper proposes a Shoulder Presentation Evaluation (SPE) algorithm to address this gap. The method quantifies shoulder yaw and roll using only the 3D coordinates of two shoulder landmarks provided by common pose estimation frameworks. The algorithm was evaluated on a dataset of 121 portrait images. The resulting SPE scores demonstrated a strong Pearson correlation (r approx. 0.80) with human-assigned labels. An analysis of the metric's filtering performance, using an adapted Error-versus-Discard methodology, confirmed its utility in identifying non-compliant samples. The proposed algorithm is a viable lightweight tool for automated compliance checking in enrolment systems.
comment: 13 pages, 4 figures, conference or journal submission. Course project from DTU Compute, Technical University of Denmark
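To make the idea in the abstract above concrete, yaw and roll can be read off the vector joining two 3D shoulder landmarks. The sketch below is one plausible geometric reading, not the SPE algorithm itself: the axis convention (x to the right in the image, y vertical, z along the camera axis), the use of `atan2` against the horizontal extent, and the function name `shoulder_angles` are all assumptions.

```python
import math


def shoulder_angles(left, right):
    """Return (yaw_deg, roll_deg) from two 3D shoulder landmarks (x, y, z).

    Assumed convention: x is horizontal in the image, y is vertical,
    z points along the camera axis. Square, level shoulders give (0, 0).
    """
    dx = left[0] - right[0]
    dy = left[1] - right[1]
    dz = left[2] - right[2]
    width = abs(dx)  # horizontal shoulder extent as seen by the camera
    yaw = math.degrees(math.atan2(dz, width))   # one shoulder closer to the camera
    roll = math.degrees(math.atan2(dy, width))  # one shoulder higher than the other
    return yaw, roll


# Level shoulders facing the camera, then a subject turned by roughly 45 degrees.
square = shoulder_angles((0.2, 0.0, 0.0), (-0.2, 0.0, 0.0))
turned = shoulder_angles((0.2, 0.0, 0.4), (-0.2, 0.0, 0.0))
```

A compliance score could then be derived by thresholding or combining the two angles; how the paper maps angles to SPE scores is not stated in the abstract.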
Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection AAAI 2026
Infrared unmanned aerial vehicle (UAV) target images often suffer from motion blur degradation caused by rapid sensor movement, significantly reducing contrast between target and background. Generally, detection performance heavily depends on the discriminative feature representation between target and background. Existing methods typically treat deblurring as a preprocessing step focused on visual quality, while neglecting the enhancement of task-relevant features crucial for detection. Improving feature representation for detection under blur conditions remains challenging. In this paper, we propose a novel Joint Feature-Domain Deblurring and Detection end-to-end framework, dubbed JFD3. We design a dual-branch architecture with shared weights, where the clear branch guides the blurred branch to enhance discriminative feature representation. Specifically, we first introduce a lightweight feature restoration network, where features from the clear branch serve as feature-level supervision to guide the blurred branch, thereby enhancing its distinctive capability for detection. We then propose a frequency structure guidance module that refines the structure prior from the restoration network and integrates it into shallow detection layers to enrich target structural information. Finally, a feature consistency self-supervised loss is imposed between the dual-branch detection backbones, driving the blurred branch to approximate the feature representations of the clear one. We also construct a benchmark, named IRBlurUAV, containing 30,000 simulated and 4,118 real infrared UAV target images with diverse motion blur. Extensive experiments on IRBlurUAV demonstrate that JFD3 achieves superior detection performance while maintaining real-time efficiency.
comment: Accepted by AAAI 2026
O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model AAAI 2026
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and (d) visual question answering (VQA), incorporating the three existing sketch datasets (QuickDraw!, Sketchy, and Tu Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
comment: Accepted to AAAI 2026
Clinically-Validated Innovative Mobile Application for Assessing Blinking and Eyelid Movements
Blinking is a vital physiological process that protects and maintains the health of the ocular surface. Objective assessment of eyelid movements remains challenging due to the complexity, cost, and limited clinical applicability of existing tools. This study presents the clinical validation of Bapp (Blink Application), a mobile application developed using the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements. The validation occurred using 45 videos from real patients, whose blinks were manually annotated by ophthalmology specialists from the Paulista School of Medicine of the Federal University of Sao Paulo (EPM-UNIFESP) to serve as the ground truth. Bapp's performance was evaluated using standard metrics, including Precision, Recall, and F1-Score, with results demonstrating 98.4% precision, 96.9% recall, and an overall accuracy of 98.3%. These outcomes confirm the reliability of Bapp as a portable, accessible, and objective tool for monitoring both normal and abnormal eyelid movements. The application offers a promising alternative to traditional manual blink counting, supporting continuous ocular health monitoring and postoperative evaluation in clinical environments.
comment: 14 pages, 8 figures
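The Bapp abstract above reports precision, recall, and F1-score against specialist annotations. As a minimal sketch of how these standard metrics follow from detection counts, the snippet below uses a hypothetical function name and made-up counts; neither is taken from the study.

```python
def blink_detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative blink counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not the study's data).
metrics = blink_detection_metrics(tp=90, fp=10, fn=30)
print(metrics)
```

Accuracy additionally requires true negatives, which for blink detection depends on how non-blink intervals are counted; the abstract does not specify this, so it is left out of the sketch.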
IBGS: Image-Based Gaussian Splatting NeurIPS 2025
3D Gaussian Splatting (3DGS) has recently emerged as a fast, high-quality method for novel view synthesis (NVS). However, its use of low-degree spherical harmonics limits its ability to capture spatially varying color and view-dependent effects such as specular highlights. Existing works augment Gaussians with either a global texture map, which struggles with complex scenes, or per-Gaussian texture maps, which introduce high storage overhead. We propose Image-Based Gaussian Splatting, an efficient alternative that leverages high-resolution source images for fine details and view-specific color modeling. Specifically, we model each pixel color as a combination of a base color from standard 3DGS rendering and a learned residual inferred from neighboring training images. This promotes accurate surface alignment and enables rendering images with high-frequency details and accurate view-dependent effects. Experiments on standard NVS benchmarks show that our method significantly outperforms prior Gaussian Splatting approaches in rendering quality, without increasing the storage footprint.
comment: Accepted to NeurIPS 2025
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on million-scale long video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene texts, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter shows excellent transferability, improving the state-of-the-art on downstream tasks like dense video captioning on YouCook2.
comment: Project Page: https://arcchapter.github.io/index_en.html
Silhouette-to-Contour Registration: Aligning Intraoral Scan Models with Cephalometric Radiographs
Reliable 3D-2D alignment between intraoral scan (IOS) models and lateral cephalometric radiographs is critical for orthodontic diagnosis, yet conventional intensity-driven registration methods struggle under real clinical conditions, where cephalograms exhibit projective magnification, geometric distortion, low-contrast dental crowns, and acquisition-dependent variation. These factors hinder the stability of appearance-based similarity metrics and often lead to convergence failures or anatomically implausible alignments. To address these limitations, we propose DentalSCR, a pose-stable, contour-guided framework for accurate and interpretable silhouette-to-contour registration. Our method first constructs a U-Midline Dental Axis (UMDA) to establish a unified cross-arch anatomical coordinate system, thereby stabilizing initialization and standardizing projection geometry across cases. Using this reference frame, we generate radiograph-like projections via a surface-based DRR formulation with coronal-axis perspective and Gaussian splatting, which preserves clinical source-object-detector magnification and emphasizes external silhouettes. Registration is then formulated as a 2D similarity transform optimized with a symmetric bidirectional Chamfer distance under a hierarchical coarse-to-fine schedule, enabling both large capture range and subpixel-level contour agreement. We evaluate DentalSCR on 34 expert-annotated clinical cases. Experimental results demonstrate substantial reductions in landmark error (particularly at posterior teeth), tighter dispersion on the lower jaw, and low Chamfer and controlled Hausdorff distances at the curve level. These findings indicate that DentalSCR robustly handles real-world cephalograms and delivers high-fidelity, clinically inspectable 3D-2D alignment, outperforming conventional baselines.
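The DentalSCR abstract optimizes a symmetric bidirectional Chamfer distance between projected silhouettes and radiograph contours. The sketch below shows one common form of that distance on 2D point sets: the average of the two directed mean nearest-neighbour distances. The function names are hypothetical, and the paper may use a different variant (for example squared distances or sums instead of means).

```python
import math


def _mean_nearest(src, dst):
    """Mean distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def symmetric_chamfer(a, b):
    """Symmetric (bidirectional) Chamfer distance between two non-empty
    2D point sets, averaging the a->b and b->a directed terms."""
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


# Toy point sets: two parallel segments one unit apart.
silhouette = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
contour = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(symmetric_chamfer(silhouette, contour))
```

The brute-force nearest-neighbour search is quadratic in the number of points; a spatial index or a precomputed distance transform would be the usual choice for dense contours.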
Going Places: Place Recognition in Artificial and Natural Systems
Place recognition, the ability to identify previously visited locations, is critical for both biological navigation and autonomous systems. This review synthesizes findings from robotic systems, animal studies, and human research to explore how different systems encode and recall place. We examine the computational and representational strategies employed across artificial systems, animals, and humans, highlighting convergent solutions such as topological mapping, cue integration, and memory management. Animal systems reveal evolved mechanisms for multimodal navigation and environmental adaptation, while human studies provide unique insights into semantic place concepts, cultural influences, and introspective capabilities. Artificial systems showcase scalable architectures and data-driven models. We propose a unifying set of concepts by which to consider and develop place recognition mechanisms and identify key challenges such as generalization, robustness, and environmental variability. This review aims to foster innovations in artificial localization by connecting future developments in artificial place recognition systems to insights from both animal navigation research and human spatial cognition studies.
ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding
A structured understanding of intraoral 3D scans is essential for digital orthodontics. However, existing deep-learning approaches rely heavily on modality-specific training, large annotated datasets, and controlled scanning conditions, which limit generalization across devices and hinder deployment in real clinical workflows. Moreover, raw intraoral meshes exhibit substantial variation in arch pose, incomplete geometry caused by occlusion or tooth contact, and a lack of texture cues, making unified semantic interpretation highly challenging. To address these limitations, we propose ArchMap, a training-free and knowledge-guided framework for robust structured dental understanding. ArchMap first introduces a geometry-aware arch-flattening module that standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. We then construct a Dental Knowledge Base (DKB) encoding hierarchical tooth ontology, dentition-stage policies, and clinical semantics to constrain the symbolic reasoning space. We validate ArchMap on 1060 pre-/post-orthodontic cases, demonstrating robust performance in tooth counting, anatomical partitioning, dentition-stage classification, and the identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries. Compared with supervised pipelines and prompted VLM baselines, ArchMap achieves higher accuracy, reduced semantic drift, and superior stability under sparse or artifact-prone conditions. As a fully training-free system, ArchMap demonstrates that combining geometric normalization with ontology-guided multimodal reasoning offers a practical and scalable solution for the structured analysis of 3D intraoral scans in modern digital orthodontics.
Step by Step Network
Scaling up network depth is a fundamental pursuit in neural architecture design, as theory suggests that deeper models offer exponentially greater capability. Benefiting from the residual connections, modern neural networks can scale up to more than one hundred layers and enjoy wide success. However, as networks continue to deepen, current architectures often struggle to realize their theoretical capacity improvements, calling for more advanced designs to further unleash the potential of deeper networks. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. Shortcut degradation hinders deep-layer learning, while the inherent depth-width trade-off imposes limited width. To mitigate these issues, we propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance of deep models. Specifically, we separate features along the channel dimension and let the model learn progressively via stacking blocks with increasing width. The resulting method mitigates the two identified problems and serves as a versatile macro design applicable to various models. Extensive experiments show that our method consistently outperforms residual models across diverse tasks, including image classification, object detection, semantic segmentation, and language modeling. These results position StepsNet as a superior generalization of the widely adopted residual architecture.
LSP-YOLO: A Lightweight Single-Stage Network for Sitting Posture Recognition on Embedded Devices
With the rise in sedentary behavior, health problems caused by poor sitting posture have drawn increasing attention. Most existing methods, whether using invasive sensors or computer vision, rely on two-stage pipelines, which result in high intrusiveness, intensive computation, and poor real-time performance on embedded edge devices. Inspired by YOLOv11-Pose, a lightweight single-stage network for sitting posture recognition on embedded edge devices, termed LSP-YOLO, was proposed. By integrating partial convolution (PConv) and the Similarity-Aware Activation Module (SimAM), a lightweight module, Light-C3k2, was designed to reduce computational cost while maintaining feature extraction capability. In the recognition head, keypoints were directly mapped to posture classes through pointwise convolution, and intermediate supervision was employed to enable efficient fusion of pose estimation and classification. Furthermore, a dataset containing 5,000 images across six posture categories was constructed for model training and testing. The smallest trained model, LSP-YOLO-n, achieved 94.2% accuracy and 251 FPS on a personal computer (PC) with a model size of only 1.9 MB. Meanwhile, real-time and high-accuracy inference under constrained computational resources was demonstrated on the SV830C + GC030A platform. The proposed approach is characterized by high efficiency, lightweight design, and deployability, making it suitable for smart classrooms, rehabilitation, and human-computer interaction applications.
comment: Submitted to Engineering Applications of Artificial Intelligence (EAAI)
Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs
Intraoral 3D reconstruction is fundamental to digital orthodontics, yet conventional methods like intraoral scanning are inaccessible for remote tele-orthodontics, which typically relies on sparse smartphone imagery. While 3D Gaussian Splatting (3DGS) shows promise for novel view synthesis, its application to the standard clinical triad of unposed anterior and bilateral buccal photographs is challenging. The large view baselines, inconsistent illumination, and specular surfaces common in intraoral settings can destabilize simultaneous pose and geometry estimation. Furthermore, sparse-view photometric supervision often induces a frequency bias, leading to over-smoothed reconstructions that lose critical diagnostic details. To address these limitations, we propose Dental3R, a pose-free, graph-guided pipeline for robust, high-fidelity reconstruction from sparse intraoral photographs. Our method first constructs a Geometry-Aware Pairing Strategy (GAPS) to intelligently select a compact subgraph of high-value image pairs. The GAPS focuses on correspondence matching, thereby improving the stability of the geometry initialization and reducing memory usage. Building on the recovered poses and point cloud, we train the 3DGS model with a wavelet-regularized objective. By enforcing band-limited fidelity using a discrete wavelet transform, our approach preserves fine enamel boundaries and interproximal edges while suppressing high-frequency artifacts. We validate our approach on a large-scale dataset of 950 clinical cases and an additional video-based test set of 195 cases. Experimental results demonstrate that Dental3R effectively handles sparse, unposed inputs and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art methods.
Iterative Diffusion-Refined Neural Attenuation Fields for Multi-Source Stationary CT Reconstruction: NAF Meets Diffusion Model
Multi-source stationary computed tomography (CT) has recently attracted attention for its ability to achieve rapid image reconstruction, making it suitable for time-sensitive clinical and industrial applications. However, practical systems are often constrained by ultra-sparse-view sampling, which significantly degrades reconstruction quality. Traditional methods struggle under ultra-sparse-view settings, where interpolation becomes inaccurate and the resulting reconstructions are unsatisfactory. To address this challenge, this study proposes Diffusion-Refined Neural Attenuation Fields (Diff-NAF), an iterative framework tailored for multi-source stationary CT under ultra-sparse-view conditions. Diff-NAF combines a Neural Attenuation Field representation with a dual-branch conditional diffusion model. The process begins by training an initial NAF using ultra-sparse-view projections. New projections are then generated through an Angle-Prior Guided Projection Synthesis strategy that exploits inter-view priors, and are subsequently refined by a Diffusion-driven Reuse Projection Refinement Module. The refined projections are incorporated as pseudo-labels into the training set for the next iteration. Through iterative refinement, Diff-NAF progressively enhances projection completeness and reconstruction fidelity under ultra-sparse-view conditions, ultimately yielding high-quality CT reconstructions. Experimental results on multiple simulated 3D CT volumes and real projection data demonstrate that Diff-NAF achieves the best performance under ultra-sparse-view conditions.
SAM-Fed: SAM-Guided Federated Semi-Supervised Learning for Medical Image Segmentation
Medical image segmentation is clinically important, yet data privacy and the cost of expert annotation limit the availability of labeled data. Federated semi-supervised learning (FSSL) offers a solution but faces two challenges: pseudo-label reliability depends on the strength of local models, and client devices often require compact or heterogeneous architectures due to limited computational resources. These constraints reduce the quality and stability of pseudo-labels, while large models, though more accurate, cannot be trained or used for routine inference on client devices. We propose SAM-Fed, a federated semi-supervised framework that leverages a high-capacity segmentation foundation model to guide lightweight clients during training. SAM-Fed combines dual knowledge distillation with an adaptive agreement mechanism to refine pixel-level supervision. Experiments on skin lesion and polyp segmentation across homogeneous and heterogeneous settings show that SAM-Fed consistently outperforms state-of-the-art FSSL methods.
GEN3D: Generating Domain-Free 3D Scenes from a Single Image
Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high-quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generating high-quality, wide-scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized by optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and synthesizing high-fidelity and consistent novel views.
comment: 5 pages, 2 figures
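The Gen3d abstract starts from a point cloud obtained by lifting an RGBD image. A minimal sketch of that lifting step under a pinhole camera model follows; the intrinsics `fx`, `fy`, `cx`, `cy`, the function name, and the toy depth map are assumptions for illustration, since the abstract does not describe the camera model used.

```python
def lift_depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows of metric depths) into
    camera-frame 3D points using pinhole intrinsics.
    Pixels with non-positive depth are treated as invalid and skipped."""
    points = []
    for v, row in enumerate(depth):        # v: pixel row (image y)
        for u, z in enumerate(row):        # u: pixel column (image x)
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel (hypothetical values).
depth = [[2.0, 2.0],
         [0.0, 4.0]]
pts = lift_depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pts)
```

Colors can be attached by reading the RGB value at the same `(u, v)` for each kept pixel.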
NeuralBoneReg: A Novel Self-Supervised Method for Robust and Accurate Multi-Modal Bone Surface Registration
In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT-ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.68°/1.86 mm on UltraBones100k, 1.88°/1.89 mm on UltraBones-Hip, and 3.79°/2.45 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
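The NeuralBoneReg results are reported as relative rotation error (RRE, degrees) and relative translation error (RTE, mm). The sketch below uses the definitions commonly adopted in the registration literature: the angle of the residual rotation and the Euclidean distance between translations. The abstract does not state its exact formulas, so treat these, and the function names, as assumptions.

```python
import math


def relative_rotation_error_deg(r_est, r_gt):
    """Angle (degrees) of the residual rotation R_gt^T R_est.
    Rotations are 3x3 nested lists."""
    # trace(R_gt^T R_est) equals the sum of element-wise products.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def relative_translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(t_est, t_gt)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
rre = relative_rotation_error_deg(rot_z_90, identity)
rte = relative_translation_error((1.0, 2.0, 2.0), (0.0, 0.0, 0.0))
print(rre, rte)
```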
NeuralSSD: A Neural Solver for Signed Distance Surface Reconstruction
We propose a generalized method, NeuralSSD, for reconstructing a 3D implicit surface from widely available point cloud data. NeuralSSD is a solver based on the neural Galerkin method, aimed at reconstructing higher-quality and more accurate surfaces from input point clouds. Implicit methods are preferred due to their ability to accurately represent shapes and their robustness in handling topological changes. However, existing parameterizations of implicit fields lack explicit mechanisms to ensure a tight fit between the surface and input data. To address this, we propose a novel energy equation that balances the reliability of point cloud information. Additionally, we introduce a new convolutional network that learns three-dimensional information to achieve superior optimization results. This approach ensures that the reconstructed surface closely adheres to the raw input points and infers valuable inductive biases from point clouds, resulting in a highly accurate and stable surface reconstruction. NeuralSSD is evaluated on a variety of challenging datasets, including the ShapeNet and Matterport datasets, and achieves state-of-the-art results in terms of both surface reconstruction accuracy and generalizability.
comment: Under review
Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning
Cross-Domain Few-Shot Learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, and therefore faces three learning challenges at once: disjoint semantics, large domain discrepancy, and data scarcity. Unlike predominant CDFSL works that focus on generalized representations, we make a novel attempt to construct Intermediate Domain Proxies (IDP) with source feature embeddings as the codebook and reconstruct the target domain feature with this learned codebook. We then conduct an empirical study to explore the intrinsic attributes of intermediate domain proxies from the perspectives of visual styles and semantic contents. Building on these attributes of intermediate domains, we develop a fast domain alignment method that uses these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model is able to surpass the state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks.
comment: Accepted to IJCV 2025
Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
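The VLM3D abstract derives its critic signal from the VLM's Yes/No log-odds. Under a softmax over the vocabulary, the log-odds of "Yes" versus "No" reduces to the difference of the two token logits, because the normalizer cancels. The sketch below illustrates this; the function names, and in particular the weighted sum that combines the semantic and spatial queries, are assumptions, as the abstract does not say how the two queries are merged.

```python
import math


def yes_no_log_odds(logit_yes, logit_no):
    """log(p_yes / p_no) under a softmax: the shared normaliser cancels,
    leaving the difference of the two logits."""
    return logit_yes - logit_no


def yes_probability(logit_yes, logit_no):
    """Probability of 'Yes' renormalised over the two answers,
    i.e. a sigmoid of the log-odds."""
    return 1.0 / (1.0 + math.exp(-yes_no_log_odds(logit_yes, logit_no)))


def dual_query_score(semantic_logits, spatial_logits, w_semantic=1.0, w_spatial=1.0):
    """Hypothetical combination: weighted sum of the two queries' log-odds.
    Each argument is a (logit_yes, logit_no) pair."""
    return (w_semantic * yes_no_log_odds(*semantic_logits)
            + w_spatial * yes_no_log_odds(*spatial_logits))


# Made-up logits for illustration.
score = dual_query_score((2.0, 0.5), (1.0, 2.0))
print(score, yes_probability(2.0, 0.5))
```

Because the log-odds is a simple function of the logits, gradients can flow through it when the VLM is differentiable with respect to its image input, which is consistent with the abstract's description of a differentiable critic.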
Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery
Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address these two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the proposed GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.
ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation
With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
V2VLoc: Robust GNSS-Free Collaborative Perception via LiDAR Localization AAAI2026
Multi-agent systems rely on accurate poses to share and align observations, enabling a collaborative perception of the environment. However, traditional GNSS-based localization often fails in GNSS-denied environments, making consistent feature alignment difficult in collaboration. To tackle this challenge, we propose a robust GNSS-free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT), which performs confidence-aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi-traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state-of-the-art performance under GNSS-denied conditions. We further conduct extended experiments on the real-world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.
comment: AAAI2026
Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization AAAI 2026
The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable and multi-scale structural normalization. Beyond dense supervision, we introduce a cost-efficient weak supervision in the form of pairwise ordinal depth annotations to further guide the adaptation process, which enforces informative ordinal constraints to mitigate local topological errors. Finally, a weight regularization loss is employed to anchor the LoRA updates, ensuring training stability and preserving the model's generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
comment: Accepted by AAAI 2026
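The WeSTAR abstract uses pairwise ordinal depth annotations as weak supervision. One widely used way to turn such "point i is farther or closer than point j" labels into a training signal is a logistic ranking loss on the predicted depth difference. The sketch below shows that generic form; it is an assumption for illustration and not necessarily the loss used in the paper, and the function name and toy values are hypothetical.

```python
import math


def pairwise_ordinal_loss(pred_depth, pairs):
    """Mean logistic ranking loss over annotated pairs.

    pred_depth: sequence of predicted depths, indexed by point id.
    pairs: non-empty list of (i, j, relation), where relation is +1 if
           point i is annotated as farther than point j and -1 if closer.
    """
    total = 0.0
    for i, j, relation in pairs:
        # Positive margin means the prediction agrees with the annotation.
        margin = relation * (pred_depth[i] - pred_depth[j])
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


pred = [1.0, 3.0]                     # toy predicted depths
agree = pairwise_ordinal_loss(pred, [(1, 0, +1)])     # prediction matches label
disagree = pairwise_ordinal_loss(pred, [(0, 1, +1)])  # prediction contradicts label
print(agree, disagree)
```

The loss is small when the predicted ordering matches the annotation and grows roughly linearly with the size of the violation otherwise.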
Breaking the Passive Learning Trap: An Active Perception Strategy for Human Motion Prediction
Forecasting 3D human motion is an important embodiment of fine-grained understanding and cognition of human behavior by artificial agents. Current approaches rely excessively on implicit network modeling of spatiotemporal relationships and motion characteristics, falling into a passive learning trap that results in redundant and monotonous 3D coordinate information acquisition while lacking actively guided explicit learning mechanisms. To overcome these issues, we propose an Active Perceptual Strategy (APS) for human motion prediction, leveraging quotient space representations to explicitly encode motion properties while introducing auxiliary learning objectives to strengthen spatio-temporal modeling. Specifically, we first design a data perception module that projects poses into the quotient space, decoupling motion geometry from coordinate redundancy. By jointly encoding tangent vectors and Grassmann projections, this module simultaneously achieves geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement for effective motion pose characterization. Furthermore, we introduce a network perception module that actively learns spatio-temporal dependencies through restorative learning. This module deliberately masks specific joints or injects noise to construct auxiliary supervision signals. A dedicated auxiliary learning network is designed to actively adapt and learn from perturbed information. Notably, APS is model agnostic and can be integrated with different prediction models to enhance active perception. The experimental results demonstrate that our method achieves new state-of-the-art results, outperforming existing methods by large margins: 16.3% on H3.6M, 13.9% on CMU Mocap, and 10.1% on 3DPW.
comment: 8 pages, 3 figures
StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model
This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon, and they suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
Measurement-Constrained Sampling for Text-Prompted Blind Face Restoration
Blind face restoration (BFR) may correspond to multiple plausible high-quality (HQ) reconstructions under extremely low-quality (LQ) inputs. However, existing methods typically produce deterministic results, struggling to capture this one-to-many nature. In this paper, we propose a Measurement-Constrained Sampling (MCS) approach that enables diverse LQ face reconstructions conditioned on different textual prompts. Specifically, we formulate BFR as a measurement-constrained generative task by constructing an inverse problem through controlled degradations of coarse restorations, which allows posterior-guided sampling within text-to-image diffusion. Measurement constraints include both Forward Measurement, which ensures results align with input structures, and Reverse Measurement, which produces projection spaces, ensuring that the solution can align with various prompts. Experiments show that our MCS can generate prompt-aligned results and outperforms existing BFR methods. Codes will be released after acceptance.
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
Multi-Scale Correlation-Aware Transformer for Maritime Vessel Re-Identification
Maritime vessel re-identification (Re-ID) plays a crucial role in advancing maritime monitoring and intelligent situational awareness systems. However, some existing vessel Re-ID methods are directly adapted from pedestrian-focused algorithms, making them ill-suited for mitigating the unique problems present in vessel images, particularly the greater intra-identity variations and more severe missing of local parts, which lead to the emergence of outlier samples within the same identity. To address these challenges, we propose the Multi-scale Correlation-aware Transformer Network (MCFormer), which explicitly models multi-scale correlations across the entire input set to suppress the adverse effects of outlier samples with intra-identity variations or local missing, incorporating two novel modules, the Global Correlation Module (GCM), and the Local Correlation Module (LCM). Specifically, GCM constructs a global similarity affinity matrix across all input images to model global correlations through feature aggregation based on inter-image consistency, rather than solely learning features from individual images as in most existing approaches. Simultaneously, LCM mines and aligns local features of positive samples with contextual similarity to extract local correlations by maintaining a dynamic memory bank, effectively compensating for missing or occluded regions in individual images. To further enhance feature robustness, MCFormer integrates global and local features that have been respectively correlated across multiple scales, effectively capturing latent relationships among image features. Experiments on three benchmarks demonstrate that MCFormer achieves state-of-the-art performance.
Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
High-quality data has become a primary driver of progress under scaling laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model's evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.
comment: preprint version, under review
MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals AAAI 2026
Reconstructing video from brain signals is an important brain decoding task. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the high cost of collecting brain-video data causes severe data scarcity. Although some cross-subject methods have been introduced, they often over-focus on subject-invariant information while neglecting subject-specific information, resulting in slow fine-tuning-based adaptation. To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject framework. MindCross's N specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-K collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects' encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross's efficacy and efficiency of cross-subject decoding and new subject adaptation using only one model.
comment: AAAI 2026, 16 pages
Hierarchical Semantic Learning for Multi-Class Aorta Segmentation MICCAI 2024
The aorta, the body's largest artery, is prone to pathologies such as dissection, aneurysm, and atherosclerosis, which often require timely intervention. Minimally invasive repairs involving branch vessels necessitate detailed 3D anatomical analysis. Existing methods often overlook hierarchical anatomical relationships while struggling with severe class imbalance inherent in vascular structures. We address these challenges with a curriculum learning strategy that leverages a novel fractal softmax for hierarchical semantic learning. Inspired by human cognition, our approach progressively learns anatomical constraints by decomposing complex structures from simple to complex components. The curriculum learning framework naturally addresses class imbalance by first establishing robust feature representations for dominant classes before tackling rare but anatomically critical structures, significantly accelerating model convergence in multi-class scenarios. Our two-stage inference strategy achieves up to fivefold acceleration, enhancing clinical practicality. On the validation set at epoch 50, our hierarchical semantic loss improves the Dice score of nnU-Net ResEnc M by 11.65%. The proposed model demonstrates a 5.6% higher Dice score than baselines on the test set. Experimental results show significant improvements in segmentation accuracy and efficiency, making the framework suitable for real-time clinical applications. The implementation code for this challenge entry is publicly available at: https://github.com/PengchengShi1220/AortaSeg24. The code for fractal softmax will be available at https://github.com/PengchengShi1220/fractal-softmax.
comment: Accepted by MICCAI 2024 Workshop AortaSeg
Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation AAAI
Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.
comment: The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
PAVE: An End-to-End Dataset for Production Autonomous Vehicle Evaluation
Most existing autonomous-driving datasets (e.g., KITTI, nuScenes, and the Waymo Perception Dataset), collected by human-driving mode or unidentified driving mode, can only serve as early training for the perception and prediction of autonomous vehicles (AVs). To evaluate the real behavioral safety of AVs controlled in the black box, we present the first end-to-end benchmark dataset collected entirely by autonomous-driving mode in the real world. This dataset contains over 100 hours of naturalistic data from multiple production autonomous-driving vehicle models in the market. We segment the original data into 32,727 key frames, each consisting of four synchronized camera images and high-precision GNSS/IMU data (0.8 cm localization accuracy). For each key frame, 20 Hz vehicle trajectories spanning the past 6 s and future 5 s are provided, along with detailed 2D annotations of surrounding vehicles, pedestrians, traffic lights, and traffic signs. These key frames have rich scenario-level attributes, including driver intent, area type (covering highways, urban roads, and residential areas), lighting (day, night, or dusk), weather (clear or rain), road surface (paved or unpaved), traffic and vulnerable road users (VRU) density, traffic lights, and traffic signs (warning, prohibition, and indication). To evaluate the safety of AVs, we employ an end-to-end motion planning model that predicts vehicle trajectories with an Average Displacement Error (ADE) of 1.4 m on autonomous-driving frames. The dataset continues to expand by over 10 hours of new data weekly, thereby providing a sustainable foundation for research on AV driving behavior analysis and safety evaluation.
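For readers unfamiliar with the metric quoted above, Average Displacement Error is conventionally the mean Euclidean distance between predicted and ground-truth waypoints over the prediction horizon. The sketch below follows that standard definition; the function name, the 2D waypoint layout, and the synthetic trajectory are illustrative assumptions and are not taken from the dataset or its evaluation code.

```python
import numpy as np


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matching waypoints.

    pred, gt: arrays of shape (T, 2) holding x/y positions in metres.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


# A 5 s future horizon sampled at 20 Hz gives 100 waypoints.
T = 100
gt = np.stack([np.linspace(0.0, 50.0, T), np.zeros(T)], axis=1)  # straight path
pred = gt + np.array([0.0, 1.4])  # synthetic constant lateral offset of 1.4 m
ade = average_displacement_error(pred, gt)
```

In this synthetic case every waypoint is off by the same lateral distance, so the ADE equals that offset; real predictions typically drift more at the end of the horizon than at the start.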
OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model
We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaptation in 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/
comment: 13 pages
LED: Light Enhanced Depth Estimation at Night BMVC 2025
Nighttime camera-based depth estimation is a highly challenging task, especially for autonomous driving applications, where accurate depth perception is essential for ensuring safe navigation. Models trained on daytime data often fail in the absence of precise but costly LiDAR. Even vision foundation models trained on large amounts of data are unreliable in low-light conditions. In this work, we aim to improve the reliability of perception systems at night time. To this end, we introduce Light Enhanced Depth (LED), a novel, cost-effective approach that significantly improves depth estimation in low-light environments by harnessing a pattern projected by high definition headlights available in modern vehicles. LED leads to significant performance boosts across multiple depth-estimation architectures (encoder-decoder, Adabins, DepthFormer, Depth Anything V2) both on synthetic and real datasets. Furthermore, increased performances beyond illuminated areas reveal a holistic enhancement in scene understanding. Finally, we release the Nighttime Synthetic Drive Dataset, a synthetic and photo-realistic nighttime dataset, which comprises 49,990 comprehensively annotated images.
comment: BMVC 2025 (Poster). Code and dataset available on the project page: https://simondemoreau.github.io/LED/ 21 pages, 13 figures
StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion
In the field of sketch generation, raster-format trained models often produce non-stroke artifacts, while vector-format trained models typically lack a holistic understanding of sketches, leading to compromised recognizability. Moreover, existing methods struggle to extract common features from similar elements (e.g., eyes of animals) appearing at varying positions across sketches. To address these challenges, we propose StrokeFusion, a two-stage framework for vector sketch generation. It contains a dual-modal sketch feature learning network that maps strokes into a high-quality latent space. This network decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps, representing sketches as sets of stroke feature vectors. Building upon this representation, our framework exploits a stroke-level latent diffusion model that simultaneously adjusts stroke position, scale, and trajectory during generation. This enables high-fidelity sketch generation while supporting stroke interpolation editing. Extensive experiments on the QuickDraw dataset demonstrate that our framework outperforms state-of-the-art techniques, validating its effectiveness in preserving structural integrity and semantic features. Code and models will be made publicly available upon publication.
Measuring Train Driver Performance as Key to Approval of Driverless Trains
Points 2.1.4(b), 2.4.2(b) and 2.4.3(b) in Annex I of Implementing Regulation (EU) No. 402/2013 allow a simplified approach for the safety approval of computer vision systems for driverless trains, if they have 'similar' functions and interfaces as the replaced human driver. The human driver is not replaced one-to-one by a technical system - only a limited set of cognitive functions are replaced. However, performance in the most challenging function, obstacle detection, is difficult to quantify due to the deficiency of published measurement results. This article summarizes the data published so far. This article also goes a long way to remedy this situation by providing a new public and anonymized dataset of 711 train driver performance measurements from controlled experiments. The measurements are made for different speeds, obstacle sizes, train protection systems and obstacle color contrasts respectively. The measured values are reaction time and distance to the obstacle. The goal of this paper is an unbiased and exhaustive description of the presented dataset for research, standardization and regulation. The dataset with supplementing information and literature is published on https://data.fid-move.de/de/dataset/atosensedata
comment: 6 pages, 3 figures
Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew
As machine learning models are increasingly deployed in safety-critical domains, visual explanation techniques have become essential tools for supporting transparency. In this work, we reveal a new class of attacks that compromise model interpretability without affecting accuracy. Specifically, we show that small color perturbations applied by adversarial clients in a federated learning setting can shift a model's saliency maps away from semantically meaningful regions while keeping the prediction unchanged. The proposed saliency-aware attack framework, called Chromatic Perturbation Module, systematically crafts adversarial examples by altering the color contrast between foreground and background in a way that disrupts explanation fidelity. These perturbations accumulate across training rounds, poisoning the global model's internal feature attributions in a stealthy and persistent manner. Our findings challenge a common assumption in model auditing that correct predictions imply faithful explanations and demonstrate that interpretability itself can be an attack surface. We evaluate this vulnerability across multiple datasets and show that standard training pipelines are insufficient to detect or mitigate explanation degradation, especially in the federated learning setting, where subtle color perturbations are harder to discern. Our attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% on all evaluated datasets.
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification MICCAI
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.
comment: Accepted in MICCAI Workshop 2025
Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN
This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. While 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNN-LSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical implementation. This project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.
MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding WSDM 2026
With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.
comment: Accepted by WSDM 2026. 11 pages, 9 figures
MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of the Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on the click-through rate (CTR) prediction task, which achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on the CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of "Pretraining, Post-training, and Application", allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify the image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.
comment: 31 pages, 12 figures
Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning AAAI2026
Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. We propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.
comment: Accepted by AAAI2026
Fine-Grained Representation for Lane Topology Reasoning AAAI 2026
Precise modeling of lane topology is essential for autonomous driving, as it directly impacts navigation and control decisions. Existing methods typically represent each lane with a single query and infer topological connectivity based on the similarity between lane queries. However, this kind of design struggles to accurately model complex lane structures, leading to unreliable topology prediction. In this view, we propose a Fine-Grained lane topology reasoning framework (TopoFG). It divides the procedure from bird's-eye-view (BEV) features to topology prediction via fine-grained queries into three phases, i.e., Hierarchical Prior Extractor (HPE), Region-Focused Decoder (RFD), and Robust Boundary-Point Topology Reasoning (RBTR). Specifically, HPE extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences to guide subsequent fine-grained query modeling. RFD constructs fine-grained queries by integrating the spatial and sequential priors. It then samples reference points in RoI regions of the mask and applies cross-attention with BEV features to refine the query representations of each lane. RBTR models lane connectivity based on boundary-point query features and further employs a topological denoising strategy to reduce matching ambiguity. By integrating spatial and sequential priors into fine-grained queries and applying a denoising strategy to boundary-point topology reasoning, our method precisely models complex lane structures and delivers trustworthy topology predictions. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoFG achieves new state-of-the-art performance, with an OLS of 48.0 on subsetA and 45.4 on subsetB.
comment: Accepted by AAAI 2026
Logos as a Well-Tempered Pre-train for Sign Language Recognition
This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, although a certain number of datasets are available, the data for individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive available ISLR dataset by the number of signers, one of the most extensive datasets in size and vocabulary, and the largest RSL dataset. It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.
StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving
Personalization, while extensively studied in conventional autonomous driving pipelines, has been largely overlooked in the context of end-to-end autonomous driving (E2EAD), despite its critical role in fostering user trust, safety perception, and real-world adoption. A primary bottleneck is the absence of large-scale real-world datasets that systematically capture driving preferences, severely limiting the development and evaluation of personalized E2EAD models. In this work, we introduce the first large-scale real-world dataset explicitly curated for personalized E2EAD, integrating comprehensive scene topology with rich dynamic context derived from agent dynamics and semantics inferred via a fine-tuned vision-language model (VLM). We propose a hybrid annotation pipeline that combines behavioral analysis, rule-and-distribution-based heuristics, and subjective semantic modeling guided by VLM reasoning, with final refinement through human-in-the-loop verification. Building upon this dataset, we introduce the first standardized benchmark for systematically evaluating personalized E2EAD models. Empirical evaluations on state-of-the-art architectures demonstrate that incorporating personalized driving preferences significantly improves behavioral alignment with human demonstrations.
comment: 25 pages, 7 figures, 5 tables
Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising
Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction.
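As a rough illustration of spatially adaptive TV regularization, the following Python sketch denoises a 1D signal by plain gradient descent on a smoothed, weighted TV objective. It is a toy under stated assumptions, not the unrolled solver or LambdaNet described in the abstract: the function name `weighted_tv_denoise_1d`, the hand-set weight array `lam`, and the step size, iteration count, and smoothing constant are all hypothetical choices.

```python
import numpy as np

def weighted_tv_denoise_1d(y, lam, steps=300, lr=0.05, eps=1e-2):
    """Minimize 0.5*||x - y||^2 + sum_i lam[i] * sqrt((x[i+1]-x[i])^2 + eps).

    y   : noisy 1D signal, shape (n,)
    lam : per-sample regularization weights, shape (n,); lam[i] weights the
          difference between samples i and i+1 (the last entry is unused).
    The square root with eps is a smooth stand-in for the absolute value.
    """
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    x = y.copy()
    for _ in range(steps):
        d = np.diff(x)                          # forward differences, length n-1
        g = lam[:-1] * d / np.sqrt(d * d + eps) # weighted derivative of smoothed |d|
        grad_tv = np.zeros_like(x)
        grad_tv[:-1] -= g                       # each difference pulls sample i up ...
        grad_tv[1:] += g                        # ... and sample i+1 down (or vice versa)
        x -= lr * ((x - y) + grad_tv)           # data-fidelity gradient + TV gradient
    return x
```

With `lam` set to zero the signal is returned unchanged; larger local weights smooth more strongly at those positions, which is the behavior a learned per-pixel map would be expected to modulate.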
Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometry-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
Towards Understanding 3D Vision: the Role of Gaussian Curvature
Recent advances in computer vision have predominantly relied on data-driven approaches that leverage deep learning and large-scale datasets. Deep neural networks have achieved remarkable success in tasks such as stereo matching and monocular depth reconstruction. However, these methods lack explicit models of 3D geometry that can be directly analyzed, transferred across modalities, or systematically modified for controlled experimentation. We investigate the role of Gaussian curvature in 3D surface modeling. Besides Gaussian curvature being an invariant quantity under change of observers or coordinate systems, we demonstrate using the Middlebury stereo dataset that it offers a sparse and compact description of 3D surfaces. Furthermore, we show a strong correlation between the performance rank of top state-of-the-art stereo and monocular methods and the low total absolute Gaussian curvature. We propose that this property can serve as a geometric prior to improve future 3D reconstruction algorithms.
CARScenes: Semantic VLM Dataset for Safe Autonomous Driving
CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
comment: 8 pages, 6 figures, 7 tables
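The evaluation metrics named in the CAR-Scenes abstract (micro-averaged F1 for list attributes, MAE and RMSE for severity) can be sketched in a few lines of Python. This is an illustrative sketch, not the released evaluation script: the function names and the example labels are hypothetical, and the project's own scripts may handle empty lists or missing fields differently.

```python
import math

def micro_f1(true_lists, pred_lists):
    """Micro-averaged F1: pool true/false positives over all records first."""
    tp = fp = fn = 0
    for truth, pred in zip(true_lists, pred_lists):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels predicted and present
        fp += len(pred - truth)   # labels predicted but absent
        fn += len(truth - pred)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mae_rmse(true_vals, pred_vals):
    """Mean absolute error and root mean squared error for scalar scores."""
    errs = [p - t for t, p in zip(true_vals, pred_vals)]
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, rmse
```

Micro-averaging weights every predicted label equally, so frequent attributes dominate the score; a macro average over attributes would instead weight rare attributes as heavily as common ones.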
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions NeurIPS 2025
Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, such as the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on the MS COCO and ImageNet-1k benchmarks validate that second-order methods, such as FIxLIP, outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models, e.g. CLIP vs. SigLIP-2.
comment: NeurIPS 2025. Code: https://github.com/hbaniecki/fixlip
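For readers unfamiliar with the index, a pairwise weighted Banzhaf interaction is commonly defined as the expected second-order discrete derivative of a value function when every other player joins the coalition independently with probability p. The brute-force Python sketch below follows that definition for a toy game. It is not the FIxLIP estimator: the function name, the `value` callback, and the exhaustive enumeration (exponential in the number of players) are assumptions made for illustration only.

```python
from itertools import combinations

def weighted_banzhaf_interaction(value, n, i, j, p=0.5):
    """Exact pairwise interaction between players i and j.

    value : callable mapping a frozenset of player indices to a float
    n     : total number of players (indices 0..n-1)
    p     : probability that each remaining player is in the coalition;
            p = 0.5 gives the classical Banzhaf weighting
    """
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Probability of one specific coalition of this size.
        weight = p ** size * (1 - p) ** (len(others) - size)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Second-order discrete derivative of the value function at s.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total
```

In a vision-language setting the players would be image patches and text tokens and `value` a masked similarity score; an additive game yields zero interaction for every pair, which is a convenient sanity check.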
SlotMatch: Distilling Object-Centric Representations for Unsupervised Video Segmentation
Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To address this, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is supported by theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on three datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x fewer parameters and running up to 2.7x faster. Moreover, our student surpasses all other state-of-the-art unsupervised video segmentation models.
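A minimal sketch of a cosine-similarity slot alignment loss, written in Python with NumPy, is shown below. It assumes that slot k of the student already corresponds to slot k of the teacher and that both use the same feature dimension; the function name `slot_cosine_loss` and the exact loss form (mean of one minus cosine similarity) are assumptions, and the actual SlotMatch objective may differ in detail.

```python
import numpy as np

def slot_cosine_loss(teacher_slots, student_slots, eps=1e-8):
    """Mean (1 - cosine similarity) over corresponding slots.

    Both inputs have shape (num_slots, dim).
    """
    teacher_slots = np.asarray(teacher_slots, dtype=float)
    student_slots = np.asarray(student_slots, dtype=float)
    # Normalize each slot vector to unit length (eps guards against zero norm).
    t = teacher_slots / (np.linalg.norm(teacher_slots, axis=-1, keepdims=True) + eps)
    s = student_slots / (np.linalg.norm(student_slots, axis=-1, keepdims=True) + eps)
    cos = np.sum(t * s, axis=-1)      # per-slot cosine similarity
    return float(np.mean(1.0 - cos))  # 0 when aligned, 2 when opposite
```

Because the loss depends only on direction, the student is free to use a different feature scale than the teacher.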
Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference
Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i
4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration
Leveraging diverse robotic data for pretraining remains a critical challenge. Existing methods typically model the dataset's action distribution using simple observations as inputs. However, these inputs are often incomplete, resulting in a dispersed conditional action distribution, an issue we refer to as coordinate system chaos and state chaos. This inconsistency significantly hampers pretraining efficiency. To address this, we propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to mitigate these sources of chaos. Our model introduces depth and temporal information into visual features with sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment endows the model with strong spatiotemporal reasoning capabilities while minimizing training overhead. Additionally, we introduce memory bank sampling, a frame sampling strategy designed to extract informative frames from historical images, further improving effectiveness and efficiency. Experimental results demonstrate that our pretraining method and architectural components substantially enhance model performance. In both simulated and real-world experiments, our model achieves a significant increase in success rate over OpenVLA. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
Benchmarking Deep Learning-Based Object Detection Models on Feature Deficient Astrophotography Imagery Dataset
Object detection models are typically trained on datasets like ImageNet, COCO, and PASCAL VOC, which focus on everyday objects. However, these datasets lack the signal sparsity found in non-commercial domains. MobilTelesco, a smartphone-based astrophotography dataset, addresses this gap by providing sparse night-sky images. We benchmark several detection models on it, highlighting the challenges that arise under feature-deficient conditions.
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
comment: Project page: https://fudan-zvg.github.io/spar
SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency-fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78$\times$ on average while maintaining neural-field fidelity and using 10$\times$ fewer primitives. Adding GroupFlow culminates in 13.71$\times$ faster rendering and 2.53$\times$ shorter training, surpassing all baselines in speed while preserving superior image quality.
comment: Project Page: https://speede3dgs.github.io/
Segmentation-Driven Initialization for Sparse-view 3D Gaussian Splatting
Sparse-view synthesis remains a challenging problem due to the difficulty of recovering accurate geometry and appearance from limited observations. While recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time rendering with competitive quality, existing pipelines often rely on Structure-from-Motion (SfM) for camera pose estimation, an approach that struggles in genuinely sparse-view settings. Moreover, several SfM-free methods replace SfM with multi-view stereo (MVS) models, but generate massive numbers of 3D Gaussians by back-projecting every pixel into 3D space, leading to high memory costs. We propose Segmentation-Driven Initialization for Gaussian Splatting (SDI-GS), a method that mitigates inefficiency by leveraging region-based segmentation to identify and retain only structurally significant regions. This enables selective downsampling of the dense point cloud, preserving scene fidelity while substantially reducing Gaussian count. Experiments across diverse benchmarks show that SDI-GS reduces Gaussian count by up to 50% and achieves comparable or superior rendering quality in PSNR and SSIM, with only marginal degradation in LPIPS. It further enables faster training and lower memory footprint, advancing the practicality of 3DGS for constrained-view scenarios.
Deep Equilibrium models for Poisson Imaging Inverse problems via Mirror Descent
Deep Equilibrium Models (DEQs) are implicit neural networks with fixed points, which have recently gained attention for learning image regularization functionals, particularly in settings involving Gaussian fidelities, where assumptions on the forward operator ensure contractiveness of standard (proximal) Gradient Descent operators. In this work, we extend the application of DEQs to Poisson inverse problems, where the data fidelity term is more appropriately modeled by the Kullback--Leibler divergence. To this end, we introduce a novel DEQ formulation based on Mirror Descent defined in terms of a tailored non-Euclidean geometry that naturally adapts to the structure of the data term. This enables the learning of neural regularizers within a principled training framework. We derive sufficient conditions and establish refined convergence results based on the Kurdyka--Lojasiewicz framework for subanalytic functions with non-closed domains to guarantee the convergence of the learned reconstruction scheme, and we propose computational strategies that enable both efficient training and parameter-free inference. Numerical experiments show that our method outperforms traditional model-based approaches and performs comparably to Bregman Plug-and-Play methods, while mitigating their typical drawbacks, such as time-consuming tuning of hyper-parameters. The code is publicly available at https://github.com/christiandaniele/DEQ-MD.
DepthVision: Enabling Robust Vision-Language Models with GAN-Based LiDAR-to-RGB Synthesis for Autonomous Driving
Ensuring reliable autonomous operation when visual input is degraded remains a key challenge in intelligent vehicles and robotics. We present DepthVision, a multimodal framework that enables Vision--Language Models (VLMs) to exploit LiDAR data without any architectural changes or retraining. DepthVision synthesizes dense, RGB-like images from sparse LiDAR point clouds using a conditional GAN with an integrated refiner, and feeds these into off-the-shelf VLMs through their standard visual interface. A Luminance-Aware Modality Adaptation (LAMA) module fuses synthesized and real camera images by dynamically weighting each modality based on ambient lighting, compensating for degradation such as darkness or motion blur. This design turns LiDAR into a drop-in visual surrogate when RGB becomes unreliable, effectively extending the operational envelope of existing VLMs. We evaluate DepthVision on real and simulated datasets across multiple VLMs and safety-critical tasks, including vehicle-in-the-loop experiments. The results show substantial improvements in low-light scene understanding over RGB-only baselines while preserving full compatibility with frozen VLM architectures. These findings demonstrate that LiDAR-guided RGB synthesis is a practical pathway for integrating range sensing into modern vision-language systems for autonomous driving.
Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
The Steered Mixture of Experts regression framework has demonstrated strong performance in image reconstruction, compression, denoising, and super-resolution. However, its high computational cost limits practical applications. This work introduces a rasterization-based optimization strategy that combines the efficiency of rasterized Gaussian kernel rendering with the edge-aware gating mechanism of the Steered Mixture of Experts. The proposed method is designed to accelerate two-dimensional image regression while maintaining the model's inherent sparsity and reconstruction quality. By replacing global iterative optimization with a rasterized formulation, the method achieves significantly faster parameter updates and more memory-efficient model representations. In addition, the proposed framework supports applications such as native super-resolution and image denoising, which are not directly achievable with standard rasterized Gaussian kernel approaches. The combination of fast rasterized optimization with the edge-aware structure of the Steered Mixture of Experts provides a new balance between computational efficiency and reconstruction fidelity for two-dimensional image processing tasks.
MAVias: Mitigate any Visual Bias
Mitigating biases in computer vision models is an essential step towards the trustworthiness of artificial intelligence models. Existing bias mitigation methods focus on a small set of predefined biases, limiting their applicability in visual datasets where multiple, possibly unknown biases exist. To address this limitation, we introduce MAVias, an open-set bias mitigation approach leveraging foundation models to discover spurious associations between visual attributes and target classes. MAVias first captures a wide variety of visual features in natural language via a foundation image tagging model, and then leverages a large language model to select those visual features defining the target class, resulting in a set of language-coded potential visual biases. We then translate this set of potential biases into vision-language embeddings and introduce an in-processing bias mitigation approach to prevent the model from encoding information related to them. Our experiments on diverse datasets, including CelebA, Waterbirds, ImageNet, and UrbanCars, show that MAVias effectively detects and mitigates a wide range of biases in visual recognition tasks, outperforming the current state of the art.
Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental Modelling
Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.
comment: 10 pages (including 2 pages of references), 7 figures
Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving
Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion related tasks, such as prediction and planning, impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning. Specifically, we employ a set of learned motion queries that operate in parallel with detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset with UniAD and SparseDrive confirm the effectiveness of our divide and merge approach, resulting in performance improvements across perception, prediction, and planning. Our code is available at https://github.com/shenyinzhe/DMAD.
Mapping Reduced Accessibility to WASH Facilities in Rohingya Refugee Camps With Sub-Meter Imagery
Access to Water, Sanitation, and Hygiene (WASH) services remains a major public health concern in refugee camps. This study introduces a remote sensing-driven framework to quantify WASH accessibility, specifically to water pumps, latrines, and bathing cubicles, in the Rohingya camps of Cox's Bazar, one of the world's most densely populated displacement settings. Detecting refugee shelters in such emergent camps presents substantial challenges, primarily due to their dense spatial configuration and irregular geometric patterns. Using sub-meter satellite images, we develop a semi-supervised segmentation framework that achieves an F1-score of 76.4% in detecting individual refugee shelters. Applying the framework across multi-year data reveals declining WASH accessibility, driven by rapid refugee population growth and reduced facility availability: the load rose from 25 people per facility in 2022 to 29.4 in 2025. Gender-disaggregated analysis further shows that women and girls experience reduced accessibility in scenarios with inadequate safety-related segregation in WASH facilities. These findings suggest the importance of demand-responsive allocation strategies that can identify areas with under-served populations, such as women and girls, and ensure that limited infrastructure serves the greatest number of people in settings with fixed or shrinking budgets. We also discuss the value of high-resolution remote sensing and machine learning to detect inequality and inform equitable resource planning in complex humanitarian environments.
comment: 23 pages, 13 figures, 2 tables
PALM: A Dataset and Baseline for Learning Multi-subject Hand Prior
The ability to grasp objects, signal with gestures, and share emotion through touch all stem from the unique capabilities of human hands. Yet creating high-quality personalized hand avatars from images remains challenging due to complex geometry, appearance, and articulation, particularly under unconstrained lighting and limited views. Progress has also been limited by the lack of datasets that jointly provide accurate 3D geometry, high-resolution multiview imagery, and a diverse population of subjects. To address this, we present PALM, a large-scale dataset comprising 13k high-quality hand scans from 263 subjects and 90k multi-view images, capturing rich variation in skin tone, age, and geometry. To show its utility, we present a baseline PALM-Net, a multi-subject prior over hand geometry and material properties learned via physically based inverse rendering, enabling realistic, relightable single-image hand avatar personalization. PALM's scale and diversity make it a valuable real-world resource for hand modeling and related research.
Improving Greenland Bed Topography Mapping with Uncertainty-Aware Graph Learning on Sparse Radar Data
Accurate maps of Greenland's subglacial bed are essential for sea-level projections, but radar observations are sparse and uneven. We introduce GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and explicitly models uncertainty via Monte Carlo dropout. Spatial graphs built from surface observables (elevation, velocity, mass balance) are augmented with gradient features and polynomial trends to capture both local variability and broad structure. To handle data gaps, we employ a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization. Applied to three Greenland subregions, GraphTopoNet outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60 percent while preserving fine-scale glacial features. The resulting bed maps improve reliability for operational modeling, supporting agencies engaged in climate forecasting and policy. More broadly, GraphTopoNet shows how graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale.
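Monte Carlo dropout, the uncertainty mechanism named in the abstract, keeps dropout active at inference time and summarizes several stochastic forward passes by their mean and spread. The NumPy sketch below applies it to a tiny two-layer network. It is a generic illustration, not GraphTopoNet: the function name, the weight matrices `w1` and `w2`, and the absence of any graph structure are simplifying assumptions.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, drop_prob=0.2, passes=100, seed=0):
    """Return the per-output mean and standard deviation over stochastic passes.

    x  : inputs, shape (batch, in_dim)
    w1 : first-layer weights, shape (in_dim, hidden)
    w2 : second-layer weights, shape (hidden, out_dim)
    """
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(passes):
        h = np.maximum(x @ w1, 0.0)               # ReLU hidden layer
        mask = rng.random(h.shape) >= drop_prob   # keep each unit with prob 1 - drop_prob
        h = h * mask / (1.0 - drop_prob)          # inverted-dropout rescaling
        outputs.append(h @ w2)
    outputs = np.stack(outputs)                   # (passes, batch, out_dim)
    return outputs.mean(axis=0), outputs.std(axis=0)
```

The standard deviation serves as a per-prediction uncertainty estimate: it is zero when dropout is disabled and grows where the prediction depends heavily on individual hidden units.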
RynnEC: Bringing MLLMs into Embodied World
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
comment: The technical report of RynnEC, an embodied cognition MLLM
SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation
Inspired by Segment Anything 2, which generalizes segmentation from images to videos, we propose SAM2MOT, a novel segmentation-driven paradigm for multi-object tracking that breaks away from the conventional detection-association framework. In contrast to previous approaches that treat segmentation as auxiliary information, SAM2MOT places it at the heart of the tracking process, systematically tackling challenges like false positives and occlusions. Its effectiveness has been thoroughly validated on major MOT benchmarks. Furthermore, SAM2MOT integrates a pre-trained detector and a pre-trained segmentor with tracking logic into a zero-shot MOT system that requires no fine-tuning. This significantly reduces dependence on labeled data and paves the way for transitioning MOT research from task-specific solutions to general-purpose systems. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.
Foundation Models in Medical Imaging: A Review and Outlook
Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.
GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction
Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://sobeymil.github.io/GeoMVD.com.
LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
Generating high-fidelity 3D content remains a fundamental challenge due to the complexity of representing arbitrary topologies -- such as open surfaces and intricate internal structures -- while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs) -- a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to $2048^3$ -- a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.
comment: 11 pages, 6 figures
SMOL-MapSeg: Show Me One Label as prompt
Historical maps offer valuable insights into changes on Earth's surface but pose challenges for modern segmentation models due to inconsistent visual styles and symbols. While deep learning models such as UNet and pre-trained foundation models perform well in domains like autonomous driving and medical imaging, they struggle with the variability of historical maps, where similar concepts appear in diverse forms. To address this issue, we propose On-Need Declarative (OND) knowledge-based prompting, a method that provides explicit image-label pair prompts to guide models in linking visual patterns with semantic concepts. This enables users to define and segment target concepts on demand, supporting flexible, concept-aware segmentation. Our approach replaces the prompt encoder of the Segment Anything Model (SAM) with the OND prompting mechanism and fine-tunes it on historical maps, creating SMOL-MapSeg (Show Me One Label). Unlike existing SAM-based fine-tuning methods that are class-agnostic or restricted to fixed classes, SMOL-MapSeg supports class-aware segmentation across arbitrary datasets. Experiments show that SMOL-MapSeg accurately segments user-defined classes and substantially outperforms baseline models. Furthermore, it demonstrates strong generalization even with minimal training data, highlighting its potential for scalable and adaptable historical map analysis.
RelTopo: Multi-Level Relational Modeling for Driving Scene Topology Reasoning
Accurate road topology reasoning is critical for autonomous driving, as it requires both perceiving road elements and understanding how lanes connect to each other (L2L) and to traffic elements (L2T). Existing methods often focus on either perception or L2L reasoning, leaving L2T underexplored, and they fall short of jointly optimizing perception and reasoning. Moreover, although topology prediction inherently involves relations, relational modeling itself is seldom incorporated into feature extraction or supervision. As humans naturally leverage contextual relationships to recognize road elements and infer their connectivity, we posit that relational modeling can likewise benefit both perception and reasoning, and that these two tasks should be mutually enhancing. To this end, we propose RelTopo, a multi-level relational modeling approach that systematically integrates relational cues across three levels: 1) perception-level: a relation-aware lane detector with geometry-biased self-attention and curve-guided cross-attention enriches lane representations; 2) reasoning-level: relation-enhanced topology heads, including a geometry-enhanced L2L head and a cross-view L2T head, enhance topology inference via relational cues; and 3) supervision-level: a contrastive InfoNCE strategy regularizes relational embeddings. This design enables perception and reasoning to be optimized jointly. Extensive experiments on OpenLane-V2 demonstrate that RelTopo significantly improves both detection and topology reasoning, with gains of +3.1 in DET$_l$, +5.3 in TOP$_{ll}$, +4.9 in TOP$_{lt}$, and +4.4 overall in OLS, setting a new state of the art. Code will be released.
comment: Preprint. Under review
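The supervision-level InfoNCE term mentioned in the RelTopo abstract can be illustrated with a minimal sketch. This is the generic InfoNCE loss for a single anchor, not the authors' implementation: the function name `info_nce`, the cosine similarity, and the temperature value are assumptions, and the abstract does not say how positive and negative relational pairs are formed.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor embedding (illustrative only).

    The loss is the negative log-softmax of the positive pair's similarity
    among all candidate pairs. Inputs are plain lists of floats and must be
    non-zero vectors, since cosine similarity divides by their norms.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Index 0 holds the positive pair; the rest are negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]
```

The loss is small when the anchor is closer to its positive than to any negative, and grows when a negative is the closer match.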
Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback AAAI 2026
Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.
comment: AAAI 2026
Manifold Learning for Hyperspectral Images
Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
comment: 23 pages, 10 figures. The last two authors made equal contributions
Geometry Meets Light: Leveraging Geometric Priors for Universal Photometric Stereo under Limited Multi-Illumination Cues AAAI 2026
Universal Photometric Stereo is a promising approach for recovering surface normals without strict lighting assumptions. However, it struggles when multi-illumination cues are unreliable, such as under biased lighting or in shadows or self-occluded regions of complex in-the-wild scenes. We propose GeoUniPS, a universal photometric stereo network that integrates synthetic supervision with high-level geometric priors from large-scale 3D reconstruction models pretrained on massive in-the-wild data. Our key insight is that these 3D reconstruction models serve as visual-geometry foundation models, inherently encoding rich geometric knowledge of real scenes. To leverage this, we design a Light-Geometry Dual-Branch Encoder that extracts both multi-illumination cues and geometric priors from the frozen 3D reconstruction model. We also address the limitations of the conventional orthographic projection assumption by introducing the PS-Perp dataset with realistic perspective projection to enable learning of spatially varying view directions. Extensive experiments demonstrate that GeoUniPS delivers state-of-the-art performance across multiple datasets, both quantitatively and qualitatively, especially in complex in-the-wild scenes.
comment: Accepted by AAAI 2026 (Oral)
HCF: Hierarchical Cascade Framework for Distributed Multi-Stage Image Compression AAAI 2026
Distributed multi-stage image compression -- where visual content traverses multiple processing nodes under varying quality requirements -- poses challenges. Progressive methods enable bitstream truncation but underutilize available compute resources; successive compression repeats costly pixel-domain operations and suffers cumulative quality loss and inefficiency; fixed-parameter models lack post-encoding flexibility. In this work, we developed the Hierarchical Cascade Framework (HCF) that achieves high rate-distortion performance and better computational efficiency through direct latent-space transformations across network nodes in distributed multi-stage image compression systems. Under HCF, we introduced policy-driven quantization control to optimize rate-distortion trade-offs, and established the edge quantization principle through differential entropy analysis. The configuration based on this principle demonstrates up to 0.6dB PSNR gains over other configurations. When comprehensively evaluated on the Kodak, CLIC, and CLIC2020-mobile datasets, HCF outperforms successive-compression methods by up to 5.56% BD-Rate in PSNR on CLIC, while saving up to 97.8% FLOPs, 96.5% GPU memory, and 90.0% execution time. It also outperforms state-of-the-art progressive compression methods by up to 12.64% BD-Rate on Kodak and enables retraining-free cross-quality adaptation with 7.13-10.87% BD-Rate reductions on CLIC2020-mobile.
comment: Accepted at AAAI 2026 as a Conference Paper (Oral Presentation)
Learning few-step posterior samplers by unfolding and distillation of diffusion models
Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm - specifically, the recently proposed LATINO Langevin sampler (Spagnoletti et al., 2025) - representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.
comment: 34 pages, 18 figures, 11 tables
MoReFun: Past-Movement Guided Motion Representation Learning for Future Motion Prediction and Understanding
3D human motion prediction aims to generate coherent future motions from observed sequences, yet existing end-to-end regression frameworks often fail to capture complex dynamics and tend to produce temporally inconsistent or static predictions -- a limitation rooted in representation shortcutting, where models rely on superficial cues rather than learning meaningful motion structure. We propose a two-stage self-supervised framework that decouples representation learning from prediction. In the pretraining stage, the model performs unified past-future self-reconstruction, reconstructing the past sequence while recovering masked joints in the future sequence under full historical guidance. A velocity-based masking strategy selects highly dynamic joints, forcing the model to focus on informative motion components and internalize the statistical dependencies between past and future states without regression interference. In the fine-tuning stage, the pretrained model predicts the entire future sequence, now treated as fully masked, and is further equipped with a lightweight future-text prediction head for joint optimization of low-level motion prediction and high-level motion understanding. Experiments on Human3.6M, 3DPW, and AMASS show that our method reduces average prediction errors by 8.8% over state-of-the-art methods while achieving competitive future-motion understanding performance compared to LLM-based models. Code is available at: https://github.com/JunyuShi02/MoReFun
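The velocity-based masking idea in the MoReFun abstract can be sketched as follows. This is only one plausible reading, not the authors' code: the function name `velocity_mask`, the use of summed frame-to-frame displacement as the "velocity" score, and the top-k selection rule are all assumptions.

```python
import math


def velocity_mask(frames, k):
    """Return the sorted indices of the k most dynamic joints.

    `frames` is a list of poses; each pose is a list of (x, y, z) joint
    positions. A joint's score is its total displacement across consecutive
    frames (an assumed proxy for velocity).
    """
    num_joints = len(frames[0])
    motion = [0.0] * num_joints
    for prev, curr in zip(frames, frames[1:]):
        for j in range(num_joints):
            motion[j] += math.dist(prev[j], curr[j])

    # Rank joints from most to least dynamic and keep the top k.
    ranked = sorted(range(num_joints), key=lambda j: motion[j], reverse=True)
    return sorted(ranked[:k])
```

The returned indices would be the joints hidden from the model during pretraining, so that reconstruction has to rely on the dynamics of the observed past rather than on near-static joints.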
MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation CVPR 2026
As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.
comment: CVPR 2026 Under Review
Not All Regions Are Equal: Attention-Guided Perturbation Network for Industrial Anomaly Detection
In unsupervised image anomaly detection, reconstruction methods aim to train models to capture normal patterns comprehensively for normal data reconstruction. Yet, these models sometimes retain unintended reconstruction capacity for anomalous regions during inference, leading to missed detections. To mitigate this issue, existing works perturb normal samples in a sample-agnostic manner, uniformly adding noise across spatial locations before reconstructing the original. Despite promising results, they disregard the fact that foreground locations are inherently more critical for robust reconstruction. Motivated by this, we present a novel reconstruction framework named Attention-Guided Perturbation Network (AGPNet) for industrial anomaly detection. Its core idea is to add perturbations guided by a sample-aware attention mask to improve the learning of invariant normal patterns at important locations. AGPNet consists of two branches, i.e., a reconstruction branch and an auxiliary attention-based perturbation one. The reconstruction branch learns to reconstruct normal samples, while the auxiliary one aims to produce attention masks to guide the noise perturbation process for normal samples. By perturbing more aggressively at those important regions, we encourage the reconstruction branch to learn inherent normal patterns both comprehensively and robustly. Extensive experiments are conducted on several popular benchmarks covering MVTec-AD, VisA, and MVTec-3D, and show that AGPNet consistently obtains leading anomaly detection performance across a variety of setups, including few-shot, one-class, and multi-class ones.
Region-Wise Correspondence Prediction between Manga Line Art Images
Understanding region-wise correspondences between manga line art images is fundamental for high-level manga processing, supporting downstream tasks such as line art colorization and in-between frame generation. Unlike natural images that contain rich visual cues, manga line art consists only of sparse black-and-white strokes, making it challenging to determine which regions correspond across images. In this work, we introduce a new task: predicting region-wise correspondence between raw manga line art images without any annotations. To address this problem, we propose a Transformer-based framework trained on large-scale, automatically generated region correspondences. The model learns to suppress noisy matches and strengthen consistent structural relationships, resulting in robust patch-level feature alignment within and across images. During inference, our method segments each line art and establishes coherent region-level correspondences through edge-aware clustering and region matching. We construct manually annotated benchmarks for evaluation, and experiments across multiple datasets demonstrate both high patch-level accuracy and strong region-level correspondence performance, achieving 78.4-84.4% region-level accuracy. These results highlight the potential of our method for real-world manga and animation applications.
Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning
Vision-centric hierarchical embodied models have demonstrated strong potential. However, existing methods lack spatial awareness capabilities, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework via explicit spatial modeling and reasoning. Specifically, we first design a spatial-conditioned embodied video generation module to model spatially guided predictions through the spatial plan table. Then, we propose a flow-based action prediction module to infer executable actions with coordination. Finally, we propose a spatial reasoning feedback policy to refine the spatial plan table via dual-stage replanning. Extensive experiments show that SP substantially outperforms state-of-the-art baselines, achieving over 33% improvement on Meta-World and over 25% improvement on iTHOR, demonstrating strong effectiveness across 23 embodied control tasks. We additionally evaluate SP in real-world robotic experiments to verify its practical viability. SP enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at https://plantpotatoonmoon.github.io/SpatialPolicy/.
EventHallusion: Diagnosing Event Hallucinations in Video LLMs
Recently, Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. Despite the remarkable content reasoning and instruction-following capabilities they have demonstrated, the hallucination problem of these VideoLLMs is less explored than its counterpart in the image domain. To bridge this gap, we propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination about events, the crux of video analysis. From a hallucination attribution perspective, our EventHallusion benchmark is curated to assess a VideoLLM's susceptibility toward language priors and vision-language biases. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs. The proposed TCD method rectifies the model's bias toward its priors during the decoding stage by comparing the original video with a modified version, in which temporal cues are disrupted. Through comprehensive evaluation of eight open-source and two closed-source VideoLLMs on the proposed EventHallusion benchmark, we observe that the open-source models suffer significantly from hallucination problems, whereas the closed-source ones perform markedly better. By further equipping open-source VideoLLMs with the proposed TCD approach, evident performance improvements are achieved across most metrics in the EventHallusion benchmark. Our code and benchmark data are available at https://github.com/Stevetich/EventHallusion.
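The comparison at the heart of Temporal Contrastive Decoding can be sketched with the combination rule commonly used in contrastive decoding. The abstract does not give TCD's exact formula, so the rule `(1 + alpha) * original - alpha * disrupted`, the `alpha` parameter, and both function names below are assumptions for illustration.

```python
def contrastive_logits(logits_original, logits_disrupted, alpha=1.0):
    """Boost tokens whose score depends on intact temporal cues.

    `logits_original` come from the unmodified video and `logits_disrupted`
    from the temporally disrupted copy. Tokens scored equally in both passes
    are assumed to be driven by priors, so they gain nothing from the contrast.
    """
    return [
        (1.0 + alpha) * orig - alpha * disr
        for orig, disr in zip(logits_original, logits_disrupted)
    ]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])
```

For example, with original logits `[2.0, 1.9]` and disrupted logits `[2.0, 0.5]`, plain greedy decoding picks token 0, while the contrasted logits favor token 1, whose score drops once temporal order is destroyed.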
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models EMNLP 2025
Video large language models (VideoLLM) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework "Video Compression Commander" (VidCom2). By quantifying each frame's uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing 70.8% of the LLM generation latency. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at https://github.com/xuyang-liu16/VidCom2.
comment: EMNLP 2025 main
Iterative Explainability for Weakly Supervised Segmentation in Medical PE Detection
Pulmonary embolism (PE) is a leading cause of cardiovascular death. Computed tomographic pulmonary angiography (CTPA) is the gold standard for PE diagnosis, with growing interest in AI-based diagnostic assistance. However, these algorithms are limited by scarce fine-grained annotations of thromboembolic burden. We address this challenge with iExplain, a weakly supervised learning algorithm that transforms coarse image-level annotations into detailed pixel-level PE masks through iterative model explainability. Our approach generates soft segmentation maps used to mask detected regions, enabling the process to repeat and discover additional embolisms that would be missed in a single pass. This iterative refinement effectively captures complete PE regions and detects multiple distinct embolisms. Models trained on these automatically generated annotations achieve excellent PE detection performance, with significant improvements at each iteration. We demonstrate iExplain's effectiveness on the RSPECT augmented dataset, achieving results comparable to strongly supervised methods while outperforming existing weakly supervised methods.
comment: Paper accepted at MICAD2025 Previous title: "Label up: Learning pulmonary embolism segmentation from image level annotation through model explainability"
Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving AAAI 2026
Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.
comment: Accepted to AAAI 2026 (Oral)
How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders
Although recent generative models are remarkably capable of producing instruction-following and realistic outputs, they remain prone to notable physical plausibility failures. Though critical in applications, these physical plausibility errors often escape detection by existing evaluation methods. Furthermore, no framework exists for automatically identifying and interpreting specific physical error patterns in natural language, preventing targeted model improvements. We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models. Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple granularity levels. By training on intermediate representations from a physical plausibility classifier and leveraging large multimodal models for interpretation, our method identifies diverse physics-related failure modes without manual feature engineering, achieving superior feature relevance and feature accuracy compared to existing approaches. We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models. Our analysis of eight state-of-the-art generative models provides valuable insights into how these models fail to follow physical constraints, paving the way for further model improvements.
Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models
Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies is enabling these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially in the limited exploration subspace of PEFT. To overcome this challenge, this paper pioneers a systematic exploration of adopting Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify the incompatibility of naive full-ZO adoption in VLCL due to optimization process instability. We then investigate the application of ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise across various training units to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during the ZO optimization process, and we propose a modality-aware ZO strategy, which adopts gradient sign normalization in ZO and constrains vision modality perturbation to further improve performance. Benefiting from the adoption of ZO optimization, PEFT-based VLCL gains a better ability to escape local minima during the optimization process. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
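A zeroth-order update with gradient sign normalization, as named in the abstract above, can be sketched with a standard two-point estimator. This is a generic illustration under stated assumptions, not the paper's method: the function name `zo_sign_step`, the Gaussian perturbation, and the step sizes are hypothetical, and the modality-aware constraint on vision perturbations is not modeled.

```python
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def zo_sign_step(loss_fn, params, lr=0.01, eps=1e-3, rng=random):
    """One zeroth-order step using only loss evaluations (no backprop).

    A random direction u is sampled, the directional slope is estimated from
    two loss evaluations, and each parameter moves by a fixed step `lr`
    against the sign of its estimated gradient component.
    """
    u = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * d for p, d in zip(params, u)]
    minus = [p - eps * d for p, d in zip(params, u)]
    slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    # The estimated gradient is slope * u; only its sign is used.
    return [p - lr * sign(slope * d) for p, d in zip(params, u)]
```

Because only the sign of the estimate is used, the update magnitude no longer depends on the high-variance scale of the two-point estimate, which is the usual motivation for sign normalization.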
MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions AAAI2026
Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet they can produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs' honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/yanxuzhu/MoHoBench.
comment: AAAI2026 Oral
LENS: Learning to Segment Anything with Unified Reinforced Reasoning
Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM). Code is available at https://github.com/hustvl/LENS.
comment: Code is released at https://github.com/hustvl/LENS
SemCo: Toward Semantic Coherent Visual Relationship Forecasting
Visual Relationship Forecasting (VRF) aims to anticipate relations among objects without observing future visual content. The task relies on capturing and modeling the semantic coherence in object interactions, as it underpins the evolution of events and scenes in videos. However, existing VRF datasets offer limited support for learning such coherence due to noisy annotations and weak correlations between different actions and relationship transitions in subject-object pairs. Furthermore, existing methods struggle to distinguish similar relationships and overfit to unchanging relationships in consecutive frames. To address these challenges, we present SemCoBench, a benchmark that emphasizes semantic coherence for visual relationship forecasting. Based on action labels and short-term subject-object pairs, SemCoBench decomposes relationship categories and dynamics by cleaning and reorganizing video datasets so that semantic coherence in object interactions can be predicted. In addition, we present the Semantic Coherent Transformer (SemCoFormer), which models semantic coherence with a Relationship Augmented Module (RAM) and a Coherence Reasoning Module (CRM). RAM is designed to distinguish similar relationships, and CRM helps the model focus on the dynamics in relationships. The experimental results on SemCoBench demonstrate that modeling the semantic coherence is a key step toward reasonable, fine-grained, and diverse visual relationship forecasting, contributing to a more comprehensive understanding of video scenes.
Availability-aware Sensor Fusion via Unified Canonical Space NeurIPS 2025
Sensor fusion of camera, LiDAR, and 4-dimensional (4D) Radar has brought a significant performance improvement in autonomous driving. However, fundamental challenges remain: deeply coupled fusion methods assume continuous sensor availability, making them vulnerable to sensor degradation and failure, whereas sensor-wise cross-attention fusion methods struggle with computational cost and unified feature representation. This paper presents availability-aware sensor fusion (ASF), a novel method that employs unified canonical projection (UCP) to make all sensor features consistent for fusion, and cross-attention across sensors along patches (CASAP) to enhance the robustness of sensor fusion against sensor degradation and failure. As a result, the proposed ASF achieves object detection performance superior to existing state-of-the-art fusion methods under various weather and sensor degradation (or failure) conditions. Extensive experiments on the K-Radar dataset demonstrate that ASF achieves improvements of 9.7% in AP BEV (87.2%) and 20.1% in AP 3D (73.6%) in object detection at IoU=0.5, while requiring a low computational cost. All code is available at https://github.com/kaist-avelab/k-radar.
comment: Accepted at NeurIPS 2025
GAIS: Frame-Level Gated Audio-Visual Integration with Semantic Variance-Scaled Perturbation for Text-Video Retrieval
Text-to-video retrieval requires precise alignment between language and temporally rich audio-video signals. However, existing methods often emphasize visual cues while underutilizing audio semantics or relying on coarse fusion strategies, resulting in suboptimal multimodal representations. We introduce GAIS, a retrieval framework that strengthens multimodal alignment from both representation and regularization perspectives. First, a Frame-level Gated Fusion (FGF) module adaptively integrates audio-visual features under textual guidance, enabling fine-grained temporal selection of informative frames. Second, a Semantic Variance-Scaled Perturbation (SVSP) mechanism regularizes the text embedding space by controlling perturbation magnitude in a semantics-aware manner. These two modules are complementary: FGF minimizes modality gaps through selective fusion, while SVSP improves embedding stability and discrimination. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX demonstrate that GAIS consistently outperforms strong baselines across multiple retrieval metrics while maintaining notable computational efficiency.
comment: 13 pages
Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations
Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods. We address this gap by introducing the Reference-Frame $\times$ Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes: (1) Reference-Frame, distinguishing between pointwise ("Why this prediction?") and contrastive ("Why this and not an alternative?") explanations; and (2) Granularity, ranging from fine-grained class-level (e.g., "Why Husky?") to coarse-grained group-level (e.g., "Why Dog?") interpretations. Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets. By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.
Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, which degrades performance on fine-grained reasoning tasks and limits their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
comment: Need to enhance the method and benchmark to be better
Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline
With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.
YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection
This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
comment: 1 figure, 1 table
Deep Learning and Machine Learning -- Object Detection and Semantic Segmentation: From Theory to Applications
An in-depth exploration of object detection and semantic segmentation is provided, combining theoretical foundations with practical applications. State-of-the-art advancements in machine learning and deep learning are reviewed, focusing on convolutional neural networks (CNNs), YOLO architectures, and transformer-based approaches such as DETR. The integration of artificial intelligence (AI) techniques and large language models for enhancing object detection in complex environments is examined. Additionally, a comprehensive analysis of big data processing is presented, with emphasis on model optimization and performance evaluation metrics. By bridging the gap between traditional methods and modern deep learning frameworks, valuable insights are offered for researchers, data scientists, and engineers aiming to apply AI-driven methodologies to large-scale object detection tasks.
comment: 167 pages
Image and Video Processing
NERD: Network-Regularized Diffusion Sampling For 3D Computed Tomography
Numerous diffusion model (DM)-based methods have been proposed for solving inverse imaging problems. Among these, a recent line of work has demonstrated strong performance by formulating sampling as an optimization procedure that enforces measurement consistency, forward diffusion consistency, and both step-wise and backward diffusion consistency. However, these methods have only considered 2D reconstruction tasks and do not directly extend to 3D image reconstruction problems, such as in Computed Tomography (CT). To bridge this gap, we propose NEtwork-Regularized diffusion sampling for 3D CT (NERD) by incorporating an L1 regularization into the optimization objective. This regularizer encourages spatial continuity across adjacent slices, reducing inter-slice artifacts and promoting coherent volumetric reconstructions. Additionally, we introduce two efficient optimization strategies to solve the resulting objective: one based on the Alternating Direction Method of Multipliers (ADMM) and another based on the Primal-Dual Hybrid Gradient (PDHG) method. Experiments on medical 3D CT data demonstrate that our approach achieves either state-of-the-art or highly competitive results.
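To make the role of the L1 term more concrete, here is a minimal sketch of two ingredients that such an objective would typically involve. The abstract does not give the exact form of the regularizer, so the sketch assumes an L1 penalty on differences between adjacent slices; the function names and the list-of-lists volume layout are hypothetical. Soft-thresholding is the standard proximal operator of the scaled absolute value, which is the kind of closed-form sub-step that ADMM-style and PDHG-style solvers rely on for L1 terms.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * |v|: shrink v toward zero by lam."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def inter_slice_l1(volume):
    """Sum of absolute differences between adjacent slices.

    `volume` is a list of slices; each slice is a flat list of voxel values
    (assumed layout, chosen only to keep the sketch dependency-free).
    """
    total = 0.0
    for lower, upper in zip(volume, volume[1:]):
        total += sum(abs(b - a) for a, b in zip(lower, upper))
    return total
```

A volume whose slices are identical has zero penalty, while abrupt slice-to-slice jumps are penalized in proportion to their magnitude.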
ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and computation efficiency, yielding state-of-the-art compression at real-time throughput on Ford and SemanticKITTI. Code and models will be released upon publication.
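The claim that a Morton (Z-order) hierarchy removes per-level sorting rests on a simple property of bit-interleaved codes, which the sketch below illustrates. It is not ELiC's implementation; the function names and the fine-to-coarse direction are assumptions chosen for brevity. Dropping the three lowest bits of a 3D Morton code yields the code of the parent voxel, and because a right shift is monotone, a sorted list of codes stays sorted after the transition.

```python
def morton3d(x, y, z, bits):
    """Interleave the low `bits` bits of x, y, z into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def parent_codes(sorted_codes):
    """Move one level coarser: drop 3 bits and remove adjacent duplicates.

    The input must be sorted; the output is sorted without re-sorting.
    """
    parents = []
    for code in sorted_codes:
        parent = code >> 3
        if not parents or parents[-1] != parent:
            parents.append(parent)
    return parents
```

The same argument applies in the opposite direction: appending the 3-bit child index to each sorted parent code, in child order, produces sorted codes at the next finer level.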
Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistently competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
comment: 8 pages, 14 images, 3 tables, Recognition Technologies, Inc. Technical Report RTI-20251118-01
Reconstruction of three-dimensional shapes of normal and disease-related erythrocytes from partial observations using multi-fidelity neural networks
Reconstruction of 3D erythrocyte or red blood cell (RBC) morphology from partial observations, such as microscope images, is essential for understanding the physiology of RBC aging and the pathology of various RBC disorders. In this study, we propose a multi-fidelity neural network (MFNN) approach that fuses high-fidelity cross-sections of an RBC with a morphologically similar low-fidelity reference 3D RBC shape to recover its full 3D surface. The MFNN predictor combines a convolutional neural network trained on low-fidelity reference RBC data with a feedforward neural network that captures nonlinear morphological correlations, and augments training with surface area and volume constraints for regularization in the low-fidelity branch. This approach is theoretically grounded by a topological homeomorphism between a sphere and 3D RBC surfaces, with training data generated by dissipative particle dynamics simulations of stomatocyte-discocyte-echinocyte transformation. Benchmarking across diverse RBC shapes observed in normal and aged populations, our results show that the MFNN predictor can reconstruct complex RBC morphologies with over 95% coordinate accuracy when provided with at least two orthogonal cross-sections. We observe that informative oblique cross-sections intersecting spicule tips of echinocytes improve both local and global feature reconstruction, highlighting the value of feature-aware sampling. Our study further evaluates the influence of sampling strategies, shape dissimilarity, and noise, showing enhanced robustness under physically constrained training. Altogether, these results demonstrate the capability of MFNN to reconstruct the 3D shape of normal and aged RBCs from partial cross-sections as observed in conventional microscope images, which could facilitate the quantitative analysis of RBC morphological parameters in normal and disease-related RBC samples.
comment: 29 pages, 10 figures, 3 appendices
SALPA: Spaceborne LiDAR Point Adjustment for Enhanced GEDI Footprint Geolocation
Spaceborne Light Detection and Ranging (LiDAR) systems, such as NASA's Global Ecosystem Dynamics Investigation (GEDI), provide forest structure for global carbon assessments. However, geolocation uncertainties (typically 5-15 m) propagate systematically through derived products, undermining forest profile estimates, including carbon stock assessments. Existing correction methods face critical limitations: waveform simulation approaches achieve meter-level accuracy but require high-resolution LiDAR data unavailable in most regions, while terrain-based methods employ deterministic grid searches that may overlook optimal solutions in continuous solution spaces. We present SALPA (Spaceborne LiDAR Point Adjustment), a multi-algorithm optimization framework integrating three optimization paradigms with five distance metrics. Operating exclusively with globally available digital elevation models and geoid data, SALPA explores continuous solution spaces through gradient-based, evolutionary, and swarm intelligence approaches. Validation across contrasting sites: topographically complex Nikko, Japan, and flat Landes, France, demonstrates 15-16% improvements over original GEDI positions and 0.5-2% improvements over the state-of-the-art GeoGEDI algorithm. L-BFGS-B with Area-based metrics achieves optimal accuracy-efficiency trade-offs, while population-based algorithms (genetic algorithms, particle swarm optimization) excel in complex terrain. The platform-agnostic framework facilitates straightforward adaptation to emerging spaceborne LiDAR missions, providing a generalizable foundation for universal geolocation correction essential for reliable global forest monitoring and climate policy decisions.
comment: 21 pages, 2 figures
A model of the Unity High Definition Render Pipeline, with applications to flat-panel and head-mounted display characterization
Game engines such as Unity and Unreal Engine have become popular tools for creating perceptual and behavioral experiments based on complex, interactive scenes. They are often used with flat-panel displays, and also with head-mounted displays. Here I describe and test a mathematical model of luminance and color in Unity's High Definition Render Pipeline (HDRP). I show that the HDRP has several non-obvious features, such as nonlinearities applied to material properties and rendered values, that must be taken into account in order to show well-controlled stimuli. I also show how the HDRP can be configured to display gamma-corrected luminance and color, and I provide software to create the specialized files needed for gamma correction.
comment: 24 pages, 8 figures
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5$\times$ while retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling. Project website: https://cryosense.github.io.
Time-causal and time-recursive wavelets
When applying wavelet analysis to real-time temporal signals, where the future cannot be accessed, it is essential to base all the steps in the signal processing pipeline on computational mechanisms that are truly time-causal. This paper describes how a time-causal wavelet analysis can be performed based on concepts developed in the area of temporal scale-space theory, originating from a complete classification of temporal smoothing kernels that guarantee non-creation of new structures from finer to coarser temporal scale levels. By necessity, convolution with truncated exponential kernels in cascade constitutes the only permissible class of kernels, together with their temporal derivatives as a natural complement to fulfil the admissibility conditions of wavelet representations. For a particular way of choosing the time constants in the resulting infinite convolution of truncated exponential kernels, to ensure temporal scale covariance and thus self-similarity over temporal scales, we describe how mother wavelets can be chosen as temporal derivatives of the resulting time-causal limit kernel. By developing connections between wavelet theory and scale-space theory, we characterize and quantify how the continuous scaling properties transfer to the discrete implementation, demonstrating how the proposed time-causal wavelet representation can reflect the duration of locally dominant temporal structures in the input signals. We propose that this notion of time-causal wavelet analysis could be a valuable tool for signal processing tasks, where streams of signals are to be processed in real time, specifically for signals that may contain local variations over a rich span of temporal scales, or more generally for analysing physical or biophysical temporal phenomena, where a fully time-causal analysis is called for to be physically realistic.
comment: 28 pages, 11 figures
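The time-causal and time-recursive structure described in this abstract can be illustrated with a cascade of first-order recursive filters, a common discrete counterpart of truncated exponential kernels coupled in cascade. The sketch below is a minimal illustration under assumptions: the time constants are arbitrary rather than the scale-covariant choice discussed in the paper, the function names are hypothetical, and a backward difference stands in for the temporal derivative.

```python
def recursive_smooth(signal, mu):
    """First-order recursive filter with time constant mu.

    Each output uses only the current input and the previous output,
    so the filter is time-causal and needs no buffer of past samples.
    """
    out = []
    y = 0.0
    a = 1.0 / (1.0 + mu)
    for x in signal:
        y = y + a * (x - y)
        out.append(y)
    return out


def cascade(signal, mus):
    """Apply the recursive filters in series, one per time constant."""
    for mu in mus:
        signal = recursive_smooth(signal, mu)
    return signal


def temporal_difference(signal):
    """Backward difference, a causal stand-in for a temporal derivative."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out
```

Feeding a unit impulse through `cascade` gives a smoothing kernel that is zero before the impulse and sums to one, while its temporal difference sums to approximately zero, in line with the zero-mean requirement that the abstract refers to as admissibility.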
Foundation Models in Medical Imaging: A Review and Outlook
Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.
Iterative Explainability for Weakly Supervised Segmentation in Medical PE Detection
Pulmonary embolism (PE) is a leading cause of cardiovascular death. Computed tomographic pulmonary angiography (CTPA) is the gold standard for PE diagnosis, with growing interest in AI-based diagnostic assistance. However, these algorithms are limited by scarce fine-grained annotations of thromboembolic burden. We address this challenge with iExplain, a weakly supervised learning algorithm that transforms coarse image-level annotations into detailed pixel-level PE masks through iterative model explainability. Our approach generates soft segmentation maps used to mask detected regions, enabling the process to repeat and discover additional embolisms that would be missed in a single pass. This iterative refinement effectively captures complete PE regions and detects multiple distinct embolisms. Models trained on these automatically generated annotations achieve excellent PE detection performance, with significant improvements at each iteration. We demonstrate iExplain's effectiveness on the RSPECT augmented dataset, achieving results comparable to strongly supervised methods while outperforming existing weakly supervised methods.
comment: Paper accepted at MICAD2025. Previous title: "Label up: Learning pulmonary embolism segmentation from image level annotation through model explainability"
Subjective and Objective Quality Evaluation of Super-Resolution Enhanced Broadcast Images on a Novel SR-IQA Dataset
To display low-quality broadcast content on high-resolution screens in full-screen format, the application of Super-Resolution (SR), a key consumer technology, is essential. Recently, SR methods have been developed that not only increase resolution while preserving the original image information but also enhance the perceived quality. However, evaluating the quality of SR images generated from low-quality sources, such as SR-enhanced broadcast content, is challenging due to the need to consider both distortions and improvements. Additionally, assessing SR image quality without original high-quality sources presents another significant challenge. Unfortunately, there has been a dearth of research specifically addressing the Image Quality Assessment (IQA) of SR images under these conditions. In this work, we introduce a new IQA dataset for SR broadcast images in both 2K and 4K resolutions. We conducted a subjective quality evaluation to obtain the Mean Opinion Score (MOS) for these SR images and performed a comprehensive human study to identify the key factors influencing the perceived quality. Finally, we evaluated the performance of existing IQA metrics on our dataset. This study reveals the limitations of current metrics, highlighting the need for a more robust IQA metric that better correlates with the perceived quality of SR images.
comment: Accepted for publication in IEEE Access
Towards Prospective Medical Image Reconstruction via Knowledge-Informed Dynamic Optimal Transport
Medical image reconstruction from measurement data is a vital but challenging inverse problem. Deep learning approaches have achieved promising results, but often require paired measurements and high-quality images, which are typically simulated through a forward model, i.e., retrospective reconstruction. However, training on simulated pairs commonly leads to performance degradation on real prospective data due to the retrospective-to-prospective gap caused by incomplete imaging knowledge in simulation. To address this challenge, this paper introduces imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a novel dynamic optimal transport framework, optimal in the sense of preserving consistency with imaging physics during transport, that conceptualizes reconstruction as finding a dynamic transport path. KIDOT learns from unpaired data by modeling reconstruction as a continuous evolution path from measurements to images, guided by an imaging knowledge-informed cost function and transport equation. This dynamic and knowledge-aware approach enhances robustness and better leverages unpaired data while respecting acquisition physics. Theoretically, we demonstrate that KIDOT naturally generalizes dynamic optimal transport, ensuring its mathematical rationale and solution existence. Extensive experiments on MRI and CT reconstruction demonstrate KIDOT's superior performance.
DeSamba: Decoupled Spectral Adaptive Framework for 3D Multi-Sequence MRI Lesion Classification
Magnetic Resonance Imaging (MRI) sequences provide rich spatial and frequency domain information, which is crucial for accurate lesion classification in medical imaging. However, effectively integrating multi-sequence MRI data for robust 3D lesion classification remains a challenge. In this paper, we propose DeSamba (Decoupled Spectral Adaptive Network and Mamba-Based Model), a novel framework designed to extract decoupled representations and adaptively fuse spatial and spectral features for lesion classification. DeSamba introduces a Decoupled Representation Learning Module (DRLM) that decouples features from different MRI sequences through self-reconstruction and cross-reconstruction, and a Spectral Adaptive Modulation Block (SAMB) within the proposed SAMNet, enabling dynamic fusion of spectral and spatial information based on lesion characteristics. We evaluate DeSamba on two clinically relevant 3D datasets. On a six-class spinal metastasis dataset (n=1,448), DeSamba achieves 62.10% Top-1 accuracy, 63.62% F1-score, 87.71% AUC, and 93.55% Top-3 accuracy on an external validation set (n=372), outperforming all state-of-the-art (SOTA) baselines. On a spondylitis dataset (n=251) involving a challenging binary classification task, DeSamba achieves 70.00%/64.52% accuracy and 74.75/73.88 AUC on internal and external validation sets, respectively. Ablation studies demonstrate that both DRLM and SAMB significantly contribute to overall performance, with over 10% relative improvement compared to the baseline. Our results highlight the potential of DeSamba as a generalizable and effective solution for 3D lesion classification in multi-sequence medical imaging.
comment: Our manuscript requires further experimental work, and the dataset cannot be made publicly available; therefore, we respectfully request withdrawal of the paper
Solving Imaging Inverse Problems Using Plug-and-Play Denoisers: Regularization and Optimization Perspectives
Inverse problems lie at the heart of modern imaging science, with broad applications in areas such as medical imaging, remote sensing, and microscopy. Recent years have witnessed a paradigm shift in solving imaging inverse problems, where data-driven regularizers are used increasingly, leading to remarkably high-fidelity reconstruction. A particularly notable approach for data-driven regularization is to use learned image denoisers as implicit priors in iterative image reconstruction algorithms. This chapter presents a comprehensive overview of this powerful and emerging class of algorithms, commonly referred to as plug-and-play (PnP) methods. We begin by providing a brief background on image denoising and inverse problems, followed by a short review of traditional regularization strategies. We then explore how proximal splitting algorithms, such as the alternating direction method of multipliers (ADMM) and proximal gradient descent (PGD), can naturally accommodate learned denoisers in place of proximal operators, and under what conditions such replacements preserve convergence. The role of Tweedie's formula in connecting optimal Gaussian denoisers and score estimation is discussed, which lays the foundation for regularization-by-denoising (RED) and more recent diffusion-based posterior sampling methods. We discuss theoretical advances regarding the convergence of PnP algorithms, both within the RED and proximal settings, emphasizing the structural assumptions that the denoiser must satisfy for convergence, such as non-expansiveness, Lipschitz continuity, and local homogeneity. We also address practical considerations in algorithm design, including choices of denoiser architecture and acceleration strategies.
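The substitution of a denoiser for the proximal operator in proximal gradient descent, as described in the abstract above, can be sketched roughly as follows. This is a minimal illustration, not the chapter's code: the function names `pnp_pgd` and `soft_threshold`, the random forward operator `A`, and the sparse test signal are all assumptions made for the example. Soft thresholding stands in for a learned denoiser; with that choice the loop reduces to classical ISTA.

```python
import numpy as np

def soft_threshold(x, tau):
    """Stand-in denoiser (assumption): proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def pnp_pgd(A, y, denoiser, step, n_iter=200):
    """Plug-and-play proximal gradient descent for the data term 0.5 * ||Ax - y||^2."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)       # gradient of the data-fidelity term
        x = denoiser(x - step * grad)  # the denoiser replaces the proximal operator
    return x

# Hypothetical toy problem: recover a sparse vector from 40 random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [1.0, -2.0, 1.5]
y = A @ x_true

step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
x_hat = pnp_pgd(A, y, lambda v: soft_threshold(v, 0.01 * step), step)
```

In a real PnP pipeline the `denoiser` argument would be a trained network, and convergence would then depend on properties of that network such as non-expansiveness.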
RN-SDEs: Limited-Angle CT Reconstruction with Residual Null-Space Diffusion Stochastic Differential Equations
Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at certain angles, leading to distorted CT images with artifacts. This results in an ill-posed problem known as the Limited Angle Computed Tomography (LACT) reconstruction problem. To address this problem, we propose Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), which are a variant of diffusion models that characterize the diffusion process with mean-reverting (MR) stochastic differential equations. To demonstrate the generalizability of RN-SDEs, our experiments are conducted on two different LACT datasets, i.e., ChromSTEM and C4KC-KiTS. Through extensive experiments, we show that by leveraging learned Mean-Reverting SDEs as a prior and emphasizing data consistency using Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs can restore high-quality images from severe degradation and achieve state-of-the-art performance in most LACT tasks. Additionally, we present a quantitative comparison of computational complexity and runtime efficiency, highlighting the superior effectiveness of our proposed approach.
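As a rough illustration of range-null space decomposition based rectification, the sketch below keeps the null-space component of an estimate and overwrites its range-space component with the one implied by the measurements. It assumes a small dense linear operator `A` and uses the hypothetical name `range_null_rectify`; it is not the authors' implementation, which works with a CT projection operator and a learned mean-reverting SDE prior.

```python
import numpy as np

def range_null_rectify(A, y, x_est):
    """Return A^+ y + (I - A^+ A) x_est, which is consistent with y when y lies in range(A)."""
    A_pinv = np.linalg.pinv(A)
    range_part = A_pinv @ y                    # component fixed by the measurements
    null_part = x_est - A_pinv @ (A @ x_est)   # component the measurements cannot see
    return range_part + null_part

# Hypothetical toy problem: 20 measurements of a 50-dimensional signal.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = rng.standard_normal(50)
y = A @ x_true
x_est = rng.standard_normal(50)   # stands in for a prior or generative estimate
x_rect = range_null_rectify(A, y, x_est)
```

After rectification, `A @ x_rect` reproduces `y` up to numerical precision, while the part of `x_est` lying in the null space of `A` is left untouched.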
Multimedia
DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation
Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals AAAI 2026
Reconstructing video from brain signals is an important brain decoding task. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the high cost of collecting brain-video data causes severe data scarcity. Although some cross-subject methods have been introduced, they often overfocus on subject-invariant information while neglecting subject-specific information, resulting in slow fine-tuning-based adaptation. To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject framework. MindCross's N specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-K collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects' encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross's efficacy and efficiency of cross-subject decoding and new subject adaptation using only one model.
comment: AAAI 2026, 16 pages
Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services
Timely and accurate pre-arrival video streaming and analytics are critical for emergency medical services (EMS) to deliver life-saving interventions. Yet, current-generation EMS infrastructure remains constrained by one-to-one video streaming and limited analytics capabilities, leaving dispatchers and EMTs to manually interpret overwhelming, often noisy or redundant information in high-stress environments. We present TeleEMS, a mobile live video analytics system that enables pre-arrival multimodal inference by fusing audio and video into a unified decision-making pipeline before EMTs arrive on scene. TeleEMS comprises two key components: TeleEMS Client and TeleEMS Server. The TeleEMS Client runs across phones, smart glasses, and desktops to support bystanders, EMTs en route, and 911 dispatchers. The TeleEMS Server, deployed at the edge, integrates EMS-Stream, a communication backbone that enables smooth multi-party video streaming. On top of EMS-Stream, the server hosts three real-time analytics modules: (1) audio-to-symptom analytics via EMSLlama, a domain-specialized LLM for robust symptom extraction and normalization; (2) video-to-vital analytics using state-of-the-art rPPG methods for heart rate estimation; and (3) joint text-vital analytics via PreNet, a multimodal multitask model predicting EMS protocols, medication types, medication quantities, and procedures. Evaluation shows that EMSLlama outperforms GPT-4o (exact-match 0.89 vs. 0.57) and that text-vital fusion improves inference robustness, enabling reliable pre-arrival intervention recommendations. TeleEMS demonstrates the potential of mobile live video analytics to transform EMS operations, bridging the gap between bystanders, dispatchers, and EMTs, and paving the way for next-generation intelligent EMS infrastructure.
Privis: Towards Content-Aware Secure Volumetric Video Delivery
Volumetric video has emerged as a key paradigm in eXtended Reality (XR) and immersive multimedia because it enables highly interactive, spatially consistent 3D experiences. However, the transport-layer security for such 3D content remains largely unaddressed. Existing volumetric streaming pipelines inherit uniform encryption schemes from 2D video, overlooking the heterogeneous privacy sensitivity of different geometry and the strict motion-to-photon latency constraints of real-time XR. We take an initial step toward content-aware secure volumetric video delivery by introducing Privis, a saliency-guided transport framework that (i) partitions volumetric assets into independent units, (ii) applies lightweight authenticated encryption with adaptive key rotation, and (iii) employs selective traffic shaping to balance confidentiality and low latency. Privis specifies a generalized transport-layer security architecture for volumetric media, defining core abstractions and adaptive protection mechanisms. We further explore a prototype implementation and present initial latency measurements to illustrate feasibility and design tradeoffs, providing early empirical guidance toward future work on real-time, saliency-conditioned secure delivery.
CPSL: Representing Volumetric Video via Content-Promoted Scene Layers
Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations, from explicit point clouds to implicit neural fields, remain costly in capture, computation, and rendering, which limits their scalability for on-demand video and reduces their feasibility for real-time communication. To bridge this gap, we propose Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that brings the perceptual benefits of volumetric video to conventional 2D content. Guided by per-frame depth and content saliency, CPSL decomposes each frame into a small set of geometry-consistent layers equipped with soft alpha bands and an edge-depth cache that jointly preserve occlusion ordering and boundary continuity. These lightweight, 2D-encodable assets enable parallax-corrected novel-view synthesis via depth-weighted warping and front-to-back alpha compositing, bypassing expensive 3D reconstruction. Temporally, CPSL maintains inter-frame coherence using motion-guided propagation and per-layer encoding, supporting real-time playback with standard video codecs. Across multiple benchmarks, CPSL achieves superior perceptual quality and boundary fidelity compared with layer-based and neural-field baselines while reducing storage and rendering cost severalfold. Our approach offers a practical path from 2D video to scalable 2.5D immersive media.
MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers
Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, a zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.
comment: This work has been submitted to the IEEE for possible publication
TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech AAAI 2026
Sketching is a widely used medium for generating and exploring early-stage design concepts. While generative AI (GenAI) chatbots are increasingly used for idea generation, designers often struggle to craft effective prompts and find it difficult to express evolving visual concepts through text alone. In a formative study (N=6), we examined how designers use GenAI during ideation, revealing that text-based prompting disrupts creative flow. To address these issues, we developed TalkSketch, an embedded multimodal AI sketching system that integrates freehand drawing with real-time speech input. TalkSketch aims to support a more fluid ideation process by capturing verbal descriptions during sketching and generating context-aware AI responses. Our work highlights the potential of GenAI tools to engage the design process itself rather than focusing on output.
comment: Accepted at AAAI 2026 Workshop on Creative AI for Live Interactive Performances (CLIP). To be published in Springer CCIS series
To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance
Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition (rather than images alone) aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model implicitly models object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, Iris, we investigate the benefits of our strategy to integrate text descriptions into training and inference of diffusion-based depth estimation models. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves the overall monocular depth estimation accuracy, especially in small areas. It also improves the model's depth perception of specific regions described in the text. We find that by providing more details in the text, the depth prediction can be iteratively refined. Simultaneously, we find that language can act as a constraint to accelerate the convergence of both training and the inference diffusion trajectory. Code and generated text data will be released upon acceptance.
Computation and Language
Strategic Innovation Management in the Age of Large Language Models Market Intelligence, Adaptive R&D, and Ethical Governance
This study analyzes the multiple functions of Large Language Models (LLMs) in transforming research and development (R&D) processes. By automating knowledge discovery, boosting hypothesis creation, integrating transdisciplinary insights, and enabling cooperation within innovation ecosystems, LLMs dramatically improve the efficiency and effectiveness of research processes. Through extensive analysis of scientific literature, patent databases, and experimental data, these models enable more flexible and informed R&D workflows, ultimately accelerating innovation cycles and lowering time-to-market for breakthrough ideas.
Subword Tokenization Strategies for Kurdish Word Embeddings
We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and offer a comparison of tokenization methods for low-resource languages.
Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances AAAI
Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR
comment: To be published in the Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026 Special Track on AI for Social Impact)
Ground Truth Generation for Multilingual Historical NLP using LLMs
Historical and low-resource NLP remains challenging due to limited annotated data and domain mismatches with modern, web-sourced corpora. This paper outlines our work in using large language models (LLMs) to create ground-truth annotations for historical French (16th-20th centuries) and Chinese (1900-1950) texts. By leveraging LLM-generated ground truth on a subset of our corpus, we were able to fine-tune spaCy to achieve significant gains on period-specific tests for part-of-speech (POS) annotations, lemmatization, and named entity recognition (NER). Our results underscore the importance of domain-specific models and demonstrate that even relatively limited amounts of synthetic data can improve NLP tools for under-resourced corpora in computational humanities research.
comment: 13 pages, 5 tables, 1 figure
Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries NeurIPS 2025
Large Language Models have demonstrated the ability to generalize well at many levels across domains and modalities, and have even shown in-context learning capabilities. This raises research questions regarding how they can be used to encode physical information that is usually only available from scientific measurements and only loosely encoded in textual descriptions. Using astrophysics as a test bed, we investigate whether LLM embeddings can codify physical summary statistics that are obtained from scientific measurements through two main questions: 1) Does prompting play a role in how those quantities are codified by the LLM? and 2) What aspects of language are most important in encoding the physics represented by the measurement? We investigate this using sparse autoencoders that extract interpretable features from the text.
comment: Accepted to the Machine Learning and the Physical Sciences Workshop at NeurIPS 2025, 11 pages, 4 figures
SMRC: Aligning Large Language Models with Student Reasoning for Mathematical Error Correction
Large language models (LLMs) often make reasoning errors when solving mathematical problems, and how to automatically detect and correct these errors has become an important research direction. However, existing approaches mainly focus on self-correction within the model, which falls short of the "teacher-style" correction required in educational settings, i.e., systematically guiding and revising a student's problem-solving process. To address this gap, we propose SMRC (Student Mathematical Reasoning Correction), a novel method that aligns LLMs with student reasoning. Specifically, SMRC formulates student reasoning as a multi-step sequential decision problem and introduces Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of annotating process-level rewards, we leverage breadth-first search (BFS) guided by LLMs and final-answer evaluation to generate reward signals, which are then distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. Additionally, we construct a benchmark for high school mathematics, MSEB (Multi-Solution Error Benchmark), consisting of 158 instances that include problem statements, student solutions, and correct reasoning steps. We further propose a dual evaluation protocol centered on solution accuracy and correct-step retention, offering a comprehensive measure of educational applicability. Experiments demonstrate that SMRC significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and our MSEB in terms of effectiveness and overall performance. The code and data are available at https://github.com/Mind-Lab-ECNU/SMRC.
comment: 13 pages, 3 figures
Quadratic Term Correction on Heaps' Law
Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, using twenty English novels or writings (some translated from another language into English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and a quadratic term consistently lead to a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02. Using the "random drawing of colored balls from a bag with replacement" model, we have shown that the curvature in log-log scale is identical to a "pseudo-variance", which is negative. Although a pseudo-variance calculation may encounter numerical instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimate of the curvature when the number of tokens is small.
comment: 3 figures
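The regression described in the abstract above, log(type) against log(token) with both a linear and a quadratic term, might be set up along the following lines. This is only a sketch: the helper names `type_token_curve` and `fit_loglog_quadratic` are hypothetical, and the use of natural logarithms is an assumption (the fitted quadratic coefficient scales with the logarithm base, which the abstract does not state).

```python
import numpy as np

def type_token_curve(tokens):
    """Return (token counts, distinct type counts) after each token in a text."""
    seen = set()
    n_types = []
    for tok in tokens:
        seen.add(tok)
        n_types.append(len(seen))
    return np.arange(1, len(tokens) + 1), np.array(n_types)

def fit_loglog_quadratic(n_tokens, n_types):
    """Least-squares fit of log V = c0 + c1 * log N + c2 * (log N)^2 (natural log assumed)."""
    x = np.log(n_tokens)
    y = np.log(n_types)
    c2, c1, c0 = np.polyfit(x, y, 2)  # polyfit returns the highest degree first
    return c0, c1, c2

# Tiny illustration of the type-token curve on a made-up token sequence.
n, v = type_token_curve("a b a c b d".split())
```

Applied to a full novel, `c1` and `c2` would be the linear and quadratic coefficients that the abstract reports as slightly above 1 and around -0.02.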
Streamlining Industrial Contract Management with Retrieval-Augmented LLMs
Contract management involves reviewing and negotiating provisions, individual clauses that define rights, obligations, and terms of agreement. During this process, revisions to provisions are proposed and iteratively refined, some of which may be problematic or unacceptable. Automating this workflow is challenging due to the scarcity of labeled data and the abundance of unstructured legacy contracts. In this paper, we present a modular framework designed to streamline contract management through a retrieval-augmented generation (RAG) pipeline. Our system integrates synthetic data generation, semantic clause retrieval, acceptability classification, and reward-based alignment to flag problematic revisions and generate improved alternatives. Developed and evaluated in collaboration with an industry partner, our system achieves over 80% accuracy in both identifying and optimizing problematic revisions, demonstrating strong performance under real-world, low-resource conditions and offering a practical means of accelerating contract revision workflows.
Bias in, Bias out: Annotation Bias in Multilingual Large Language Models
Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings, and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.
Graded strength of comparative illusions is explained by Bayesian inference
Like visual processing, language processing is susceptible to illusions in which people systematically misperceive stimuli. In one such case--the comparative illusion (CI), e.g., More students have been to Russia than I have--comprehenders tend to judge the sentence as acceptable despite its underlying nonsensical comparison. Prior research has argued that this phenomenon can be explained as Bayesian inference over a noisy channel: the posterior probability of an interpretation of a sentence is proportional to both the prior probability of that interpretation and the likelihood of corruption into the observed (CI) sentence. Initial behavioral work has supported this claim by evaluating a narrow set of alternative interpretations of CI sentences and showing that comprehenders favor interpretations that are more likely to have been corrupted into the illusory sentence. In this study, we replicate and go substantially beyond this earlier work by directly predicting the strength of illusion with a quantitative model of the posterior probability of plausible interpretations, which we derive through a novel synthesis of statistical language models with human behavioral data. Our model explains not only the fine gradations in the strength of CI effects, but also a previously unexplained effect caused by pronominal vs. full noun phrase than-clause subjects. These findings support a noisy-channel theory of sentence comprehension by demonstrating that the theory makes novel predictions about the comparative illusion that bear out empirically. This outcome joins related evidence of noisy channel processing in both illusory and non-illusory contexts to support noisy channel inference as a unified computational-level theory of diverse language processing phenomena.
comment: 49 pages, 7 figures
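The noisy-channel computation summarized in the abstract above (posterior proportional to the prior of an interpretation times the likelihood of its corruption into the observed sentence) can be illustrated with a few lines of arithmetic. The candidate interpretations and all probabilities below are made-up placeholders, not values from the paper, and `noisy_channel_posterior` is a hypothetical helper name.

```python
def noisy_channel_posterior(priors, likelihoods):
    """Normalize prior * likelihood over a finite set of candidate interpretations."""
    unnormalized = {k: priors[k] * likelihoods[k] for k in priors}
    total = sum(unnormalized.values())
    return {k: value / total for k, value in unnormalized.items()}

# Hypothetical candidates for an observed comparative-illusion sentence.
priors = {"interpretation_a": 0.5, "interpretation_b": 0.5}
likelihoods = {"interpretation_a": 0.3, "interpretation_b": 0.1}  # P(observed | intended)
posterior = noisy_channel_posterior(priors, likelihoods)
```

With equal priors, the interpretation that is three times more likely to be corrupted into the observed sentence receives three quarters of the posterior mass.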
A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases
Rare diseases affect hundreds of millions worldwide, yet diagnosis often spans years. Conventional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general/medical large language models (LLMs) face scarce real-world electronic health records (EHRs), stale domain knowledge, and hallucinations. We assemble a large, domain-specialized clinical corpus and a clinician-validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state-of-the-art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non-phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative-first, knowledge-integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.
comment: 50 pages, 5 figures
Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities
We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
Bridging Human and Model Perspectives: A Comparative Analysis of Political Bias Detection in News Media Using Large Language Models
Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although recent advances in Natural Language Processing (NLP) have enabled automatic bias classification, the extent to which large language models (LLMs) align with human judgment remains underexplored. This study presents a comparative framework for evaluating the detection of political bias across human annotations and multiple LLMs, including GPT, BERT, RoBERTa, and FLAN. We construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify divergence between human and model perceptions of bias. Experimental results show that among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT demonstrate the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, our fine-tuned RoBERTa model achieved the highest accuracy and the strongest alignment with human-annotated labels. Our findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.
A Method for Characterizing Disease Progression from Acute Kidney Injury to Chronic Kidney Disease
Patients with acute kidney injury (AKI) are at high risk of developing chronic kidney disease (CKD), but identifying those at greatest risk remains challenging. We used electronic health record (EHR) data to dynamically track AKI patients' clinical evolution and characterize AKI-to-CKD progression. Post-AKI clinical states were identified by clustering patient vectors derived from longitudinal medical codes and creatinine measurements. Transition probabilities between states and progression to CKD were estimated using multi-state modeling. After identifying common post-AKI trajectories, CKD risk factors in AKI subpopulations were identified through survival analysis. Of 20,699 patients with AKI at admission, 3,491 (17%) developed CKD. We identified fifteen distinct post-AKI states, each with different probabilities of CKD development. Most patients (75%, n=15,607) remained in a single state or made only one transition during the study period. Both established (e.g., AKI severity, diabetes, hypertension, heart failure, liver disease) and novel CKD risk factors were identified, with their impact varying across these clinical states. This study demonstrates a data-driven approach for identifying high-risk AKI patients, supporting the development of decision-support tools for early CKD detection and intervention.
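As a rough illustration of the transition-probability step described above, the sketch below estimates a first-order transition matrix by counting consecutive state pairs in per-patient state sequences. The state labels, the toy sequences, and the function name `estimate_transitions` are hypothetical. The study's actual multi-state model, which would also need to account for time and censoring, is not specified in the abstract.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequences):
    """Estimate P(next state | current state) from observed state sequences.

    Simple count-based (maximum-likelihood) sketch; an assumption for
    illustration, not the multi-state model used in the study.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each consecutive (current, next) pair.
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs

# Hypothetical toy trajectories; "CKD" is treated as just another state label.
patients = [
    ["S1", "S1", "CKD"],
    ["S1", "S2", "S2"],
    ["S2", "CKD"],
    ["S1", "S1", "S1"],
]
transitions = estimate_transitions(patients)
```

Each row of the resulting table sums to one, so `transitions["S1"]` can be read directly as the estimated distribution over the next state for a patient currently in the hypothetical state `S1`.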
Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak
Document-level claim extraction remains an open challenge in the field of fact-checking, and consequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on a newly collected dataset of claims extracted from comments under Czech and Slovak news articles, a domain that poses additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods that can correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
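To make the idea of an alignment score between two claim sets concrete, here is a minimal sketch. It assumes a token-level Jaccard similarity and a brute-force search for the best one-to-one alignment, normalized by the size of the larger set. These choices, and the names `jaccard` and `alignment_score`, are illustrative assumptions and not the similarity measures or alignment methods evaluated in the paper.

```python
from itertools import permutations

def jaccard(a, b):
    """Token-overlap similarity between two claims (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def alignment_score(predicted, reference, sim=jaccard):
    """Best one-to-one alignment score, normalized by the larger set size.

    Unmatched claims in the larger set contribute zero, so missing or
    spurious claims lower the score.
    """
    if not predicted or not reference:
        return 0.0
    small, large = sorted((predicted, reference), key=len)
    best = 0.0
    # Brute force over assignments; only practical for small claim sets.
    for perm in permutations(range(len(large)), len(small)):
        total = sum(sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)

# Hypothetical claim sets for one source document.
predicted = ["the mayor resigned", "taxes will rise"]
reference = ["taxes will rise", "the mayor resigned", "schools are closed"]
score = alignment_score(predicted, reference)
```

For larger claim sets, the brute-force loop could be replaced by an assignment solver such as SciPy's `linear_sum_assignment`, and the Jaccard function by an embedding-based similarity.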
LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
With Retrieval Augmented Generation (RAG) becoming increasingly prominent in generative AI solutions, there is an emerging need to systematically evaluate the effectiveness of RAG-based systems. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
comment: 14 pages, 4 figures, 5 tables
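The difficulty and discriminability scores mentioned in the LiveRAG abstract come from an Item Response Theory model. The abstract does not state which IRT variant was fitted, so the sketch below assumes the standard two-parameter logistic (2PL) form purely for illustration. The function name `p_correct` and the numeric values are hypothetical.

```python
import math

def p_correct(ability, difficulty, discrimination):
    """2PL IRT: probability that a system with the given ability answers
    an item correctly. Assumed model form, not confirmed by the abstract."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical values: a system whose ability equals the item difficulty
# has a 50% chance of a correct answer, whatever the discrimination.
at_threshold = p_correct(ability=0.0, difficulty=0.0, discrimination=1.5)

# A more discriminating item separates strong from weak systems more sharply.
sharp = p_correct(ability=1.0, difficulty=0.0, discrimination=2.0)
flat = p_correct(ability=1.0, difficulty=0.0, discrimination=1.0)
```

Under this assumed form, a question with high discrimination is the kind that is most useful for differentiating between system capabilities.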
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.
comment: This paper serves as the technical report of the Agent-R1 project
Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning ACL
We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.
comment: 8 pages, 2 figures, 1 Table. Submitted to the Computation and Language (cs.CL) category. Uses the ACL-style template. Code and demo will be released at: https://github.com/trystine/Tell_Me_Mental_Wellbeing_System
MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.
Mitigating Label Length Bias in Large Language Models AACL 2025
Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
comment: Accepted to AACL 2025 (Main)
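The label length issue described in the abstract above can be seen with a toy example: summing token log-probabilities penalizes longer labels, and per-token averaging alone may not remove the bias. The sketch below combines length normalization with a simple calibration against a content-free input. It is only one plausible reading of the general idea, with hypothetical names and made-up log-probabilities, and should not be taken as the NCC formulation from the paper.

```python
def length_normalized(token_logprobs):
    """Mean per-token log-probability of a (possibly multi-token) label."""
    return sum(token_logprobs) / len(token_logprobs)

def calibrated_scores(label_logprobs, prior_logprobs):
    """Subtract each label's normalized score on a content-free input.

    Illustrative assumption: calibration is applied per full label.
    """
    return {
        label: length_normalized(lp) - length_normalized(prior_logprobs[label])
        for label, lp in label_logprobs.items()
    }

# Made-up token log-probabilities for two candidate labels.
label_logprobs = {
    "sports": [-1.0],
    "science and technology": [-0.8, -0.9, -0.7],
}
# Made-up log-probabilities of the same labels given a content-free input.
prior_logprobs = {
    "sports": [-2.0],
    "science and technology": [-0.5, -0.5, -0.5],
}

by_sum = max(label_logprobs, key=lambda k: sum(label_logprobs[k]))
by_mean = max(label_logprobs, key=lambda k: length_normalized(label_logprobs[k]))
scores = calibrated_scores(label_logprobs, prior_logprobs)
by_calibrated = max(scores, key=scores.get)
```

In this toy setup the three scoring rules disagree: the raw sum picks the short label, the per-token mean picks the long one, and the calibrated score picks the short one again because the long label is already likely on the content-free input.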
O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model AAAI 2026
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and (d) visual question answering (VQA), conducted on three existing sketch datasets (QuickDraw!, Sketchy, and Tu Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
comment: Accepted to AAAI 2026
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
comment: 39 pages
The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models
The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension, i.e., augmenting a pretrained LLM's vocabulary with chemically salient tokens, followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.
SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature
The accelerating growth of scientific publications has intensified the need for scalable, trustworthy systems to synthesize knowledge across diverse literature. While recent retrieval-augmented generation (RAG) methods have improved access to scientific information, they often overlook citation graph structure, adapt poorly to complex queries, and yield fragmented, hard-to-verify syntheses. We introduce SciRAG, an open-source framework for scientific literature exploration that addresses these gaps through three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter supporting documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution. Extensive experiments across multiple benchmarks such as QASA and ScholarQA demonstrate that SciRAG outperforms prior systems in factual accuracy and synthesis quality, establishing a new foundation for reliable, large-scale scientific knowledge aggregation.
ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions AAAI 2026
Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
comment: Accepted to AAAI 2026
Steganographic Backdoor Attacks in NLP: Ultra-Low Poisoning and Defense Evasion
Transformer models are foundational to natural language processing (NLP) applications, yet remain vulnerable to backdoor attacks introduced through poisoned data, which implant hidden behaviors during training. To strengthen the ability to prevent such compromises, recent research has focused on designing increasingly stealthy attacks to stress-test existing defenses, pairing backdoor behaviors with stylized artifact or token-level perturbation triggers. However, this trend diverts attention from the harder and more realistic case: making the model respond to semantic triggers such as specific names or entities, where a successful backdoor could manipulate outputs tied to real people or events in deployed systems. Motivated by this growing disconnect, we introduce SteganoBackdoor, bringing stealth techniques back into line with practical threat models. Leveraging innocuous properties from natural-language steganography, SteganoBackdoor applies a gradient-guided data optimization process to transform semantic trigger seeds into steganographic carriers that embed a high backdoor payload, remain fluent, and exhibit no representational resemblance to the trigger. Across diverse experimental settings, SteganoBackdoor achieves over 99% attack success at an order-of-magnitude lower data-poisoning rate than prior approaches while maintaining unparalleled evasion against a comprehensive suite of data-level defenses. By revealing this practical and covert attack, SteganoBackdoor highlights an urgent blind spot in current defenses and demands immediate attention to adversarial data defenses and real-world threat modeling.
DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning
In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space
Knowing the reliability of a model's response is essential in real-world applications. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of relying on a single guess, and to carefully assign confidence scores to meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.
Entropy-Guided Reasoning Compression
Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process -- the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
comment: 10 pages, 4 figures
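The entropy discussed in the abstract above is a property of the model's next-token distribution. As a minimal sketch, the snippet below computes Shannon entropy for a few hypothetical distributions. The function name `token_entropy` and the probabilities are illustrative and do not come from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical distributions over four candidate next tokens.
confident = [0.97, 0.01, 0.01, 0.01]   # e.g., continuing a formula
uncertain = [0.25, 0.25, 0.25, 0.25]   # e.g., choosing a logical connector

low = token_entropy(confident)
high = token_entropy(uncertain)
```

Positions where many continuations are plausible, such as the logical connectors the abstract mentions, would show up as high-entropy tokens under this measure.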
AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR AACL 2025
Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversations drawn from various open African accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment; however, hallucinations remain a major problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use-cases, fostering inclusive voice applications for underserved communities.
comment: Accepted as a conference paper at IJCNLP-AACL 2025
Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning AAAI 2026
The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker's timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor's final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.
comment: Accepted by AAAI 2026
MuCPT: Music-related Natural Language Model Continued Pretraining
Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during optimization, reducing noise gradients and amplifying task-aligned signals, thereby enabling more effective music-domain continued pretraining and alignment. To assess factuality, we design the MusicSimpleQA benchmark, which adopts short, single-answer prompts with automated agreement scoring. Beyond the benchmark design, we conduct systematic comparisons along the axes of data composition. Overall, this work advances both the right corpus and the right objective, offering a scalable data-training framework and a reusable evaluation tool for building domain LLMs in the music field.
ArbESC+: Arabic Enhanced Edit Selection System Combination for Grammatical Error Correction: Resolving Conflict and Improving System Combination in Arabic GEC
Grammatical Error Correction (GEC) is an important aspect of natural language processing. Arabic has a complicated morphological and syntactic structure, posing a greater challenge than other languages. Even though modern neural models have improved greatly in recent years, the majority of previous attempts used individual models without taking into account the potential benefits of combining different systems. In this paper, we present one of the first multi-system approaches for correcting grammatical errors in Arabic, the Arabic Enhanced Edit Selection System Combination (ArbESC+). Several models are used to collect correction proposals, which are represented as numerical features in the framework. A classifier determines and implements the appropriate corrections based on these features. In order to improve output quality, the framework uses support techniques to filter overlapping corrections and estimate decision reliability. A combination of AraT5, ByT5, mT5, AraBART, AraBART+Morph+GEC, and Text editing systems gave better results than a single model alone, with F0.5 at 82.63% on QALB-14 test data, 84.64% on QALB-15 L1 data, and 65.55% on QALB-15 L2 data. One of the most significant contributions of this work is that it is the first attempt to integrate linguistic error correction systems for Arabic. Improving existing models provides a practical step towards developing advanced tools that will benefit users and researchers of Arabic text processing.
comment: 26 pages
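The F0.5 figures reported in the ArbESC+ abstract above use the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall. A minimal Python sketch of that formula follows; the function name `f_beta` and the example values are illustrative assumptions, not taken from the paper or from the official QALB scorer.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: high precision is rewarded more than high recall.
high_precision = f_beta(precision=1.0, recall=0.5)
high_recall = f_beta(precision=0.5, recall=1.0)
```

With beta = 0.5, a system that proposes fewer but more reliable edits scores higher than one with the mirrored precision and recall, which is the usual motivation for reporting F0.5 in GEC.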
Harnessing Deep LLM Participation for Robust Entity Linking
Entity Linking (EL), the task of mapping textual entity mentions to their corresponding entries in knowledge bases, constitutes a fundamental component of natural language understanding. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable potential for enhancing EL performance. Prior research has leveraged LLMs to improve entity disambiguation and input representation, yielding significant gains in accuracy and robustness. However, these approaches typically apply LLMs to isolated stages of the EL task, failing to fully integrate their capabilities throughout the entire process. In this work, we introduce DeepEL, a comprehensive framework that incorporates LLMs into every stage of the entity linking task. Furthermore, we identify that disambiguating entities in isolation is insufficient for optimal performance. To address this limitation, we propose a novel self-validation mechanism that utilizes global contextual information, enabling LLMs to rectify their own predictions and better recognize cohesive relationships among entities within the same sentence. Extensive empirical evaluation across ten benchmark datasets demonstrates that DeepEL substantially outperforms existing state-of-the-art methods, achieving an average improvement of 2.6% in overall F1 score and a remarkable 4% gain on out-of-domain datasets. These results underscore the efficacy of deep LLM integration in advancing the state-of-the-art in entity linking.
SymLoc: Symbolic Localization of Hallucination across HaluEval and TruthfulQA
LLMs still struggle with hallucination, especially when confronted with symbolic triggers like modifiers, negation, numbers, exceptions, and named entities. Yet, we lack a clear understanding of where these symbolic hallucinations originate, making it crucial to systematically handle such triggers and localize the emergence of hallucination inside the model. While prior work explored localization using statistical techniques like LSC and activation variance analysis, these methods treat all tokens equally and overlook the role symbolic linguistic knowledge plays in triggering hallucinations. So far, no approach has investigated how symbolic elements specifically drive hallucination failures across model layers, nor has symbolic linguistic knowledge been used as the foundation for a localization framework. We propose the first symbolic localization framework that leverages symbolic linguistic and semantic knowledge to meaningfully trace the development of hallucinations across all model layers. By focusing on how models process symbolic triggers, we analyze five models using HaluEval and TruthfulQA. Our symbolic knowledge approach reveals that attention variance for these linguistic elements explodes to critical instability in early layers (2-4), with negation triggering catastrophic variance levels, demonstrating that symbolic semantic processing breaks down from the very beginning. Through the lens of symbolic linguistic knowledge, despite larger model sizes, hallucination rates remain consistently high (78.3%-83.7% across Gemma variants), with steep attention drops for symbolic semantic triggers throughout deeper layers. Our findings demonstrate that hallucination is fundamentally a symbolic linguistic processing failure, not a general generation problem, revealing that symbolic semantic knowledge provides the key to understanding and localizing hallucination mechanisms in LLMs.
Selective Weak-to-Strong Generalization AAAI2025
Future superhuman models will surpass the ability of humans, and humans will only be able to weakly supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
comment: AAAI2025 Special Track on AI Alignment
Applying Relation Extraction and Graph Matching to Answering Multiple Choice Questions KR
In this research, we combine Transformer-based relation extraction with matching of knowledge graphs (KGs) and apply them to answering multiple-choice questions (MCQs) while maintaining the traceability of the output process. KGs are structured representations of factual knowledge consisting of entities and relations. Due to the high construction cost, they had been regarded as static databases with validated links. However, the recent development of Transformer-based relation extraction (RE) methods has enabled us to generate KGs dynamically by giving them natural language texts, and thereby opened the possibility for representing the meaning of the input sentences with the created KGs. Using this effect, we propose a method that answers MCQs in the "fill-in-the-blank" format, taking care of the point that RE methods generate KGs that represent false information if provided with factually incorrect texts. We measure the truthfulness of each question sentence by (i) converting the sentence into a relational graph using an RE method and (ii) verifying it against factually correct KGs under the closed-world assumption. The experimental results demonstrate that our method correctly answers up to around 70% of the questions, while providing traceability of the procedure. We also highlight that the question category has a vast influence on the accuracy.
comment: Presented at NeLaMKRR@KR, 2025 (arXiv:2511.09575)
From Graphs to Hypergraphs: Enhancing Aspect-Based Sentiment Analysis via Multi-Level Relational Modeling
Aspect-Based Sentiment Analysis (ABSA) predicts sentiment polarity for specific aspect terms, a task made difficult by conflicting sentiments across aspects and the sparse context of short texts. Prior graph-based approaches model only pairwise dependencies, forcing them to construct multiple graphs for different relational views. These introduce redundancy, parameter overhead, and error propagation during fusion, limiting robustness in short-text, low-resource settings. We present HyperABSA, a dynamic hypergraph framework that induces aspect-opinion structures through sample-specific hierarchical clustering. To construct these hyperedges, we introduce a novel acceleration-fallback cutoff for hierarchical clustering, which adaptively determines the level of granularity. Experiments on three benchmarks (Lap14, Rest14, MAMS) show consistent improvements over strong graph baselines, with substantial gains when paired with RoBERTa backbones. These results position dynamic hypergraph construction as an efficient, powerful alternative for ABSA, with potential extensions to other short-text NLP tasks.
PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval
With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. The FinAgentBench dataset formalizes this problem through two tasks: document ranking and chunk ranking. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and a lightweight multi-agent system. Each component is examined extensively to reveal their synergies: prompt engineering provides precise task instructions, ICL supplies semantically relevant few-shot examples, and the multi-agent system models coordinated scoring behaviour. Our best configuration achieves an NDCG@5 of 0.71818 on the restricted validation split. We further demonstrate that PRISM is feasible and robust for production-scale financial retrieval. Its modular, inference-only design makes it practical for real-world use cases. The source code is released at https://bit.ly/prism-ailens.
comment: 3rd-place solution for the ACM ICAIF 2025 Agentic Retrieval Grand Challenge
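For readers unfamiliar with the NDCG@5 metric quoted in the PRISM abstract above, here is a minimal Python sketch of NDCG@k. It assumes the common linear-gain form with a log2 rank discount; the exact gain function used by FinAgentBench is not stated in the abstract, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain, assumed)."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the system ranked the items (toy data).
perfect_ranking = ndcg_at_k([3, 2, 1, 0, 0])
one_slot_late = ndcg_at_k([0, 1, 0, 0, 0])
```

A score of 1.0 means the system's order matches the ideal order within the top k; placing the only relevant item in second position already lowers the score to about 0.63 under this formulation.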
Synthetic Clinical Notes for Rare ICD Codes: A Data-Centric Framework for Long-Tail Medical Coding
Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOTA. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.
comment: 4 page-short paper
Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data. Combined with a turn-based weighted loss design, this yields a lightweight, distribution-consistent fine-tuning method. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.
comment: 10 pages, 7 figures
Error-Driven Scene Editing for 3D Grounding in Large Language Models
Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D LLMs.
comment: Code: https://github.com/zhangyuejoslin/Deer-3D
Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement
Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
comment: 12 pages, 8 figures, 5 tables. Dataset and code available at https://doi.org/10.5281/zenodo.16890154 and https://doi.org/10.5281/zenodo.15837871
GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks across various stages of model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have brought a profound paradigm shift in LLM training, particularly for complex reasoning tasks. However, the on-policy nature of RLVR introduces a unique privacy leakage pattern: since training relies on self-generated responses without fixed ground-truth outputs, membership inference must now determine whether a given prompt (independent of any specific response) is used during fine-tuning. This creates a threat where leakage arises not from answer memorization. To audit this novel privacy risk, we propose Divergence-in-Behavior Attack (DIBA), the first membership inference framework specifically designed for RLVR. DIBA shifts the focus from memorization to behavioral change, leveraging measurable shifts in model behavior across two axes: advantage-side improvement (e.g., correctness gain) and logit-side divergence (e.g., policy drift). Through comprehensive evaluations, we demonstrate that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. We validate DIBA's superiority across multiple settings--including in-distribution, cross-dataset, cross-algorithm, black-box scenarios, and extensions to vision-language models. Furthermore, our attack remains robust under moderate defensive measures. To the best of our knowledge, this is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that even in the absence of explicit supervision, training data exposure can be reliably inferred through behavioral traces.
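The TPR@0.1%FPR metric cited in the DIBA abstract above measures how many true members an attack identifies while flagging almost no non-members. The Python sketch below shows one common way to compute a true-positive rate at a fixed false-positive budget; the function name, the strict-threshold convention, and the toy scores are assumptions for illustration and do not reproduce the paper's evaluation code.

```python
from typing import Sequence


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               target_fpr: float) -> float:
    """True-positive rate at the strictest threshold meeting the FPR budget."""
    negatives = sorted(nonmember_scores, reverse=True)
    # Number of non-members we may wrongly flag without exceeding the budget.
    allowed = int(target_fpr * len(negatives))
    # Flag a prompt as "member" only if its score is strictly above this value.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    hits = sum(score > threshold for score in member_scores)
    return hits / len(member_scores)


# Toy attack scores (higher means "more likely used in training").
members = [0.90, 0.80, 0.30, 0.95]
nonmembers = [0.10, 0.20, 0.85, 0.40]
tpr_quarter = tpr_at_fpr(members, nonmembers, target_fpr=0.25)
tpr_zero = tpr_at_fpr(members, nonmembers, target_fpr=0.0)
```

On realistic data the budget is tiny (0.1%), so the metric rewards attacks whose highest-confidence predictions are almost all correct, which is why it is often reported alongside AUC.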
AISAC: An Integrated multi-agent System for Transparent, Retrieval-Grounded Scientific Assistance
AI Scientific Assistant Core (AISAC) is an integrated multi-agent system developed at Argonne National Laboratory for scientific and engineering workflows. AISAC builds on established technologies - LangGraph for orchestration, FAISS for vector search, and SQLite for persistence - and integrates them into a unified system prototype focused on transparency, provenance tracking, and scientific adaptability. The system implements a Router-Planner-Coordinator workflow and an optional Evaluator role, using prompt-engineered agents coordinated via LangGraph's StateGraph and supported by helper agents such as a Researcher. Each role is defined through custom system prompts that enforce structured JSON outputs. A hybrid memory approach (FAISS + SQLite) enables both semantic retrieval and structured conversation history. An incremental indexing strategy based on file hashing minimizes redundant re-embedding when scientific corpora evolve. A configuration-driven project bootstrap layer allows research teams to customize tools, prompts, and data sources without modifying core code. All agent decisions, tool invocations, and retrievals are logged and visualized through a custom Gradio interface, providing step-by-step transparency for each reasoning episode. The authors have applied AISAC to multiple research areas at Argonne, including specialized deployments for waste-to-products research and energy process safety, as well as general-purpose scientific assistance, demonstrating its cross-domain applicability.
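The incremental indexing strategy mentioned in the AISAC abstract above (re-embedding only files whose hash has changed) can be sketched in a few lines of Python. This is a hedged illustration, not AISAC's code: the use of SHA-256, the in-memory dictionaries standing in for the file system and the SQLite hash table, and the function names are all assumptions.

```python
import hashlib
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Fingerprint file contents; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(data).hexdigest()


def files_to_reembed(corpus: Dict[str, bytes], seen_hashes: Dict[str, str]) -> List[str]:
    """Return names of new or modified files and record their current hashes."""
    changed = []
    for name, data in corpus.items():
        digest = content_hash(data)
        if seen_hashes.get(name) != digest:
            changed.append(name)          # needs (re-)embedding
            seen_hashes[name] = digest    # remember the version just indexed
    return changed


# Hypothetical corpus: file name -> raw bytes.
corpus = {"report_a.txt": b"first draft", "report_b.txt": b"notes"}
hash_store: Dict[str, str] = {}
first_pass = files_to_reembed(corpus, hash_store)    # everything is new
second_pass = files_to_reembed(corpus, hash_store)   # nothing changed
corpus["report_b.txt"] = b"notes, revised"
third_pass = files_to_reembed(corpus, hash_store)    # only the edited file
```

In a persistent setup the hash store would live in a database table keyed by file path, so that unchanged documents are skipped across sessions.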
HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection
Recent advancements in multimodal out-of-context (OOC) misinformation detection have made remarkable progress in checking the consistencies between different modalities for supporting or refuting image-text pairs. However, existing OOC misinformation detection methods tend to emphasize the role of internal consistency, ignoring the significance of external consistency between image-text pairs and external evidence. In this paper, we propose HiEAG, a novel Hierarchical Evidence-Augmented Generation framework to refine external consistency checking through leveraging the extensive knowledge of multimodal large language models (MLLMs). Our approach decomposes external consistency checking into a comprehensive engine pipeline, which integrates reranking and rewriting, apart from retrieval. The evidence reranking module utilizes Automatic Evidence Selection Prompting (AESP), which acquires the relevant evidence item from the products of evidence retrieval. Subsequently, the evidence rewriting module leverages Automatic Evidence Generation Prompting (AEGP) to improve task adaptation on MLLM-based OOC misinformation detectors. Furthermore, our approach enables explanation for judgment, and achieves impressive performance with instruction tuning. Experimental results on different benchmark datasets demonstrate that our proposed HiEAG surpasses previous state-of-the-art (SOTA) methods in the accuracy over all samples.
Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
comment: 17 pages, 5 figures
How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding
Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets ($\varepsilon \in \{4, 6\}$), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC $\approx$ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.
comment: 10 pages, 5 figures. Accepted to the Privacy-Preserving Machine Learning Workshop at EurIPS 2025
Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis
The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely annotated ones. To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT), empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.
Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings
Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
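The readout change described in the HTP abstract above, replacing last-token pooling with mean-pooling, can be illustrated with a small Python sketch. It covers only the pooling step, not the block-level summary tokens; the toy hidden states and function names are assumptions, and a real implementation would operate on model tensors and respect the attention mask.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden_states: List[Vector]) -> Vector:
    """Use the final token's hidden state as the sequence embedding."""
    return list(hidden_states[-1])


def mean_pool(hidden_states: List[Vector]) -> Vector:
    """Average the hidden states over all token positions."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n_tokens for d in range(dim)]


# Toy hidden states: 3 tokens, embedding dimension 2.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
last_embedding = last_token_pool(hidden)   # depends on the final position only
mean_embedding = mean_pool(hidden)         # every position contributes equally
```

Last-token pooling forces all information through a single position, whereas mean-pooling spreads the readout across the sequence, which is the intuition behind the over-squashing argument in the abstract.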
Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization
Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
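The return-based advantage estimation named in the GTPO abstract above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' algorithm: the discount factor, the choice to normalize over all turns of all trajectories in a group, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def discounted_returns(turn_rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for each turn of one trajectory."""
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    for t in range(len(turn_rewards) - 1, -1, -1):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns


def group_turn_advantages(group_rewards: List[List[float]],
                          gamma: float = 0.9,
                          eps: float = 1e-8) -> List[List[float]]:
    """Normalize per-turn returns across a group of sampled trajectories."""
    all_returns = [discounted_returns(r, gamma) for r in group_rewards]
    flat = [g for trajectory in all_returns for g in trajectory]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / (sigma + eps) for g in trajectory] for trajectory in all_returns]


# Two sampled trajectories of three turns each; only the first one succeeds.
group = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
advantages = group_turn_advantages(group, gamma=0.5)
```

Unlike a single trajectory-level reward, each turn here receives its own advantage, so earlier turns of a successful trajectory get a smaller but still positive share of the credit.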
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms baseline methods, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at https://github.com/mihara-bot/DIQ.
comment: preprint, under review
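The quadrant idea in the DIQ abstract above, keeping samples that are both high-difficulty and high-influence, can be sketched in Python. The median cut-offs, the `Sample` record, and the scores are assumptions for illustration; the abstract does not specify how the thresholds or the two scores are computed.

```python
from statistics import median
from typing import List, NamedTuple


class Sample(NamedTuple):
    sample_id: str
    difficulty: float   # assumed precomputed difficulty score
    influence: float    # assumed precomputed gradient-based influence score


def select_high_high(samples: List[Sample]) -> List[Sample]:
    """Keep samples above the median on both axes (assumed thresholds)."""
    d_cut = median(s.difficulty for s in samples)
    i_cut = median(s.influence for s in samples)
    return [s for s in samples if s.difficulty > d_cut and s.influence > i_cut]


# Toy pool with one sample in each quadrant.
pool = [
    Sample("hard_useful", 0.9, 0.8),
    Sample("easy_useful", 0.2, 0.9),
    Sample("hard_noisy", 0.8, 0.1),
    Sample("easy_noisy", 0.1, 0.2),
]
selected = select_high_high(pool)
```

Selecting on either axis alone would also admit the "easy_useful" or "hard_noisy" samples, which corresponds to the failure modes of influence-only and difficulty-only selection that the abstract describes.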
SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI
Investigating the effects of climate change and global warming caused by GHG emissions has been a key concern worldwide. These emissions are largely attributable to the production, use, and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate graphical representations of the key procedural information used for LCA, known as Product Category Rules Process Flow Graphs (PCR PFGs). We additionally evaluate the output of SpiderGen by comparing it with 65 real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 65% across 10 sample data points, as compared to 53% using a one-shot prompting method. We observe that the remaining errors occur primarily due to differences in detail between LCA documents, as well as differences in the "scope" of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen's potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than $1 USD in under 10 minutes, as compared to the status quo LCA, which can cost over $25,000 USD and take up to 21 person-days.
MajinBook: An open catalogue of digital world literature with likes
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
comment: 9 pages, 5 figures, 1 table
Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models
Multimodal foundation models (MFMs) such as OFASys show the potential to unlock analysis of complex data such as images, videos, and audio data via text prompts alone. However, their performance may suffer in the face of text input that differs even slightly from their training distribution, which is surprising considering the use of modality-specific data to "ground" the text input. This study demonstrates that prompt instability is a major concern for MFMs, leading to a consistent drop in performance across all modalities, but that instability can be mitigated with additional training with augmented data. We evaluate several methods for grounded prompt perturbation, where we generate perturbations and filter based on similarity to text and/or modality data. After re-training the models on the augmented data, we find improved accuracy and more stable performance on the perturbed test data regardless of perturbation condition, suggesting that the data augmentation strategy helps the models handle domain shifts more effectively. In error analysis, we find consistent patterns of performance improvement across domains, suggesting that retraining on prompt perturbations tends to help general reasoning capabilities in MFMs.
comment: arxiv
Automatic Fact-checking in English and Telugu
False information poses a significant global challenge, and manually verifying claims is a time-consuming and resource-intensive process. In this research paper, we experiment with different approaches to investigate the effectiveness of large language models (LLMs) in classifying factual claims by their veracity and generating justifications in English and Telugu. The key contributions of this work include the creation of a bilingual English-Telugu dataset and the benchmarking of different veracity classification approaches based on LLMs.
comment: Proceedings of the First Workshop on Advancing NLP for Low Resource Languages associated with RANLP 2025 Varna Bulgaria September 13 2025 pages 140-151
OptScale: Probabilistic Optimality for Inference-time Scaling AAAI-2026
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, which makes it possible to determine the minimal number of samples needed to satisfy predefined performance thresholds and confidence levels. Extensive experiments on representative reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while achieving reasoning performance better than or on par with the state of the art. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning. The source code is publicly available at https://github.com/Albertwyk/OptScale.
comment: Accepted by AAAI-2026
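To make the idea of a lower bound on the number of samples concrete, here is a deliberately simplified sketch. It is not the bound derived in the OptScale paper: it assumes each i.i.d. sample is correct with a known probability p and that selection always picks a correct sample when one exists, so the target is reached once 1 - (1 - p)^N is at least the desired confidence. The function name and parameters are hypothetical.

```python
import math


def min_samples(p_success, target_confidence):
    """Smallest N such that 1 - (1 - p_success)**N >= target_confidence.

    Simplifying assumptions of this sketch: samples are i.i.d. Bernoulli
    with success probability p_success, and the selector is perfect.
    """
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must be strictly between 0 and 1")
    if not 0.0 < target_confidence < 1.0:
        raise ValueError("target_confidence must be strictly between 0 and 1")
    # Solve (1 - p)^N <= 1 - target for N and round up to an integer.
    return math.ceil(math.log(1.0 - target_confidence) / math.log(1.0 - p_success))


n_easy = min_samples(0.5, 0.9)   # an easier question needs few samples
n_hard = min_samples(0.2, 0.95)  # a harder question needs many more
print(n_easy, n_hard)
```

The gap between the two results is the intuition behind allocating samples per question rather than using one fixed N: easy questions reach the target quickly, and the saved compute can go to hard ones.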
IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection AAAI
We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user's understanding of a given text.
comment: Accepted for the 40th Annual AAAI Conference on Artificial Intelligence (2026) - Demonstration Track
Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm AAAI 2026
Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through extensive evaluations of agents built on frontier LLMs, BehaviorBench validates the effectiveness of behavior editing across a wide range of models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
comment: AAAI 2026 Oral. 14 pages (including appendix), 11 figures. Code, data, results, and additional resources are available at: https://model-editing.github.io
AI use in American newspapers is widespread, uneven, and rarely disclosed
AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.
comment: Technical Report
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To overcome this restriction to a specific VLM type, we present Generation after Recalibration (GenRecal), a general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performance, eventually outperforming large-scale open- and closed-source VLMs.
comment: Project page: https://byungkwanlee.github.io/GenRecal-page/
ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models IJCNN 2025
Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
comment: Accepted by IJCNN 2025
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.67% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem, the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem, the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
comment: The original submission contained metadata errors and requires correction. A revised and complete version will be submitted as a replacement
MoM: Linear Sequence Modeling with Mixture-of-Memories
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain linear complexity during training and constant complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
comment: Technical report, 18 pages
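The routing idea in the MoM abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released implementation: it uses top-1 routing from a linear router, a plain linear-attention update (add the outer product of key and value to the chosen memory), and pure-Python lists. The class name, method names, and router weights are all hypothetical.

```python
class ToyMixtureOfMemories:
    """Several independent fixed-size memories with a top-1 token router."""

    def __init__(self, router_weights, key_dim, value_dim):
        # One router weight vector per memory (hypothetical, not learned here).
        self.router_weights = router_weights
        self.memories = [
            [[0.0] * value_dim for _ in range(key_dim)] for _ in router_weights
        ]

    def route(self, x):
        # Score every memory with a dot product and pick the best one.
        scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in self.router_weights]
        return max(range(len(scores)), key=scores.__getitem__)

    def step(self, x, query, key, value):
        index = self.route(x)
        memory = self.memories[index]
        # Linear-attention style write: memory += outer(key, value).
        for i, k_i in enumerate(key):
            for j, v_j in enumerate(value):
                memory[i][j] += k_i * v_j
        # Read: output = query @ memory. Cost depends only on the memory
        # size, not on how many tokens have been seen so far.
        output = [
            sum(query[i] * memory[i][j] for i in range(len(query)))
            for j in range(len(value))
        ]
        return index, output


mom = ToyMixtureOfMemories(router_weights=[[1, 0], [0, 1]], key_dim=2, value_dim=2)
first = mom.step(x=[1, 0], query=[1, 0], key=[1, 0], value=[2, 3])
second = mom.step(x=[0, 1], query=[1, 0], key=[1, 0], value=[5, 7])
print(first, second)
```

Both tokens use the same key, so a single shared memory would have blended their values. Because the router sends them to different memories, each read returns only the value written there, which is the interference reduction the abstract describes.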
SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs
Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x through achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving. The code is available at https://github.com/kaist-ina/specedge
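The draft-then-verify split that SpecEdge builds on can be sketched as generic greedy speculative decoding. This sketch does not model SpecEdge's proactive drafting or pipeline-aware scheduling; the two "models" are hypothetical toy functions that map a token prefix to the next token, and verification is written as a loop for clarity where a real server would verify all drafted positions in one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, num_draft_tokens):
    """One round of greedy speculative decoding; returns the tokens to append.

    draft_next and target_next are stand-ins for the small (edge) and
    large (server) models: each maps a list of tokens to the next token.
    """
    # Edge side: draft tokens autoregressively with the small model.
    drafted = []
    context = list(prefix)
    for _ in range(num_draft_tokens):
        token = draft_next(context)
        drafted.append(token)
        context.append(token)

    # Server side: accept drafted tokens until the first disagreement.
    accepted = []
    context = list(prefix)
    for token in drafted:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # replace the wrong draft and stop
            return accepted
        accepted.append(token)
        context.append(token)

    # Every draft was accepted, so the server contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context):
    return (context[-1] + 1) % 10


def toy_draft(context):
    # Agrees with the target except right after token 3.
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


partial = speculative_step([1], toy_draft, toy_target, num_draft_tokens=4)
full = speculative_step([5], toy_draft, toy_target, num_draft_tokens=2)
print(partial, full)
```

Only token ids cross between the two sides in this sketch, which matches the abstract's point that the edge and the server exchange token outputs rather than activations.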
Continuous sentiment scores for literary and multilingual contexts
Sentiment Analysis is widely used to quantify sentiment in text, but its application to literary texts poses unique challenges due to figurative language, stylistic ambiguity, as well as sentiment evocation strategies. Traditional dictionary-based tools often underperform, especially for low-resource languages, and transformer models, while promising, typically output coarse categorical labels that limit fine-grained analysis. We introduce a novel continuous sentiment scoring method based on concept vector projection, trained on multilingual literary data, which more effectively captures nuanced sentiment expressions across genres, languages, and historical periods. Our approach outperforms existing tools on English and Danish texts, producing sentiment scores whose distribution closely matches human ratings, enabling more accurate analysis and sentiment arc modeling in literature.
comment: 16 pages after compiling, 3025 words, 6 figures, 5 tables and an algorithm
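A continuous score from concept vector projection can be illustrated as follows. The construction here is an assumption made for illustration, since the abstract does not specify it: the concept vector is taken as the difference between the mean embedding of positive examples and the mean embedding of negative examples, and a text's score is the projection of its embedding onto that direction. The tiny 2-dimensional embeddings are made up.

```python
import math


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def build_concept_vector(positive_embeddings, negative_embeddings):
    """Unit vector pointing from the negative mean to the positive mean."""
    pos_mean = mean_vector(positive_embeddings)
    neg_mean = mean_vector(negative_embeddings)
    direction = [p - n for p, n in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(d * d for d in direction))
    return [d / norm for d in direction]


def sentiment_score(embedding, concept_vector):
    # Scalar projection: a real number rather than a categorical label.
    return sum(e * c for e, c in zip(embedding, concept_vector))


concept = build_concept_vector(
    positive_embeddings=[[2.0, 1.0], [2.0, -1.0]],
    negative_embeddings=[[-2.0, 1.0], [-2.0, -1.0]],
)
mildly_positive = sentiment_score([0.5, 3.0], concept)
clearly_negative = sentiment_score([-1.5, 0.2], concept)
print(mildly_positive, clearly_negative)
```

Because the output is a real number, consecutive sentences can be scored and smoothed into a sentiment arc, which coarse positive/negative/neutral labels do not support well.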
Categorical Emotions or Appraisals - Which Emotion Model Explains Argument Convincingness Better?
The convincingness of an argument depends not only on its structure (logos) and the person who makes the argument (ethos), but also on the emotion that it causes in the recipient (pathos). While the overall intensity and categorical values of emotions in arguments have received considerable attention in the research community, we argue that the emotion an argument evokes in a recipient is subjective. It depends on the recipient's goals, standards, prior knowledge, and stance. Appraisal theories lend themselves as a link between the subjective cognitive assessment of events and emotions. They have been used in event-centric emotion analysis, but their suitability for assessing argument convincingness remains unexplored. In this paper, we evaluate whether appraisal theories are suitable for emotion analysis in arguments by considering subjective cognitive evaluations of the importance and impact of an argument on its receiver. Based on the annotations in the recently published ContArgA corpus, we perform zero-shot prompting experiments to evaluate the importance of gold-annotated and predicted emotions and appraisals for the assessment of the subjective convincingness labels. We find that, while categorical emotion information does improve convincingness prediction, the improvement is more pronounced with appraisals. This work presents the first systematic comparison between emotion models for convincingness prediction, demonstrating the advantage of appraisals and providing insights for theoretical and practical applications in computational argumentation.
Artificial intelligence contribution to translation industry: looking back and forward
This study provides a comprehensive analysis of artificial intelligence (AI) contribution to research in the translation industry (ACTI), synthesizing it over forty-five years, from 1980 to 2024. 13220 articles were retrieved from three sources, namely WoS, Scopus, and Lens; 9836 were unique records, which were used for the analysis. We provided two types of analysis, scientometric and thematic. The former focused on clusters, subject categories, keywords, bursts, centrality, and research centers. For the latter, we provided a thematic review of 18 articles, selected purposefully from the articles involved, centering on purpose, approach, findings, and contribution to ACTI future directions. This study is significant for its valuable contribution to ACTI knowledge production over 45 years, emphasizing several trending issues and hotspots including Machine translation, Statistical machine translation, Low-resource language, Large language model, Arabic dialects, Translation quality, and Neural machine translation. The findings reveal that the more AI develops, the more it contributes to the translation industry, as Neural Networking Algorithms have been incorporated and Deep Language Learning Models like ChatGPT have been launched. However, much rigorous research is still needed to overcome several problems facing the translation industry, specifically concerning low-resource, multi-dialectical and free word order languages, and cultural and religious registers.
comment: 30 pages, 13 figures
Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training
Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provide an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers for both source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that utilizing asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups (50K, 100K, and 500K sentence pairs, respectively). We validated this trend across six additional language pairs (English and Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvement in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provide optimal results, particularly benefiting low-resource MT.
comment: Accepted at WAT 2025 (Camera-Ready Version)
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL from users' natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we summarize the survey, discuss the remaining challenges in this field, and suggest future research directions. All related resources on LLM-based text-to-SQL, including research papers, benchmarks, and open-source projects, are collected for the community in our repository: https://github.com/DEEP-PolyU/Awesome-LLM-based-Text2SQL.
comment: Accepted to IEEE TKDE2025
Patent Language Model Pretraining with ModernBERT
Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain 3 domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference, over 3x that of PatentBERT, underscoring their suitability for time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.
comment: 7 pages, 5 figures, 4 tables
Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum NAACL 2025
There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Vernacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.
comment: Published in NAACL 2025 findings
OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.
comment: INLG 2025
Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction where users are intentional about the degree to which they specify queries. We develop a principled framework based on a shared responsibility of query specification between user and system, distinguishing unambiguous and ambiguous cooperative queries, which systems can resolve through reasonable inference, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze the queries in 15 popular datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system's execution accuracy nor for evaluating interpretation capabilities. This conceptualization around cooperation in resolving queries informs how to design and evaluate natural language interfaces for tabular data analysis, for which we distill concrete directions for future research and broader implications.
comment: Accepted to the AI for Tabular Data workshop at EurIPS 2025
Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance AACL
Large Language Models (LLMs) excel at providing information acquired during pretraining on large-scale corpora and following instructions through user prompts. This study investigates whether the quality of LLM responses varies depending on the demographic profile of users. Considering English as the global lingua franca, along with the diversity of its dialects among speakers of different native languages, we explore whether non-native English speakers receive lower-quality or even factually incorrect responses from LLMs more frequently. Our results show that performance discrepancies occur when LLMs are prompted by native versus non-native English speakers and persist when comparing native speakers from Western countries with others. Additionally, we find a strong anchoring effect when the model recognizes or is made aware of the user's nativeness, which further degrades the response quality when interacting with non-native speakers. Our analysis is based on a newly collected dataset with over 12,000 unique annotations from 124 annotators, including information on their native language and English proficiency.
comment: Accepted at IJCNLP-AACL (findings)
MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration AAAI-2026
The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.
comment: 48 pages, 3 figures. Accepted in AAAI-2026 (Main Technical Track). For code and model, see https://github.com/JianChengXingYun/Mctsr-Zero
In-context Language Learning for Endangered Languages in Speech Recognition
With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs. Our code is publicly available.
comment: Interspeech 2025
Evaluation of OpenAI o1: Opportunities and Challenges of AGI
This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: an 83.3% success rate in solving complex competitive programming problems, surpassing many human experts; superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models; 100% accuracy in high school-level mathematical reasoning tasks, with detailed step-by-step solutions; advanced natural language inference capabilities across general and specialized domains like medicine; impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis; remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields; strong capabilities in quantitative investing, with comprehensive financial knowledge and statistical modeling skills; and effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by the static synthetic data pipelines where data generation and model training are executed as two separate, non-interactive processes. This approach fails to adaptively focus on a model's specific weaknesses and allows noisy labels to persist, degrading training efficiency. We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model's mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process operates within a cost-effective, open-source ecosystem, eliminating dependence on expensive closed-source APIs. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.
comment: The code is accessible at https://github.com/Rednote-DeepExperience/LoopTool. The LoopTool-8B is accessible at https://huggingface.co/zhuiguang-ning/LoopTool-8B
EvoLM: In Search of Lost Language Model Training Dynamics NeurIPS 2025
Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. We train over 100 LMs with 1B and 4B parameters from scratch, and evaluate both upstream (language modeling) and downstream (problem-solving) capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.
comment: NeurIPS 2025 (Oral)
GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability
Improving the general capabilities of large language models (LLMs) is an active research topic. As a common data structure in many real-world domains, understanding graph data is a crucial part of advancing general intelligence. To this end, we propose a dynamic benchmark named GraphInstruct in this paper, which comprehensively includes 21 classical graph reasoning tasks, providing diverse graph generation pipelines and detailed intermediate reasoning steps for each sample. Based on GraphInstruct, we develop GraphSolver via efficient instruction-tuning, which demonstrates prominent graph understanding capability compared to other open-sourced LLMs. To further endow LLMs with multi-step graph reasoning capability, we propose a label-mask training strategy and build GraphSolver+, which leverages masked supervision on intermediate reasoning tokens to emphasize crucial node-identification signals. As one of the pioneering efforts to enhance the graph understanding and reasoning abilities of LLMs, extensive experiments have demonstrated the superiority of GraphSolver and GraphSolver+ over other LLMs. We sincerely hope GraphInstruct will facilitate further research on applying LLMs to graph-structured data. Our code and data are released publicly at: https://github.com/CGCL-codes/GraphInstruct.
comment: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: 10.1007/s11704-025-51382-0
CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, a cost- and uncertainty-aware policy that selects edge-only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention-style kernels, speculative decoding, and quantized LoRA adapters, and supports on-device personalization and federated updates under non-IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service-level objectives: it sustains sub-second (p95) end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth costs by preferring local retrieval-grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge-first design that treats semantics, privacy, and predictable latency as co-equal goals for large-model deployments in interference-prone environments.
comment: 19 pages, 8 figures
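The three-way choice attributed to PromptRouter in this abstract (edge-only generation, edge plus retrieval, or cloud escalation) can be pictured with a small threshold rule. The sketch below is an assumption made for illustration only: the thresholds, cost fields, and fallback behaviour are hypothetical, and the abstract does not describe how the actual policy is computed.

```python
from dataclasses import dataclass

@dataclass
class RouteCosts:
    """Hypothetical per-route cost estimates (arbitrary units)."""
    edge_only: float
    edge_rag: float
    cloud: float

def route(uncertainty, budget, costs, low=0.2, high=0.6):
    """Choose a route from an uncertainty score in [0, 1] and a cost budget.

    Hypothetical rule: low uncertainty stays on the edge, medium uncertainty
    adds local retrieval, and high uncertainty escalates to the cloud only
    when the cloud cost fits the budget (otherwise fall back to edge + retrieval).
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must be in [0, 1]")
    if uncertainty < low:
        return "edge_only"
    if uncertainty < high:
        return "edge_rag"
    return "cloud" if costs.cloud <= budget else "edge_rag"

if __name__ == "__main__":
    costs = RouteCosts(edge_only=1.0, edge_rag=2.0, cloud=5.0)
    for u in (0.1, 0.4, 0.9):
        print(u, route(u, budget=10.0, costs=costs))
```

Keeping the rule as a pure function of uncertainty and budget makes each routing decision easy to log and audit, which fits the auditable design the abstract describes.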
Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers
As Audio Large Language Models (ALLMs) emerge as powerful tools for speech processing, their safety implications demand urgent attention. While considerable research has explored textual and vision safety, audio's distinct characteristics present significant challenges. This paper first investigates: Are ALLMs vulnerable to backdoor attacks exploiting acoustic triggers? In response to this issue, we introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. These changes introduce consistent patterns that an ALLM's acoustic feature encoder captures, embedding robust triggers within the audio stream. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types. Extensive experiments on AudioSafe and three established safety datasets reveal critical vulnerabilities in existing ALLMs: (I) audio features like environment noise and speech rate variations achieve over 90% average attack success rate; (II) ALLMs exhibit significant sensitivity differences across acoustic features, particularly showing minimal response to volume as a trigger; and (III) poisoned sample inclusion causes only marginal loss curve fluctuations, highlighting the attack's stealth.
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
comment: Survey; 7 figures, 3 tables, 44 pages
Deep Learning and Machine Learning -- Natural Language Processing: From Theory to Application
With a focus on natural language processing (NLP) and the role of large language models (LLMs), we explore the intersection of machine learning, deep learning, and artificial intelligence. As artificial intelligence continues to revolutionize fields from healthcare to finance, NLP techniques such as tokenization, text classification, and entity recognition are essential for processing and understanding human language. This paper discusses advanced data preprocessing techniques and the use of frameworks like Hugging Face for implementing transformer-based models. Additionally, it highlights challenges such as handling multilingual data, reducing bias, and ensuring model robustness. By addressing key aspects of data processing and model fine-tuning, this work aims to provide insights into deploying effective and ethically sound AI solutions.
comment: 252 pages
Predicting the Performance of Black-box LLMs through Self-Queries NeurIPS 2025
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., if a particular generation correctly answers a question). Remarkably, these can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of a language model's state. For instance, they can be used to distinguish between a clean version of GPT-4o-mini and a version that has been influenced via an adversarial system prompt that answers question-answering tasks incorrectly or introduces bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).
comment: NeurIPS 2025
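The idea in this abstract of training a linear model on low-dimensional features taken from follow-up prompts can be sketched as a small logistic regression. Everything concrete below is an assumption for illustration: the two features stand for hypothetical probabilities of a "yes" answer to two follow-up questions, the data are synthetic, and the plain-Python training loop is not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression model with per-sample gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Predicted probability that the original answer was correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Synthetic rows (NOT from the paper): [p("yes" | follow-up 1), p("yes" | follow-up 2)]
X = [[0.9, 0.8], [0.85, 0.9], [0.8, 0.7], [0.2, 0.3], [0.1, 0.25], [0.3, 0.1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = the model's original answer was correct

if __name__ == "__main__":
    w, b = train_logistic(X, y)
    print(predict(w, b, [0.9, 0.9]) > 0.5, predict(w, b, [0.1, 0.1]) > 0.5)
```

Because the predictor only consumes response probabilities, it needs no access to hidden states, which is the black-box setting the abstract targets.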
IPAD: Inverse Prompt for AI Detection - A Robust and Interpretable LLM-Generated Text Detector
Large Language Models (LLMs) have attained human-level fluency in text generation, which complicates distinguishing between human-written and LLM-generated texts. This increases the risk of misuse and highlights the need for reliable detectors. Yet, existing detectors exhibit poor robustness on out-of-distribution (OOD) data and attacked data, both of which are critical in real-world scenarios. Also, they struggle to provide interpretable evidence to support their decisions, undermining their reliability. In light of these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter that identifies predicted prompts that could have generated the input text, and two Distinguishers that examine the probability that the input texts align with the predicted prompts. Empirical evaluations demonstrate that IPAD outperforms the strongest baselines by 9.05% (Average Recall) on in-distribution data, 12.93% (AUROC) on out-of-distribution data, and 5.48% (AUROC) on attacked data. IPAD also performs robustly on structured datasets. Furthermore, an interpretability assessment is conducted to illustrate that IPAD enhances the AI detection trustworthiness by allowing users to directly examine the decision-making evidence, which provides interpretable support for its state-of-the-art detection results.
Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via multi-stage reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection
Large Language Model (LLM) agents offer a powerful new paradigm for solving various problems by combining natural language reasoning with the execution of external tools. However, their dynamic and non-transparent behavior introduces critical security risks, particularly in the presence of prompt injection attacks. In this work, we propose a novel insight that treats the agent runtime traces as structured programs with analyzable semantics. Thus, we present AgentArmor, a program analysis framework that converts agent traces into graph-based intermediate representations of program dependencies (e.g., CFG, DFG, and PDG) and enforces security policies via a type system. AgentArmor consists of three key components: (1) a graph constructor that reconstructs the agent's runtime traces as graph-based intermediate representations with control and data flow described within; (2) a property registry that attaches security-relevant metadata to the tools and data the agent interacts with; and (3) a type system that performs static inference and checking over the intermediate representation. By representing agent behavior as structured programs, AgentArmor enables program analysis for sensitive data flow, trust boundaries, and policy violations. We evaluate AgentArmor on the AgentDojo benchmark; the results show that AgentArmor can reduce the attack success rate (ASR) to 3%, with a utility drop of only 1%.
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition (rather than images alone) aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model implicitly models object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, we present Iris and investigate the benefits of integrating text descriptions into the training and inference of diffusion-based depth estimation models. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves the overall monocular depth estimation accuracy, especially in small areas. It also improves the model's depth perception of specific regions described in the text. We find that by providing more details in the text, the depth prediction can be iteratively refined. Simultaneously, we find that language can act as a constraint to accelerate the convergence of both training and the inference diffusion trajectory. Code and generated text data will be released upon acceptance.
Do Retrieval Augmented Language Models Know When They Don't Know? AAAI 2026
Existing large language models (LLMs) occasionally generate plausible yet factually incorrect responses, known as hallucinations. Two main approaches have been proposed to mitigate hallucinations: retrieval-augmented language models (RALMs) and refusal post-training. However, current research predominantly focuses on their individual effectiveness while overlooking the evaluation of the refusal capability of RALMs. Ideally, if RALMs know when they do not know, they should refuse to answer. In this study, we ask the fundamental question: Do RALMs know when they don't know? Specifically, we investigate three questions. First, are RALMs well calibrated with respect to different internal and external knowledge states? We examine the influence of various factors. Contrary to expectations, when all retrieved documents are irrelevant, RALMs still tend to refuse questions they could have answered correctly. Next, given the model's pronounced over-refusal behavior, we raise a second question: How does a RALM's refusal ability align with its calibration quality? Our results show that the over-refusal problem can be mitigated through in-context fine-tuning. However, we observe that improved refusal behavior does not necessarily imply better calibration or higher overall accuracy. Finally, we ask: Can we combine refusal-aware RALMs with uncertainty-based answer abstention to mitigate over-refusal? We develop a simple yet effective refusal mechanism for refusal-post-trained RALMs that improves their overall answer quality by balancing refusal and correct answers. Our study provides a more comprehensive understanding of the factors influencing RALM behavior. Meanwhile, we emphasize that uncertainty estimation for RALMs remains an open problem deserving deeper investigation.
comment: AAAI 2026 camera ready version. Extended version with Appendix is coming soon
Anti-adversarial Learning: Desensitizing Prompts for Large Language Models AAAI 2026
With the widespread use of LLMs, preserving privacy in user prompts has become crucial, as prompts risk exposing private and sensitive data to cloud LLMs. Traditional techniques like homomorphic encryption, secure multi-party computation, and federated learning face challenges due to heavy computational costs and user participation requirements, limiting their applicability in LLM scenarios. In this paper, we propose PromptObfus, a novel method for desensitizing LLM prompts. The core idea of PromptObfus is "anti-adversarial" learning, which perturbs privacy words in the prompt to obscure sensitive information while retaining the stability of model predictions. Specifically, PromptObfus frames prompt desensitization as a masked language modeling task, replacing privacy-sensitive terms with a [MASK] token. A desensitization model is trained to generate candidate replacements for each masked position. These candidates are subsequently selected based on gradient feedback from a surrogate model, ensuring minimal disruption to the task output. We demonstrate the effectiveness of our approach on three NLP tasks. Results show that PromptObfus effectively prevents privacy inference from remote LLMs while preserving task performance.
comment: Accepted to AAAI 2026
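The first step described in this abstract, replacing privacy-sensitive terms with a [MASK] token, can be sketched as follows. This is a minimal illustration under stated assumptions: the list of private terms is supplied by the caller (the abstract does not say how terms are identified), and the later steps (candidate generation by a desensitization model and selection via surrogate-model gradients) are not implemented here.

```python
import re

def mask_private_terms(prompt, private_terms, mask_token="[MASK]"):
    """Replace each whole-word occurrence of a private term with the mask token.

    `private_terms` is a caller-supplied list (hypothetical input for this sketch).
    """
    if not private_terms:
        return prompt
    # Try longer terms first so multi-word terms win over shorter overlapping ones.
    ordered = sorted(private_terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mask_token, prompt)

if __name__ == "__main__":
    print(mask_private_terms("Alice lives in Paris.", ["Alice", "Paris"]))
```

In a pipeline like the one the abstract outlines, each masked position would then be filled with a replacement chosen to keep the downstream task output stable.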
NAIST Academic Travelogue Dataset
We have constructed the NAIST Academic Travelogue Dataset (ATD) and released it free of charge for academic research. This dataset is a Japanese text dataset with a total of over 31 million words, comprising 4,672 Japanese domestic travelogues and 9,607 overseas travelogues. Before this dataset was released, widely available travelogue data for research purposes was scarce, and each researcher had to prepare their own data, which hindered the replication of existing studies and fair comparative analysis of experimental results. Our dataset enables researchers to conduct investigations on the same data and ensures transparency and reproducibility in research. In this paper, we describe the academic significance, characteristics, and prospects of our dataset.
comment: Updated version with revised manuscript
Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. This survey provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.
comment: Preprint. Under Review
PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection AACL
The BLP-2025 Task 1A requires Bengali hate speech classification into six categories. Traditional supervised approaches need extensive labeled datasets that are expensive for low-resource languages. We developed PromptGuard, a few-shot framework combining chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making. We explore statistical keyword selection versus random approaches and adaptive voting mechanisms that extend classification based on consensus quality. Chi-square keywords provide consistent improvements across categories, while adaptive voting benefits ambiguous cases requiring extended classification rounds. PromptGuard achieves a micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Ablation studies confirm chi-square-based keywords show the most consistent impact across all categories.
comment: Accepted to BLP at AACL-IJCNLP 2025
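The two ingredients named in the PromptGuard abstract, chi-square keyword extraction and adaptive majority voting, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the whitespace tokenization, the function names, the `classify` callback (a stand-in for one few-shot LLM call), and the round and agreement thresholds are hypothetical and are not taken from the paper.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: in-class docs containing the term     b: out-of-class docs containing it
    c: in-class docs without the term        d: out-of-class docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_keywords(docs, labels, target, k):
    """Return up to k terms most positively associated with the target label."""
    tokenized = [set(doc.split()) for doc in docs]  # assumed: whitespace tokens
    in_class = sum(1 for y in labels if y == target)
    out_class = len(labels) - in_class
    scores = {}
    for term in set().union(*tokenized):
        a = sum(1 for toks, y in zip(tokenized, labels) if y == target and term in toks)
        b = sum(1 for toks, y in zip(tokenized, labels) if y != target and term in toks)
        c, d = in_class - a, out_class - b
        if a * d > b * c:  # keep only terms over-represented in the target class
            scores[term] = chi_square(a, b, c, d)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def adaptive_vote(classify, text, base_rounds=3, max_rounds=7, min_agreement=0.6):
    """Majority vote that adds rounds while the consensus is weak.

    classify(text, round_index) is a hypothetical stand-in for one model call.
    Returns (label, rounds_used).
    """
    votes = []
    while len(votes) < max_rounds:
        votes.append(classify(text, len(votes)))
        if len(votes) >= base_rounds:
            label, count = Counter(votes).most_common(1)[0]
            if count / len(votes) >= min_agreement:
                return label, len(votes)
    return Counter(votes).most_common(1)[0][0], len(votes)
```

In this sketch, clear-cut inputs stop after `base_rounds` calls, while ambiguous ones consume up to `max_rounds`, which mirrors the abstract's remark that adaptive voting mainly benefits ambiguous cases.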
Scaling Textual Gradients via Sampling-Based Momentum
LLM-based prompt optimization, which uses LLM-provided "textual gradients" (feedback) to refine prompts, has emerged as an effective method for automatic prompt engineering. However, its scalability and stability are unclear when using more data in training. We systematically investigate the potential and challenges of scaling training data in textual gradient descent. We show that naively scaling training examples is infeasible due to both explicit context-length limits and an implicit context wall, where long-context degradation yields diminishing returns. Inspired by prior wisdom in stochastic gradient descent, we propose Textual Stochastic Gradient Descent with Momentum (TSGD-M), which reweights updates through momentum sampling, using bootstrapped minibatch validation accuracy as importance weights over historical prompts. We introduce Gumbel-Top-$k$ sampling for prompt generation, balancing exploration-exploitation and improving sampling efficiency while maintaining a low-variance running mean estimator. TSGD-M integrates seamlessly into existing prompt optimization frameworks, including TextGrad, DSPy-COPRO, and AdalFlow, and achieves consistent gains across 5 benchmarks.
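The Gumbel-Top-$k$ sampling mentioned in the TSGD-M abstract is a general technique for drawing k distinct items with probabilities driven by importance weights. The sketch below shows the generic trick only. Treating each weight as the validation accuracy of a historical prompt is an assumption made here for illustration, and the function name and arguments are hypothetical rather than the paper's interface.

```python
import math
import random


def gumbel_top_k(weights, k, rng=random):
    """Sample up to k distinct indices without replacement.

    Each positive weight w_i is perturbed as log(w_i) + Gumbel(0, 1) noise and
    the k largest perturbed keys are kept, so heavier items are picked more
    often while lighter ones still get explored.
    """
    keys = []
    for i, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight items can never be selected
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() may return exactly 0.0
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        keys.append((math.log(w) + gumbel, i))
    keys.sort(reverse=True)
    return [i for _, i in keys[:k]]


if __name__ == "__main__":
    # Hypothetical validation accuracies of four historical prompts.
    accuracies = [0.55, 0.62, 0.71, 0.40]
    print(gumbel_top_k(accuracies, k=2, rng=random.Random(0)))
```

With k = 1 this reduces to the Gumbel-max trick, which samples index i with probability proportional to its weight.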
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling NeurIPS 2025
Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation can suffer from memory degradation in long contexts and limit fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7$\times$ improvement in training speed for 100K sequence length and 9$\times$ in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results. Code is available at https://github.com/CLAIRE-Labo/RAT.
comment: Accepted by NeurIPS 2025
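The chunking pattern described in the RAT abstract (recurrence inside each chunk, softmax attention across chunks) can be pictured with a toy, scalar-valued sketch. This is not the RAT block: the exponential-moving-average recurrence, the choice of chunk-final states as keys and values, and all names below are assumptions chosen only to show why each token attends over roughly one entry per chunk instead of one entry per token.

```python
import math


def chunked_recurrence(xs, chunk_size, decay=0.5):
    """Run a simple recurrence whose state resets at every chunk boundary."""
    states, h = [], 0.0
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            h = 0.0  # start of a new chunk: forget the previous chunk
        h = decay * h + (1.0 - decay) * x
        states.append(h)
    return states


def cross_chunk_attention(states, chunk_size, t):
    """Softmax attention from position t over one summary state per chunk.

    Keys and values are the final state of every completed earlier chunk plus
    the current state at t. Returns (output, number_of_attended_entries).
    """
    keys = [states[end] for end in range(chunk_size - 1, t, chunk_size)]
    keys.append(states[t])
    query = states[t]
    scores = [query * key for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    output = sum(w * key for w, key in zip(weights, keys)) / sum(weights)
    return output, len(keys)


if __name__ == "__main__":
    states = chunked_recurrence([1.0] * 64, chunk_size=16)
    _, attended = cross_chunk_attention(states, chunk_size=16, t=63)
    print(attended)  # 4 entries instead of 64 for full token-level attention
```

In this toy setting the attention cost at position t grows with t divided by the chunk size, which is the intuition behind the efficiency claim. The measured speedups in the abstract come from the authors' implementation, not from this sketch.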
Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering AAAI 2026
Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves by up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.
comment: Accepted by AAAI 2026
Conflict Adaptation in Vision-Language Models NeurIPS 2025
A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.
comment: Workshop on Interpreting Cognition in Deep Learning Models at NeurIPS 2025
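The abstract defines conflict adaptation as improved performance on a high-conflict trial that follows another high-conflict trial. A minimal way to quantify that from a trial sequence is sketched below, assuming each trial is recorded as a pair (is_incongruent, is_correct). The data layout and function name are hypothetical, and the paper's actual analysis may differ.

```python
def conflict_adaptation_effect(trials):
    """Accuracy on incongruent trials after incongruent vs. after congruent ones.

    trials: list of (is_incongruent, is_correct) tuples in presentation order.
    Returns acc(incongruent | previous incongruent)
          - acc(incongruent | previous congruent),
    or None when either condition has no trials. A positive value is
    consistent with conflict adaptation.
    """
    after_incongruent, after_congruent = [], []
    for previous, current in zip(trials, trials[1:]):
        if not current[0]:
            continue  # only the current high-conflict trials are scored
        bucket = after_incongruent if previous[0] else after_congruent
        bucket.append(1.0 if current[1] else 0.0)
    if not after_incongruent or not after_congruent:
        return None
    return (sum(after_incongruent) / len(after_incongruent)
            - sum(after_congruent) / len(after_congruent))
```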
Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models AACL 2025
Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits -- and in caption-sparse contexts (e.g., Xhosa) amplifies -- the English anchor's crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.
comment: Accepted at IJCNLP-AACL 2025
Planning-Aware Code Infilling via Horizon-Length Prediction
Fill-in-the-Middle (FIM), or infilling, has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which performs next-token prediction (NTP) over reordered sequences, often leads to models struggling to generate content that aligns well with the surrounding context. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different model families and sizes shows that HLP significantly improves FIM performance by up to 24% (relative) on diverse benchmarks at both the file level and the repository level. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios.
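The HLP objective is described as predicting the number of remaining middle tokens at each step. A minimal sketch of how such per-step targets could be built for one training example follows. Counting the token about to be generated as "remaining" and the optional normalization by the middle length are assumptions made here, not details confirmed by the abstract.

```python
def hlp_targets(middle_len, normalize=True):
    """Per-step horizon targets for a middle span of `middle_len` tokens.

    At step t (0-indexed) the raw target is the number of middle tokens still
    to be produced, counting the token generated at step t itself (assumed).
    With normalize=True the targets are divided by middle_len, so they fall
    in (0, 1] regardless of span length (also an assumption).
    """
    if middle_len <= 0:
        return []
    remaining = [middle_len - t for t in range(middle_len)]
    if normalize:
        return [r / middle_len for r in remaining]
    return remaining


if __name__ == "__main__":
    print(hlp_targets(4, normalize=False))  # [4, 3, 2, 1]
    print(hlp_targets(4))                   # [1.0, 0.75, 0.5, 0.25]
```

A regression head trained on targets like these would sit alongside the usual next-token loss and can be dropped at inference, which is consistent with the abstract's claim of no additional inference cost.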
Human-Computer Interaction
Natural Language Interfaces for Databases: What Do Users Think?
Natural Language Interfaces for Databases (NLIDBs) aim to make database querying accessible by allowing users to ask questions in everyday language rather than using formal SQL queries. Despite significant advancements in translation accuracy, critical usability challenges, such as user frustration, query refinement strategies, and error recovery, remain underexplored. To investigate these usability dimensions, we conducted a mixed-method user study comparing SQL-LLM, a state-of-the-art NL2SQL system, with Snowflake, a traditional SQL analytics platform. Our controlled evaluation involved 20 participants completing realistic database querying tasks across 12 queries each. Results show that SQL-LLM significantly reduced query completion times by 10 to 30 percent (mean: 418 s vs. 629 s, p = 0.036) and improved overall accuracy from 50 to 75 percent (p = 0.002). Additionally, participants using SQL-LLM exhibited fewer query reformulations, recovered from errors 30 to 40 seconds faster, and reported lower frustration levels compared to Snowflake users. Behavioral analysis revealed that SQL-LLM encouraged structured, schema-first querying strategies, enhancing user confidence and efficiency, particularly for complex queries. These findings underscore the practical significance of well-designed, user-friendly NLIDBs in business analytics settings, emphasizing the critical role of usability alongside technical accuracy in real-world deployments.
M-CALLM: Multi-level Context Aware LLM Framework for Group Interaction Prediction
This paper explores how large language models can leverage multi-level contextual information to predict group coordination patterns in collaborative mixed reality environments. We demonstrate that encoding individual behavioral profiles, group structural properties, and temporal dynamics as natural language enables LLMs to break through the performance ceiling of statistical models. We build M-CALLM, a framework that transforms multimodal sensor streams into hierarchical context for LLM-based prediction, and evaluate three paradigms (zero-shot prompting, few-shot learning, and supervised fine-tuning) against statistical baselines across intervention mode (real-time prediction) and simulation mode (autoregressive forecasting). Head-to-head comparison on 16 groups (64 participants, ~25 hours) demonstrates that context-aware LLMs achieve 96% accuracy for conversation prediction, a 3.2x improvement over LSTM baselines, while maintaining sub-35ms latency. However, simulation mode reveals brittleness with 83% degradation due to cascading errors. A deep dive into modality-specific performance shows conversation depends on temporal patterns, proximity benefits from group structure (+6%), while shared attention fails completely (0% recall), exposing architectural limitations. We hope this work spawns new ideas for building intelligent collaborative sensing systems that balance semantic reasoning capabilities with fundamental constraints.
Theoretical basis for code presentation: A case for cognitive load
Evidence supports that reducing cognitive load (CL) improves task performance for people of all abilities. This effect is especially important for blind-and-low-vision (BLV) individuals because they cannot rely on many common methods of managing CL, which are frequently vision-based techniques. Current accessible "solutions" for BLV developers only sporadically consider CL in their design. There isn't a way to know whether CL is being alleviated by them. Neither do we know if alleviating CL is part of the mechanism behind why these solutions help BLV people. Using a strong foundation in psychological sciences, we identify aspects of CL that impact performance and learning in programming. These aspects are then examined when evaluating existing solutions for programming sub-tasks for BLV users. We propose initial design "recommendations" for the presentation of code which, when followed, will reduce cognitive load for BLV developers.
comment: 10 pages, 1 figure
SecureSign: Bridging Security and UX in Mobile Web3 through Emulated EIP-6963 Sandboxing
Mobile Web3 faces catastrophic retention (< 5%) yielding effective acquisition costs of $500 - $1,000 per retained user. Existing solutions force an impossible tradeoff: embedded wallets achieve moderate usability but suffer inherent click-jacking vulnerabilities; app wallets maintain security at the cost of 2 - 3% retention due to download friction and context-switching penalties. We present SecureSign, a PWA-based architecture that adapts desktop browser extension security to mobile via EIP-6963 provider sandboxing. SecureSign isolates dApp execution in iframes within a trusted parent application, achieving click-jacking immunity and transaction integrity while enabling native mobile capabilities (push notifications, home screen installation, zero context-switching). Our drop-in SDK requires no codebase changes for existing Web3 applications. Threat model analysis demonstrates immunity to click-jacking, overlay, and skimming attacks while maintaining wallet interoperability across dApps.
comment: 19 pages, 11 figures
Biased Minds Meet Biased AI: How Class Imbalance Shapes Appropriate Reliance and Interacts with Human Base Rate Neglect
Humans increasingly interact with artificial intelligence (AI) in decision-making. However, both AI and humans are prone to biases. While AI and human biases have been studied extensively in isolation, this paper examines their complex interaction. Specifically, we examined how class imbalance as an AI bias affects people's ability to appropriately rely on an AI-based decision-support system, and how it interacts with base rate neglect as a human bias. In a within-subject online study (N = 46), participants classified three diseases using an AI-based decision-support system trained on either a balanced or unbalanced dataset. We found that class imbalance disrupted participants' calibration of AI reliance. Moreover, we observed mutually reinforcing effects between class imbalance and base rate neglect, offering evidence of a compound human-AI bias. Based on these findings, we advocate for an interactionist perspective and further research into the mutually reinforcing effects of biases in human-AI interaction.
SweeperBot: Making 3D Browsing Accessible through View Analysis and Visual Question Answering
Accessing 3D models remains challenging for Screen Reader (SR) users. While some existing 3D viewers allow creators to provide alternative text, they often lack sufficient detail about the 3D models. Grounded in a formative study, this paper introduces SweeperBot, a system that enables SR users to leverage visual question answering to explore and compare 3D models. SweeperBot answers SR users' visual questions by combining an optimal view selection technique with the strength of generative- and recognition-based foundation models. An expert review with 10 Blind and Low-Vision (BLV) users with SR experience demonstrated the feasibility of using SweeperBot to assist BLV users in exploring and comparing 3D models. The quality of the descriptions generated by SweeperBot was validated by a second survey study with 30 sighted participants.
comment: 28 pages, 16 figures, this article has been accepted for publication in the International Journal of Human-Computer Interaction (IJHCI), published by Taylor and Francis
Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning ACL
We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.
comment: 8 pages, 2 figures, 1 Table. Submitted to the Computation and Language (cs.CL) category. Uses the ACL-style template. Code and demo will be released at: https://github.com/trystine/Tell_Me_Mental_Wellbeing_System
Model Learning for Adjusting the Level of Automation in HCPS
The steadily increasing level of automation in human-centred systems demands rigorous design methods for analysing and controlling interactions between humans and automated components, especially in safety-critical applications. The variability of human behaviour poses particular challenges for formal verification and synthesis. We present a model-based framework that enables design-time exploration of safe shared-control strategies in human-automation systems. The approach combines active automata learning -- to derive coarse, finite-state abstractions of human behaviour from simulations -- with game-theoretic reactive synthesis to determine whether a controller can guarantee safety when interacting with these models. If no such strategy exists, the framework supports iterative refinement of the human model or adjustment of the automation's controllable actions. A driving case study, integrating automata learning with reactive synthesis in UPPAAL, illustrates the applicability of the framework on a simplified driving scenario and its potential for analysing shared-control strategies in human-centred cyber-physical systems.
comment: In Proceedings FMAS 2025, arXiv:2511.13245
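The synthesis question in the abstract above, whether a controller can guarantee safety against a finite-state model of human behaviour, is commonly answered by solving a safety game with a fixed-point computation. The sketch below shows that generic algorithm in plain Python. It does not use UPPAAL, and the state names, actions, and transition table in the usage note are hypothetical toy data rather than the paper's driving model.

```python
def _keeps_safe(successors, win):
    """An action is usable if it is enabled and every outcome stays in `win`."""
    return bool(successors) and successors <= win


def safe_winning_region(states, actions, step, unsafe):
    """Greatest fixed point of states from which safety can be enforced.

    step(state, action) returns the set of possible successor states; the
    uncontrollable side (for example, the learned human model) picks which
    successor actually occurs.
    """
    win = set(states) - set(unsafe)
    changed = True
    while changed:
        changed = False
        for s in list(win):
            if not any(_keeps_safe(step(s, a), win) for a in actions):
                win.discard(s)  # no action can keep the system safe from s
                changed = True
    return win


def safe_strategy(win, actions, step):
    """Pick, for each winning state, the first action that stays in `win`."""
    return {s: next(a for a in actions if _keeps_safe(step(s, a), win))
            for s in win}
```

As a usage illustration with made-up data: if braking from state 1 always returns to state 0 while going from state 1 may end in a crash, the winning region contains states 0 and 1 and the strategy brakes in state 1. If the initial state falls outside the winning region, no safe controller exists for that model, which corresponds to the point where the abstract proposes refining the human model or adjusting the automation's controllable actions.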
PACEE: Supporting Children's Personal Emotion Education through Parent-AI Collaboration
Emotion education is a crucial lesson for children aged 3 to 6. However, existing technologies primarily focus on promoting emotion education from the child's perspective, often neglecting the central role of parents in guiding early childhood emotion development. In this work, we conducted co-design sessions with five experienced kindergarten teachers and five parents to identify parental challenges and the roles that AI can play in family emotion education. Guided by these insights, we developed PACEE, an assistant for supporting parent-AI collaborative emotion education. PACEE enables parents to engage in emotional dialogues about common scenarios, with multiple forms of support provided by generative AI. It combines insights from parents and AI to model children's emotional states and collaboratively delivers personalized, parent-mediated guidance. In a user study involving 16 families, we found that PACEE significantly enhances parent-child engagement, encourages more in-depth emotional communication, and improves the parental experience. Our findings advance emotion coaching theory in both family settings and LLM-assisted contexts, offering valuable insights for designing AI-supported, parent-centered family education systems.
Towards LLM-Based Usability Analysis for Recommender User Interfaces
Usability is a key factor in the effectiveness of recommender systems. However, the analysis of user interfaces is a time-consuming process that requires expertise. Recent advances in multimodal large language models (LLMs) offer promising opportunities to automate such evaluations. In this work, we explore the potential of multimodal LLMs to assess the usability of recommender system interfaces by considering a variety of publicly available systems as examples. We take user interface screenshots from multiple of these recommender platforms to cover both preference elicitation and recommendation presentation scenarios. An LLM is instructed to analyze these interfaces with regard to different usability criteria and provide explanatory feedback. Our evaluation demonstrates how LLMs can support heuristic-style usability assessments at scale to support the improvement of user experience.
comment: The paper was presented at IntRS'25: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, September 22, 2025, Prague, Czech Republic and is published in the workshop proceedings: https://ceur-ws.org/Vol-4027/
TailCue: Exploring Animal-inspired Robotic Tail for Automated Vehicles Interaction
Automated vehicles (AVs) are gradually becoming part of our daily lives. However, effective communication between road users and AVs remains a significant challenge. Although various external human-machine interfaces (eHMIs) have been developed to facilitate interactions, psychological factors, such as a lack of trust and inadequate emotional signaling, may still deter users from confidently engaging with AVs in certain contexts. To address this gap, we propose TailCue, an exploration of how tail-based eHMIs affect user interaction with AVs. We first investigated mappings between tail movements and emotional expressions from robotics and zoology, and accordingly developed a motion-emotion mapping scheme. A physical robotic tail was implemented, and specific tail motions were designed based on our scheme. An online, video-based user study with 21 participants was conducted. Our findings suggest that, although the intended emotions conveyed by the tail were not consistently recognized, open-ended feedback indicated that the tail motion needs to align with the scenarios and cues. Our result highlights the necessity of scenario-specific optimization to enhance tail-based eHMIs. Future work will refine tail movement strategies to maximize their effectiveness across diverse interaction contexts.
Visionary Co-Driver: Enhancing Driver Perception of Potential Risks with LLM and HUD
Drivers' perception of risky situations has always been a challenge in driving. Existing risk-detection methods excel at identifying collisions but face challenges in assessing the behavior of road users in non-collision situations. This paper introduces Visionary Co-Driver, a system that leverages large language models to identify non-collision roadside risks and alert drivers based on their eye movements. Specifically, the system combines video processing algorithms and LLMs to identify potentially risky road users. These risks are dynamically indicated on an adaptive heads-up display interface to enhance drivers' attention. A user study with 41 drivers confirms that Visionary Co-Driver improves drivers' risk perception and supports their recognition of roadside risks.
comment: Accepted for publication in IEEE Transactions on Intelligent Transportation Systems (T-ITS)
Algorithmic Management and the Future of Human Work: Implications for Autonomy, Collaboration, and Innovation
This study examines the evolving impact of algorithmic management on human resource management (HRM) practices, with a focus on employee autonomy, procedural transparency, and the sociotechnical dynamics of performance evaluation. Rather than adopting a qualitative or empirical approach, the paper develops a conceptual integration of insights from HRM, human-computer interaction (HCI), and Science and Technology Studies. The analysis highlights that although algorithmic systems can enhance operational efficiency, they risk reinforcing biases and narrowing the relational and contextual dimensions of work. These systems often overlook intangible contributions such as creativity, empathy, and collaborative problem solving, revealing gaps in data-driven performance measurement. In response, the study proposes a sociotechnical perspective on algorithmic accountability that emphasizes procedural transparency, organizational justice, and employee agency. By revisiting foundational questions within the rapidly evolving landscape of algorithmic management, the paper contributes to ongoing debates about the future of work and the design of managerial technologies that support, rather than constrain, human autonomy and organizational life.
comment: 13 pages, 1 figure, 2 tables. Conceptual integration paper
MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals AAAI 2026
Reconstructing video from brain signals is an important brain decoding task. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the high cost of collecting brain-video data causes severe data scarcity. Although some cross-subject methods have been introduced, they often focus excessively on subject-invariant information while neglecting subject-specific information, resulting in slow fine-tuning-based adaptation strategies. To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject framework. MindCross's N specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-K collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects' encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross's efficacy and efficiency of cross-subject decoding and new subject adaptation using only one model.
comment: AAAI 2026, 16 pages
A Longitudinal Study on the Attitudes of Gay Men in Beijing Towards Gay Social Media Platforms: Lonely Souls in the Digital Concrete Jungle
Over the past decade, specialized social networking applications have become a cornerstone of life for many gay men in China. This paper employs a longitudinal mixed-methods approach to investigate how Chinese men who have sex with men (MSM) have shifted their attitudes toward these platforms between approximately 2013 and 2023. Drawing on archival analysis of online discourses, a quantitative survey of 412 participants, and in-depth semi-structured interviews with 32 participants, we trace the complex trajectory of this evolution. Our findings reveal a clear pattern: from the initial embrace of these applications as revolutionary tools for community building and identity affirmation (2014-2017), to a period of growing ambivalence and critique centered on commercialization, "hookup culture," and multiple forms of discrimination (2017-2020), and finally to the present era (2020-2023), characterized by pragmatic, fragmented, yet simultaneously critical and reconstructive uses. Today, users strategically employ a repertoire of applications -- including global platforms (e.g., Grindr and Tinder), domestic mainstream platforms (e.g., Blued), and niche alternatives (e.g., Aloha) -- to fulfill differentiated needs. We develop a detailed temporal framework to capture this attitudinal evolution and discuss its design implications for creating more supportive, secure, and community-oriented digital environments for marginalized groups.
Final Happiness: What Intelligent User Interfaces Can Do for the lonely Dying
This study explores the design of Intelligent User Interfaces (IUIs) to address the profound existential loneliness of terminally ill individuals. While Human-Computer Interaction (HCI) has made inroads in "Thanatechnology," current research often focuses on practical aspects like digital legacy management, overlooking the subjective, existential needs of those facing death in isolation. To address this gap, we conducted in-depth qualitative interviews with 14 lonely, terminally ill individuals. Our core contributions are: (1) An empirically-grounded model articulating the complex psychological, practical, social, and spiritual needs of this group; (2) The "Three Pillars, Twelve Principles" framework for designing IUIs as "Existential Companions"; and (3) A critical design directive derived from user evaluations: technology in this context should aim for transcendence over simulation. The findings suggest that IUIs should create experiences that augment or surpass human capabilities, rather than attempting to simulate basic human connections, which can paradoxically deepen loneliness. This research provides a clear, user-centered path for designing technology that serves not as a "tool for dying," but as a "partner for living fully until the end".
Gamified Virtual Reality Exposure Therapy for Mysophobia: Evaluating the Efficacy of a Simulated Sneeze Intervention
Mysophobia, or the fear of germs, is a prevalent anxiety disorder that significantly impacts daily life. This study investigates the potential of a gamified virtual reality (VR) intervention to simulate contamination-related scenarios and assess their emotional and psychological effects. A VR game-based sneeze simulation was developed to evaluate its influence on participants' emotional states. Seven participants completed two versions of the game: a baseline version and an experimental version featuring the sneeze simulation. Emotional responses were measured using the Positive and Negative Affect Schedule (PANAS) and State-Trait Anxiety Inventory - State (STAI-S) questionnaires. The results revealed slight increases in negative affect and anxiety levels, along with a reduction in positive affect, during the sneeze simulation. However, these differences were not statistically significant (p > 0.05). This is likely due to the small sample size, a lack of grossness in the simulation, or participants not being clinically mysophobic. This exploratory study highlights the potential of VR-based interventions for understanding and addressing contamination-related anxieties. It provides a foundation for future research with larger and more diverse participant pools.
Developing a Grounded View of AI
As a capability coming from computation, how does AI differ fundamentally from the capabilities delivered by rule-based software programs? The paper examines the behavior of artificial intelligence (AI) from an engineering point of view to clarify its nature and limits. The paper argues that the rationality underlying humanity's impulse to pursue, articulate, and adhere to rules deserves to be valued and preserved. Identifying where rule-based practical rationality ends is the beginning of making it aware until action. Although the rules of AI behaviors are still hidden or only weakly observable, the paper proposes a methodology to make a sense of discrimination possible and practical to identify the distinctions of the behavior of AI models with three types of decisions. It is a prerequisite for human responsibilities with alternative possibilities, considering how and when to use AI. It would be a solid start for people to ensure AI system soundness for the well-being of humans, society, and the environment.
Affective Color Scales for Colormap Data Visualizations
Research on affective visualization design has shown that color is an especially powerful feature for influencing the emotional connotation of visualizations. Associations between colors and emotions are largely driven by lightness (e.g., lighter colors are associated with positive emotions, whereas darker colors are associated with negative emotions). Designing visualizations to have all light or all dark colors to convey particular emotions may work well for visualizations in which colors represent categories and spatial channels encode data values. However, this approach poses a problem for visualizations that use color to represent spatial patterns in data (e.g., colormap data visualizations) because lightness contrast is needed to reveal fine details in spatial structure. In this study, we found it is possible to design colormaps that have strong lightness contrast to support spatial vision while communicating clear affective connotation. We also found that affective connotation depended not only on the color scales used to construct the colormaps, but also on the frequency with which colors appeared in the map, as determined by the underlying dataset (data-dependence hypothesis). These results emphasize the importance of data-aware design, which accounts for not only the design features that encode data (e.g., colors, shapes, textures), but also how those design features are instantiated in a visualization, given the properties of the data.
comment: To be published in IEEE Transactions on Visualization and Computer Graphics
Harmful Traits of AI Companions
Amid the growing prevalence of human-AI interaction, large language models and other AI-based entities increasingly provide forms of companionship to human users. Such AI companionship -- i.e., bonded relationships between humans and AI systems that resemble the relationships people have with family members, friends, and romantic partners -- might substantially benefit humans. Yet such relationships can also do profound harm. We propose a framework for analyzing potential negative impacts of AI companionship by identifying specific harmful traits of AI companions and speculatively mapping causal pathways back from these traits to possible causes and forward to potential harmful effects. We provide detailed, structured analysis of four potentially harmful traits -- the absence of natural endpoints for relationships, vulnerability to product sunsetting, high attachment anxiety, and propensity to engender protectiveness -- and briefly discuss fourteen others. For each trait, we propose hypotheses connecting causes -- such as misaligned optimization objectives and the digital nature of AI companions -- to fundamental harms -- including reduced autonomy, diminished quality of human relationships, and deception. Each hypothesized causal connection identifies a target for potential empirical evaluation. Our analysis examines harms at three levels: to human partners directly, to their relationships with other humans, and to society broadly. We examine how existing law struggles to address these emerging harms, discuss potential benefits of AI companions, and conclude with design recommendations for mitigating risks. This analysis offers immediate suggestions for reducing risks while laying a foundation for deeper investigation of this critical but understudied topic.
How Should the Law Treat Future AI Systems? Fictional Legal Personhood versus Legal Identity
The law draws a sharp distinction between objects and persons, and between two kinds of persons, the ''fictional'' kind (i.e. corporations), and the ''non-fictional'' kind (individual or ''natural'' persons). This paper will assess whether we maximize overall long-term legal coherence by (A) maintaining an object classification for all future AI systems, (B) creating fictional legal persons associated with suitably advanced, individuated AI systems (giving these fictional legal persons derogable rights and duties associated with certified groups of existing persons, potentially including free speech, contract rights, and standing to sue ''on behalf of'' the AI system), or (C) recognizing non-fictional legal personhood through legal identity for suitably advanced, individuated AI systems (recognizing them as entities meriting legal standing with non-derogable rights which for the human case include life, due process, habeas corpus, freedom from slavery, and freedom of conscience). We will clarify the meaning and implications of each option along the way, considering liability, copyright, family law, fundamental rights, civil rights, citizenship, and AI safety regulation. We will tentatively find that the non-fictional personhood approach may be best from a coherence perspective, for at least some advanced AI systems. An object approach may prove untenable for sufficiently humanoid advanced systems, though we suggest that it is adequate for currently existing systems as of 2025. While fictional personhood would resolve some coherence issues for future systems, it would create others and provide solutions that are neither durable nor fit for purpose. Finally, our review will suggest that ''hybrid'' approaches are likely to fail and lead to further incoherence: the choice between object, fictional person and non-fictional person is unavoidable.
comment: 69 pages. Forthcoming in Case Western Journal of Law, Technology & the Internet (publication offer date September 2025)
Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer
Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human-computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large, high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain-computer interface systems.
comment: Master's thesis, Korea University, 2025
Translating Cultural Choreography from Humanoid Forms to Robotic Arm
Robotic arm choreography often reproduces trajectories while missing cultural semantics. This study examines whether symbolic posture transfer with joint space compatible notation can preserve semantic fidelity on a six-degree-of-freedom arm and remain portable across morphologies. We implement ROPERA, a three-stage pipeline for encoding culturally codified postures, composing symbolic sequences, and decoding to servo commands. A scene from Kunqu opera, \textit{The Peony Pavilion}, serves as the material for evaluation. The procedure includes corpus-based posture selection, symbolic scoring, direct joint angle execution, and a visual layer with light painting and costume-informed colors. Results indicate reproducible execution with intended timing and cultural legibility reported by experts and audiences. The study points to non-anthropocentric cultural preservation and portable authoring workflows. Future work will design dance-informed transition profiles, extend the notation to locomotion with haptic, musical, and spatial cues, and test portability across platforms.
Embedding Explainable AI in NHS Clinical Safety: The Explainability-Enabled Clinical Safety Framework (ECSF)
Artificial intelligence (AI) is increasingly embedded in NHS workflows, but its probabilistic and adaptive behaviour conflicts with the deterministic assumptions underpinning existing clinical-safety standards. DCB0129 and DCB0160 provide strong governance for conventional software yet do not define how AI-specific transparency, interpretability, or model drift should be evidenced within Safety Cases, Hazard Logs, or post-market monitoring. This paper proposes an Explainability-Enabled Clinical Safety Framework (ECSF) that integrates explainability into the DCB0129/0160 lifecycle, enabling Clinical Safety Officers to use interpretability outputs as structured safety evidence without altering compliance pathways. A cross-regulatory synthesis mapped DCB clauses to principles from Good Machine Learning Practice, the NHS AI Assurance and T.E.S.T. frameworks, and the EU AI Act. The resulting matrix links regulatory clauses, principles, ECSF checkpoints, and suitable explainability outputs. ECSF introduces five checkpoints: global transparency for hazard identification, case-level interpretability for verification, clinician usability for evaluation, traceable decision pathways for risk control, and longitudinal interpretability monitoring for post-market surveillance. Techniques such as SHAP, LIME, Integrated Gradients, saliency mapping, and attention visualisation are mapped to corresponding DCB artefacts. ECSF reframes explainability as a core element of clinical-safety assurance, bridging deterministic risk governance with the probabilistic behaviour of AI and supporting alignment with GMLP, the EU AI Act, and NHS AI Assurance principles.
comment: 33 pages, 5 figures
Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies, which enable new evaluations without requiring model reimplementation; our open-source rendering script; and the 16,000 pairwise human preference votes collected for our benchmark.
comment: 23 pages, 10 figures. The last two authors made equal contributions
The EU AI Act, Stakeholder Needs, and Explainable AI: Aligning Regulatory Compliance in a Clinical Decision Support System
Explainable AI (XAI) is a promising route to comply with the EU AI Act, the first multinational AI regulation. XAI enhances transparency and human oversight of AI systems, especially ''black-box'' models criticized as incomprehensible. Yet discourse about the AI Act's stakeholders and XAI remains disconnected: XAI increasingly prioritizes end users' needs, while the AI Act focuses on providers' and deployers' obligations. We aim to bridge this divide and offer practical guidance on their relationship. Through interdisciplinary discussion in a cross-functional team of XAI, AI Act, legal, and requirements-engineering experts, we outline steps to analyze an AI-based clinical decision support system, clarify end-user needs, and assess AI Act applicability. Using an AI system under development as a case study, we show how XAI techniques can help reconcile stakeholder needs with AI Act requirements and fill gaps between usability and regulatory demands. We compare similarities and differences between legal obligations and end-user needs, identify tensions, and point to concrete design choices and trade-offs. We invite researchers and practitioners in XAI to reflect on their role relative to the AI Act and to develop mutual understanding across disciplines. While XAI can help implement core AI Act principles such as transparency and human oversight, it should be considered one element of a broader compliance strategy that also requires standardization, legal interpretation, documentation, organizational processes, governance, testing, and ongoing monitoring and auditing practices. Our findings yield actionable recommendations for integrating XAI into product development, compliance workflows, and stakeholder communication, informing policy-making and standards development.
comment: 18 pages, 2 figures
Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction where users are intentional about the degree to which they specify queries. We develop a principled framework based on a shared responsibility of query specification between user and system, distinguishing unambiguous and ambiguous cooperative queries, which systems can resolve through reasonable inference, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze the queries in 15 popular datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system's execution accuracy nor for evaluating interpretation capabilities. This conceptualization around cooperation in resolving queries informs how to design and evaluate natural language interfaces for tabular data analysis, for which we distill concrete directions for future research and broader implications.
comment: Accepted to the AI for Tabular Data workshop at EurIPS 2025
BeautyGuard: Designing a Multi-Agent Roundtable System for Proactive Beauty Tech Compliance through Stakeholder Collaboration
As generative AI enters enterprise workflows, ensuring compliance with legal, ethical, and reputational standards becomes a pressing challenge. In beauty tech, where biometric and personal data are central, traditional reviews are often manual, fragmented, and reactive. To examine these challenges, we conducted a formative study with six experts (four IT managers, two legal managers) at a multinational beauty company. The study revealed pain points in rule checking, precedent use, and the lack of proactive guidance. Motivated by these findings, we designed a multi-agent "roundtable" system powered by a large language model. The system assigns role-specialized agents for legal interpretation, checklist review, precedent search, and risk mitigation, synthesizing their perspectives into structured compliance advice. We evaluated the prototype with the same experts using the System Usability Scale (SUS), the NASA Task Load Index (NASA-TLX), and interviews. Results show exceptional usability (SUS: 77.5/100) and minimal cognitive workload, with three key findings: (1) multi-agent systems can preserve tacit knowledge in standardized workflows, (2) information augmentation achieves higher acceptance than decision automation, and (3) successful enterprise AI should mirror organizational structures. This work contributes design principles for human-AI collaboration in compliance review, with broader implications for regulated industries beyond beauty tech.
comment: International Conference on Human-Engaged Computing (ICHEC 2025), Singapore
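The SUS figure quoted above (77.5/100) is a composite of ten 1-to-5 Likert items. The sketch below shows the standard SUS scoring rule for a single respondent; the function name `sus_score` and the example responses are hypothetical and are not taken from the paper.

```python
def sus_score(responses):
    """Score one respondent's System Usability Scale questionnaire.

    `responses` holds ten integers in 1..5, in questionnaire order.
    Odd-numbered items (1st, 3rd, ...) are positively worded and contribute
    (response - 1); even-numbered items are negatively worded and contribute
    (5 - response). The summed contributions (0..40) are scaled to 0..100.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:   # items 1, 3, 5, 7, 9
            total += response - 1
        else:                # items 2, 4, 6, 8, 10
            total += 5 - response
    return total * 2.5


# Hypothetical respondent, for illustration only.
example = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example))
```

A study-level score such as the one reported is typically the mean of the per-respondent scores.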
Colour Perception in Immersive Virtual Reality: Emotional and Physiological Responses to Fifteen Munsell Hues
Colour is a fundamental determinant of affective experience in immersive virtual reality (VR), yet the emotional and physiological impact of individual hues remains poorly characterised. This study investigated how fifteen calibrated Munsell hues influence subjective and autonomic responses when presented in immersive VR. Thirty-six adults (18-45 years) viewed each hue in a within-subject design while pupil diameter and skin conductance were recorded continuously, and self-reported emotions were assessed using the Self-Assessment Manikin across pleasure, arousal, and dominance. Repeated-measures ANOVAs revealed robust hue effects on all three self-report dimensions and on pupil dilation, with medium to large effect sizes. Reds and red-purple hues elicited the highest arousal and dominance, whereas blue-green hues were rated most pleasurable. Pupil dilation closely tracked arousal ratings, while skin conductance showed no reliable hue differentiation, likely due to the brief (30 s) exposures. Individual differences in cognitive style and personality modulated overall reactivity but did not alter the relative ranking of hues. Taken together, these findings provide the first systematic hue-by-hue mapping of affective and physiological responses in immersive VR. They demonstrate that calibrated colour shapes both experience and ocular physiology, while also offering practical guidance for educational, clinical, and interface design in virtual environments.
comment: 24 pages, 6 Figures, 9 Tables
The Social Gaze of LLMs: A Literature Review of Multimodal Approaches to Human Behavior Understanding
LLM-powered multimodal systems are increasingly used to interpret human behavior, yet how researchers apply the models' 'social competence' remains poorly understood. This paper presents a systematic literature review of 176 publications across different application domains (e.g., healthcare, education, and entertainment). Using a four-dimensional coding framework (application, technical, evaluative, and ethical), we find (1) frequent use of pattern recognition and information extraction from multimodal sources, but limited support for adaptive, interactive reasoning; (2) a dominant 'modality-to-text' pipeline that privileges language over rich audiovisual cues, stripping away nuanced social cues; (3) evaluation practices reliant on static benchmarks, with socially grounded, human-centered assessments rare; and (4) ethical discussions focused mainly on legal and rights-related risks (e.g., privacy), leaving societal risks (e.g., deception) overlooked--or at best acknowledged but left unaddressed. We outline a research agenda for evaluating socially competent, ethically informed, and interaction-aware multimodal systems.
TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech AAAI 2026
Sketching is a widely used medium for generating and exploring early-stage design concepts. While generative AI (GenAI) chatbots are increasingly used for idea generation, designers often struggle to craft effective prompts and find it difficult to express evolving visual concepts through text alone. In a formative study (N=6), we examined how designers use GenAI during ideation, revealing that text-based prompting disrupts creative flow. To address these issues, we developed TalkSketch, an embedded multimodal AI sketching system that integrates freehand drawing with real-time speech input. TalkSketch aims to support a more fluid ideation process by capturing verbal descriptions during sketching and generating context-aware AI responses. Our work highlights the potential of GenAI tools to engage the design process itself rather than focusing on output.
comment: Accepted at AAAI 2026 Workshop on Creative AI for Live Interactive Performances (CLIP). To be published in Springer CCIS series
MedBuild AI: An Agent-Based Hybrid Intelligence Framework for Reshaping Agency in Healthcare Infrastructure Planning through Generative Design for Medical Architecture
Globally, disparities in healthcare infrastructure remain stark, leaving countless communities without access to even basic services. Traditional infrastructure planning is often slow and inaccessible, and although many architects are actively delivering humanitarian and aid-driven hospital projects worldwide, these vital efforts still fall far short of the sheer scale and urgency of demand. This paper introduces MedBuild AI, a hybrid-intelligence framework that integrates large language models (LLMs) with deterministic expert systems to rebalance the early design and conceptual planning stages. As a web-based platform, it enables any region with satellite internet access to obtain guidance on modular, low-tech, low-cost medical building designs. The system operates through three agents: the first gathers local health intelligence via conversational interaction; the second translates this input into an architectural functional program through rule-based computation; and the third generates layouts and 3D models. By embedding computational negotiation into the design process, MedBuild AI fosters a reciprocal, inclusive, and equitable approach to healthcare planning, empowering communities and redefining agency in global healthcare architecture.
comment: 25 pages, 16 figures. Submitted to the IJAC Special Issue "Rebalance and Reciprocity"
Deep Learning and Machine Learning -- Natural Language Processing: From Theory to Application
With a focus on natural language processing (NLP) and the role of large language models (LLMs), we explore the intersection of machine learning, deep learning, and artificial intelligence. As artificial intelligence continues to revolutionize fields from healthcare to finance, NLP techniques such as tokenization, text classification, and entity recognition are essential for processing and understanding human language. This paper discusses advanced data preprocessing techniques and the use of frameworks like Hugging Face for implementing transformer-based models. Additionally, it highlights challenges such as handling multilingual data, reducing bias, and ensuring model robustness. By addressing key aspects of data processing and model fine-tuning, this work aims to provide insights into deploying effective and ethically sound AI solutions.
comment: 252 pages
Human-AI Collaboration and Explainability for 2D/3D Registration Quality Assurance
Purpose: As surgery increasingly integrates advanced imaging, algorithms, and robotics to automate complex tasks, human judgment of system correctness remains a vital safeguard for patient safety. A critical example is 2D/3D registration, where small registration misalignments can lead to surgical errors. Current visualization strategies alone are insufficient to reliably enable humans to detect these misalignments, highlighting the need for enhanced decision-support tools. Methods: We propose the first artificial intelligence (AI) model tailored to 2D/3D registration quality assessment, augmented with explainable AI (XAI) mechanisms to clarify the model's predictions. Using both objective measures (e.g., accuracy, sensitivity, precision, specificity) and subjective evaluations (e.g., workload, trust, and understanding), we systematically compare decision-making across four conditions: AI-only, Human-only, Human+AI, and Human+XAI. Results: The AI-only condition achieved the highest accuracy, whereas collaborative paradigms (Human+AI and Human+XAI) improved sensitivity, precision, and specificity compared to standalone approaches. Participants experienced significantly lower workload in collaborative conditions relative to the Human-only condition. Moreover, participants reported a greater understanding of AI predictions in the Human+XAI condition than in Human+AI, although no significant differences were observed between the two collaborative paradigms in perceived trust or workload. Conclusion: Human-AI collaboration can enhance 2D/3D registration quality assurance, with explainability mechanisms improving user understanding. Future work should refine XAI designs to optimize decision-making performance and efficiency. Extending both the algorithmic design and human-XAI collaboration elements holds promise for more robust quality assurance of 2D/3D registration.
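The four objective measures named in the abstract above all follow from a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical illustration values, not results from the paper, and treating "misaligned registration" as the positive class is an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, precision, and specificity.

    tp/fp/tn/fn are confusion-matrix counts. Here "positive" is assumed to
    mean "registration flagged as misaligned" (an illustrative choice).
    A ratio with a zero denominator is reported as NaN.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=2, tn=6, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy pools both classes while the other three condition on one class or one prediction, one condition can lead on accuracy while another leads on sensitivity, precision, or specificity.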
Computer Vision and Pattern Recognition
Back to Basics: Let Denoising Generative Models Denoise
Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "$\textbf{Just image Transformers}$", or $\textbf{JiT}$, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
comment: Tech report. Code at https://github.com/LTH14/JiT
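To illustrate the distinction between predicting clean data and predicting a noised quantity, here is a minimal sketch. It assumes a linear interpolation schedule z_t = t * x + (1 - t) * eps, which the abstract does not specify; all function names are hypothetical. Under this assumption the two prediction targets are algebraically interconvertible given z_t, so the difference lies in what the network is asked to regress, not in the information available.

```python
import random


def add_noise(x, eps, t):
    """Form the noised sample z_t = t * x + (1 - t) * eps for t in (0, 1)."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]


def eps_from_x(z_t, x_pred, t):
    """Noise implied by a clean-data prediction (x-prediction)."""
    return [(zi - t * xi) / (1.0 - t) for zi, xi in zip(z_t, x_pred)]


def x_from_eps(z_t, eps_pred, t):
    """Clean data implied by a noise prediction (epsilon-prediction)."""
    return [(zi - (1.0 - t) * ei) / t for zi, ei in zip(z_t, eps_pred)]


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(4)]   # stand-in for clean data
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]    # Gaussian noise
t = 0.3
z_t = add_noise(x, eps, t)

# With perfect predictions, either parameterization recovers the other.
x_rec = x_from_eps(z_t, eps, t)
eps_rec = eps_from_x(z_t, x, t)
print(max(abs(a - b) for a, b in zip(x, x_rec)))
print(max(abs(a - b) for a, b in zip(eps, eps_rec)))
```

The abstract's argument concerns the regression target: clean data is assumed to lie on a low-dimensional manifold, whereas the noise target fills the full ambient space, which is harder for a network with limited capacity to represent.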
Scaling Spatial Intelligence with Multimodal Foundation Models
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
comment: Model: https://huggingface.co/collections/sensenova/sensenova-si; Code: https://github.com/OpenSenseNova/SenseNova-SI
Segment Anything Across Shots: A Method and Benchmark AAAI 2026
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. Existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, thereby limiting their real-world applicability. We propose a transition-mimicking data augmentation strategy (TMA), which enables cross-shot generalization with single-shot data to alleviate the severe sparsity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.
comment: AAAI 2026, Project Page: https://henghuiding.com/SAAS/
UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine AAAI 2026
Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.
comment: AAAI 2026, Project Page: https://henghuiding.com/FFSE/
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
comment: Project: https://haroldchen19.github.io/TiViBench-Page/
Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
comment: Submitting for Neurocomputing
QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention DATE 2026
Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer--forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W''m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within <=0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality--and locality into utilization--QUILL delivers consistent, end-to-end speedups.
comment: Accepted to DATE 2026
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings, OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.
Tuning for Two Adversaries: Enhancing the Robustness Against Transfer and Query-Based Attacks using Hyperparameter Tuning AAAI
In this paper, we present the first detailed analysis of how optimization hyperparameters -- such as learning rate, weight decay, momentum, and batch size -- influence robustness against both transfer-based and query-based attacks. Supported by theory and experiments, our study spans a variety of practical deployment settings, including centralized training, ensemble learning, and distributed training. We uncover a striking dichotomy: for transfer-based attacks, decreasing the learning rate significantly enhances robustness by up to $64\%$. In contrast, for query-based attacks, increasing the learning rate consistently leads to improved robustness by up to $28\%$ across various settings and data distributions. Leveraging these findings, we explore -- for the first time -- the optimization hyperparameter design space to jointly enhance robustness against both transfer-based and query-based attacks. Our results reveal that distributed models benefit the most from hyperparameter tuning, achieving a remarkable tradeoff by simultaneously mitigating both attack types more effectively than other training setups.
comment: To appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2026
Distribution Matching Distillation Meets Reinforcement Learning
Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that integrates Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR achieves leading visual quality and prompt coherence among few-step methods, and even exhibits performance that exceeds the multi-step teacher.
comment: The synergy of reinforcement learning and distribution matching distillation. See more: https://github.com/vvvvvjdy/dmdr
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
comment: Project page: https://physx-anything.github.io/
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q\&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding
Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% fewer tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
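As a rough illustration of the Dynamic Token Dropping step described in the CacheFlow abstract above, the sketch below drops patch tokens that are nearly identical to the same patch in the previous frame and packs the survivors into blocks. The function names, the 0.9 similarity threshold, and the block size are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_tokens(prev_frame, cur_frame, threshold=0.9):
    """Keep (patch_index, token) pairs that changed enough since the previous frame.

    A token is dropped when its cosine similarity to the token at the same
    patch position in the previous frame is at or above `threshold`
    (hypothetical value).
    """
    kept = []
    for idx, (prev_tok, cur_tok) in enumerate(zip(prev_frame, cur_frame)):
        if cosine(prev_tok, cur_tok) < threshold:
            kept.append((idx, cur_tok))
    return kept


def pack_blocks(tokens, block_size):
    """Group surviving tokens into consecutive blocks; the last block may be shorter."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.01]]
survivors = drop_redundant_tokens(prev, cur)  # only patch 1 changed noticeably
blocks = pack_blocks(survivors, block_size=2)
```

In a streaming setting, a function like `drop_redundant_tokens` would run once per incoming frame, so the cost per frame stays constant regardless of video length.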
Alpha Divergence Losses for Biometric Verification
Performance in face and speaker verification is largely driven by margin-based softmax losses such as CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify a critical training instability in A3M, caused by the interplay of penalized logits and sparsity, and address it with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount.
A Real-Time Driver Drowsiness Detection System Using MediaPipe and Eye Aspect Ratio
Driver fatigue is one of the major causes of road accidents, leading to thousands of fatalities and injuries every year. This study presents the development of a Driver Drowsiness Detection System intended to improve road safety by alerting drivers who show signs of drowsiness. The system uses a standard webcam to track the driver's facial features, with the main emphasis on eye movements, which are analyzed using the Eye Aspect Ratio (EAR) method. MediaPipe Face Mesh is a lightweight framework that identifies facial landmarks with high accuracy and efficiency, which is important for real-time use. The system detects prolonged eye closures or a very low blink rate, both of which are signs of drowsiness, and sounds an alert to regain the driver's attention. By combining OpenCV for image processing with MediaPipe for facial landmark detection, the system provides a high-performance, low-cost driver monitoring solution. Experimental analysis on test data indicates that the system is highly accurate and responds quickly, confirming that it can serve as a component of current Advanced Driver Assistance Systems (ADAS).
comment: 6 pages, 8 referenced papers
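The Eye Aspect Ratio mentioned in the drowsiness-detection abstract above can be sketched as follows, using the commonly cited six-landmark definition, EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|), where p1 and p4 are the horizontal eye corners. The abstract does not say which Face Mesh landmarks, threshold, or frame count the authors use, so the 0.2 threshold, the three-frame window, and the function names below are assumptions for illustration only.

```python
import math


def eye_aspect_ratio(eye):
    """Compute EAR from six (x, y) landmarks ordered p1..p6.

    p1 and p4 are the horizontal eye corners; (p2, p6) and (p3, p5)
    are the two vertical landmark pairs.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def eyes_closed_too_long(ear_values, threshold=0.2, min_consecutive=3):
    """Return True if EAR stays below `threshold` for `min_consecutive` frames in a row.

    Both parameters are hypothetical and would need tuning per camera and driver.
    """
    run = 0
    for ear in ear_values:
        run = run + 1 if ear < threshold else 0
        if run >= min_consecutive:
            return True
    return False


open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
```

A short blink produces only a brief dip in EAR, which is why the check requires several consecutive low-EAR frames before raising an alert.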
Tissue Aware Nuclei Detection and Classification Model for Histopathology Images
Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.
comment: 5 pages, 3 figures. Under review
AtlasMorph: Learning conditional deformable templates for brain MRI
Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other template construction methods.
ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement AAAI-26
The Low-Light Image Enhancement (LLIE) task aims at improving contrast while restoring details and textures for images captured in low-light conditions. HVI color space has made significant progress in this task by enabling precise decoupling of chrominance and luminance. However, for the interaction of chrominance and luminance branches, substantial distributional differences between the two branches prevalent in natural images limit complementary feature extraction, and luminance errors are propagated to chrominance channels through the nonlinear parameter. Furthermore, for interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses exploit strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework including a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information along two dimensions, fusion and enhancement. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining the covariance of the chrominance branches. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.
comment: Accepted by AAAI-26
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the interchangeability of visual tokens, we explore, for the first time, verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key factors in retaining generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework, VVS, to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images
Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell's function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain. To address these limitations, we introduce NuClass, a pathologist workflow inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability. To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.
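To make the idea of a learnable gate balancing local and global paths (as described in the NuClass abstract above) more concrete, here is a minimal sketch under stated assumptions. It uses a single scalar sigmoid gate over the concatenated features; the abstract does not describe the actual gate architecture, so the function names, the scalar gate, and the linear parameterization are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(local_feat, global_feat, weights, bias=0.0):
    """Blend two equal-length feature vectors with one scalar gate.

    The gate is a sigmoid of a linear function of the concatenated features;
    `weights` must have length len(local_feat) + len(global_feat).
    A gate near 1 favors the local path, a gate near 0 favors the global path.
    """
    concat = list(local_feat) + list(global_feat)
    gate = sigmoid(sum(w * x for w, x in zip(weights, concat)) + bias)
    fused = [gate * l + (1.0 - gate) * g for l, g in zip(local_feat, global_feat)]
    return gate, fused


# With zero weights and zero bias the gate is 0.5, so the two paths are averaged.
gate, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], weights=[0.0] * 4)
```

In a trained model the weights and bias would be learned jointly with both paths, so the gate can vary per cell depending on how informative the local crop is.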
Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification AAAI 2026
Person re-identification (ReID) aims to retrieve target pedestrian images given either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). Although both tasks share a common retrieval objective, they pose distinct challenges: I2I emphasizes discriminative identity learning, while T2I requires accurate cross-modal semantic alignment. Existing methods often treat these tasks separately, which may lead to representation entanglement and suboptimal performance. To address this, we propose a unified framework named Hierarchical Prompt Learning (HPL), which leverages task-aware prompt modeling to jointly optimize both tasks. Specifically, we first introduce a Task-Routed Transformer, which incorporates dual classification tokens into a shared visual encoder to route features for I2I and T2I branches respectively. On top of this, we develop a hierarchical prompt generation scheme that integrates identity-level learnable tokens with instance-level pseudo-text tokens. These pseudo-tokens are derived from image or text features via modality-specific inversion networks, injecting fine-grained, instance-specific semantics into the prompts. Furthermore, we propose a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, ensuring that pseudo-prompts preserve source-modality characteristics while enhancing cross-modal transferability. Extensive experiments on multiple ReID benchmarks validate the effectiveness of our method, achieving state-of-the-art performance on both I2I and T2I tasks.
comment: 9 pages, 4 figures, accepted by AAAI 2026
Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation AAAI 2026
3D Gaussian Splatting (3DGS) has emerged as a leading framework for novel view synthesis, yet its core optimization challenges remain underexplored. We identify two key issues in 3DGS optimization: entrapment in suboptimal local optima and insufficient convergence quality. To address these, we propose Opt3DGS, a robust framework that enhances 3DGS through a two-stage optimization process of adaptive exploration and curvature-guided exploitation. In the exploration phase, an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method enhances global search to escape local optima. In the exploitation phase, a Local Quasi-Newton Direction-guided Adam optimizer leverages curvature information for precise and efficient convergence. Extensive experiments on diverse benchmark datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization process without modifying its underlying representation.
comment: Accepted at AAAI 2026 as a Conference Paper
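For readers unfamiliar with Stochastic Gradient Langevin Dynamics, which the Opt3DGS abstract above uses in its exploration phase, the sketch below shows one step of plain SGLD: a half-step-size gradient move plus Gaussian noise whose variance equals the step size. The adaptive weighting proposed in the paper is not described in the abstract and is not reproduced here; `noise_scale` is a hypothetical knob added only so the noise can be switched off.

```python
import math
import random


def sgld_step(params, grads, step_size, noise_scale=1.0, rng=random):
    """One plain SGLD update on a list of scalar parameters.

    theta <- theta - (step_size / 2) * grad + noise_scale * sqrt(step_size) * N(0, 1)

    With noise_scale=0.0 this reduces to ordinary gradient descent
    with learning rate step_size / 2.
    """
    return [
        p - 0.5 * step_size * g + noise_scale * math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        for p, g in zip(params, grads)
    ]


# Toy objective U(theta) = theta^2 / 2, whose gradient is theta itself.
theta = [5.0]
deterministic = sgld_step(theta, grads=theta, step_size=0.2, noise_scale=0.0)
noisy = sgld_step(theta, grads=theta, step_size=0.2, rng=random.Random(0))
```

The injected noise is what lets the iterate leave a poor local optimum; an exploitation phase would then switch to a noise-free optimizer.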
TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images
Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced the capabilities of monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Code is available at https://github.com/zhu-xlab/tse-net.
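Two mechanisms in the TSE-Net abstract above lend themselves to a short sketch: the exam network as a temporal ensemble of the student, and confidence-based filtering of pseudo-labels. The code below assumes the temporal ensemble is an exponential moving average (EMA) of student weights, which is a common choice but is not confirmed by the abstract. The decay value, the confidence threshold, and the function names are likewise hypothetical.

```python
def ema_update(exam_weights, student_weights, decay=0.99):
    """Move each exam weight a small step toward the matching student weight."""
    return [decay * e + (1.0 - decay) * s for e, s in zip(exam_weights, student_weights)]


def filter_pseudo_labels(heights, confidences, min_confidence=0.8):
    """Keep (pixel_index, height) pairs whose class probability is high enough.

    `heights` stands in for the teacher's regression outputs and `confidences`
    for the (calibrated) class probabilities from its classification branch.
    """
    return [
        (i, h)
        for i, (h, c) in enumerate(zip(heights, confidences))
        if c >= min_confidence
    ]


exam = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.5)
kept = filter_pseudo_labels([3.0, 12.5, 7.1], [0.9, 0.4, 0.8])
```

Only the retained pairs would contribute to the student's loss on unlabeled images, while `ema_update` would run after each student optimization step.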
Robust Defense Strategies for Multimodal Contrastive Learning: Efficient Fine-tuning Against Backdoor Attacks
The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are vulnerable to adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning using a large dataset without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can identify the backdoor trigger and pinpoint the victim samples and labels in an efficient manner. To that end, an image segmentation "oracle" is introduced as the supervisor for the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) differentiating between CLIP and Oracle's knowledge to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we can rectify the poisoned CLIP model to negate backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate that our strategy is effective in CLIP-based backdoor defense.
BootOOD: Self-Supervised Out-of-Distribution Detection via Synthetic Sample Exposure under Neural Collapse
Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.
comment: 8 pages
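To make the radius-based idea above concrete, here is a minimal sketch of scoring samples by feature norm. It is a simplified stand-in, not BootOOD itself: the paper describes a learned auxiliary head, whereas this example uses a fixed threshold, and the names `feature_norms`, `flag_ood`, and `radius` are assumptions introduced for illustration.

```python
import numpy as np


def feature_norms(features):
    """L2 norm of each row of an (n_samples, dim) feature matrix."""
    return np.linalg.norm(features, axis=1)


def flag_ood(features, radius):
    """Flag samples whose feature norm falls below `radius` as OOD."""
    return feature_norms(features) < radius


feats = np.array([[3.0, 4.0],    # large-norm feature, treated as ID
                  [0.3, 0.4]])   # small-norm feature, treated as OOD
flags = flag_ood(feats, radius=1.0)
```

Because the decision depends only on the norm, it does not require OOD features to lie in any particular direction relative to the class means, which is the relaxed requirement the abstract refers to.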
Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew
As machine learning models are increasingly deployed in safety-critical domains, visual explanation techniques have become essential tools for supporting transparency. In this work, we reveal a new class of attacks that compromise model interpretability without affecting accuracy. Specifically, we show that small color perturbations applied by adversarial clients in a federated learning setting can shift a model's saliency maps away from semantically meaningful regions while keeping the prediction unchanged. The proposed saliency-aware attack framework, called Chromatic Perturbation Module, systematically crafts adversarial examples by altering the color contrast between foreground and background in a way that disrupts explanation fidelity. These perturbations accumulate across training rounds, poisoning the global model's internal feature attributions in a stealthy and persistent manner. Our findings challenge a common assumption in model auditing that correct predictions imply faithful explanations and demonstrate that interpretability itself can be an attack surface. We evaluate this vulnerability across multiple datasets and show that standard training pipelines are insufficient to detect or mitigate explanation degradation, especially in the federated learning setting, where subtle color perturbations are harder to discern. Our attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% on all evaluated datasets.
Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems
In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.
Mapping the Vanishing and Transformation of Urban Villages in China
Urban villages (UVs), informal settlements embedded within China's urban fabric, have undergone widespread demolition and redevelopment in recent decades. However, there remains a lack of systematic evaluation of whether the demolished land has been effectively reused, raising concerns about the efficacy and sustainability of current redevelopment practices. To address the gap, this study proposes a deep learning-based framework to monitor the spatiotemporal changes of UVs in China. Specifically, semantic segmentation of multi-temporal remote sensing imagery is first used to map evolving UV boundaries, and then post-demolition land use is classified into six categories based on the "remained-demolished-redeveloped" phase: incomplete demolition, vacant land, construction sites, buildings, green spaces, and others. Four representative cities from China's four economic regions were selected as the study areas, i.e., Guangzhou (East), Zhengzhou (Central), Xi'an (West), and Harbin (Northeast). The results indicate: 1) UV redevelopment processes were frequently prolonged; 2) redevelopment transitions primarily occurred in peripheral areas, whereas urban cores remained relatively stable; and 3) three spatiotemporal transformation pathways, i.e., synchronized redevelopment, delayed redevelopment, and gradual optimization, were revealed. This study highlights the fragmented, complex and nonlinear nature of UV redevelopment, underscoring the need for tiered and context-sensitive planning strategies. By linking spatial dynamics with the context of redevelopment policies, the findings offer valuable empirical insights that support more inclusive, efficient, and sustainable urban renewal, while also contributing to a broader global understanding of informal settlement transformations.
comment: Appendix A. Supplementary data at https://ars.els-cdn.com/content/image/1-s2.0-S2210670725008418-mmc1.docx
Language-Guided Invariance Probing of Vision-Language Models
Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE AAAI-26
Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine the selection capacity and focus on critical temporal features, thereby preserving individual-specific characteristics while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.
comment: Accepted to AAAI-26. Codes: https://github.com/Lighten001/InterMoE
Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
Trust in Vision-Language Models: Insights from a Participatory User Workshop
With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants' engagement to fit the case of user-VLM interaction.
Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline
With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.
FUSE: A Flow-based Mapping Between Shapes
We introduce a novel neural representation for maps between 3D shapes based on flow-matching models, which is computationally efficient and supports cross-representation shape matching without large-scale training or data-driven procedures. 3D shapes are represented as the probability distribution induced by a continuous and invertible flow mapping from a fixed anchor distribution. Given a source and a target shape, we compose the inverse flow (source to anchor) with the forward flow (anchor to target) to continuously map points between the two surfaces. By encoding the shapes with a pointwise task-tailored embedding, this construction provides an invertible and modality-agnostic representation of maps between shapes across point clouds, meshes, signed distance fields (SDFs), and volumetric data. The resulting representation consistently achieves high coverage and accuracy across diverse benchmarks and challenging settings in shape matching. Beyond shape matching, our framework shows promising results in other tasks, including UV mapping and registration of raw point cloud scans of human bodies.
comment: 11 pages, 9 figures
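The source-to-anchor-to-target composition described above can be sketched with toy invertible maps. In the example below, simple affine transforms stand in for the learned flows; `AffineFlow` and `map_points` are hypothetical names introduced here, and nothing in this sketch reflects the actual FUSE architecture or training.

```python
import numpy as np


class AffineFlow:
    """Toy invertible map from anchor space to a shape: x = scale * z + shift."""

    def __init__(self, scale, shift):
        self.scale = float(scale)
        self.shift = np.asarray(shift, dtype=float)

    def forward(self, z):
        # Anchor -> shape.
        return self.scale * z + self.shift

    def inverse(self, x):
        # Shape -> anchor.
        return (x - self.shift) / self.scale


def map_points(source_flow, target_flow, points):
    """Map points on the source shape to the target shape via the anchor."""
    anchor = source_flow.inverse(points)
    return target_flow.forward(anchor)


source = AffineFlow(scale=2.0, shift=[1.0, 0.0, 0.0])
target = AffineFlow(scale=0.5, shift=[-3.0, 0.0, 0.0])

pts_on_source = np.array([[3.0, 2.0, 4.0]])
pts_on_target = map_points(source, target, pts_on_source)
round_trip = map_points(target, source, pts_on_target)
```

Because each flow is invertible, swapping the roles of source and target gives the inverse map for free, which is why `round_trip` recovers the original points.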
VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task
Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE), a novel method to assess LVLMs' hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model's interpretation and the object's presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.
comment: 8 pages
Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source
Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10 m) and 22.9 million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel-2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000 km²) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1 ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5 m and 5.15M at 2.5 m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate-Anything/.
Attention Grounded Enhancement for Visual Document Retrieval
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.
What Color Is It? A Text-Interference Multimodal Hallucination Benchmark
With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the "What Color Is It" dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.
TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing AAAI2026
Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS
comment: Accepted by AAAI2026
Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model
Recently, Denoising Diffusion Codebook Models (DDCM) were proposed. DDCM leverages the Denoising Diffusion Probabilistic Model (DDPM) and replaces the random noise in the backward process with noise sampled from specific sets according to a predefined rule, thereby enabling image compression. However, DDCM cannot be applied to methods other than DDPM. In this paper, we propose the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, Score-Based Models, Consistency Models, and Rectified Flow. We evaluate our method on CIFAR-10 and LSUN Bedroom datasets. Experimental results demonstrate that our approach successfully generalizes DDCM to the aforementioned models and achieves improved performance.
comment: in Chinese language
Semi-Supervised Multi-Task Learning for Interpretable Quality Assessment of Fundus Images
Retinal image quality assessment (RIQA) supports computer-aided diagnosis of eye diseases. However, most tools classify only overall image quality, without indicating acquisition defects to guide recapture. This gap is mainly due to the high cost of detailed annotations. In this paper, we aim to mitigate this limitation by introducing a hybrid semi-supervised learning approach that combines manual labels for overall quality with pseudo-labels of quality details within a multi-task framework. Our objective is to obtain more interpretable RIQA models without requiring extensive manual labeling. Pseudo-labels are generated by a Teacher model trained on a small dataset and then used to fine-tune a pre-trained model in a multi-task setting. Using a ResNet-18 backbone, we show that these weak annotations improve quality assessment over single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, and 0.778 vs. 0.763 on DeepDRiD), matching or surpassing existing methods. The multi-task model achieved performance statistically comparable to the Teacher for most detail prediction tasks (p > 0.05). In a newly annotated EyeQ subset released with this paper, our model performed similarly to experts, suggesting that pseudo-label noise aligns with expert variability. Our main finding is that the proposed semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions (illumination, clarity, contrast). This enhances interpretability at no extra manual labeling cost and offers clinically actionable outputs to guide image recapture.
YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection
This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
comment: 1 figure, 1 table
Computer Vision based group activity detection and action spotting
Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.
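The three similarity measures named above (normalized cross correlation, sum of absolute differences, and dot product) can be written compactly for per-actor feature vectors. The sketch below is illustrative only: the function names and the `pairwise_relation` helper are assumptions, and the abstract does not say how these scores are normalized or combined with positional relations before the graph convolution.

```python
import numpy as np


def ncc(a, b, eps=1e-12):
    """Normalized cross correlation between two feature vectors (range [-1, 1])."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c) + eps))


def sad(a, b):
    """Sum of absolute differences (a distance: 0 means identical)."""
    return float(np.abs(a - b).sum())


def dot(a, b):
    """Plain dot-product similarity."""
    return float(a @ b)


def pairwise_relation(features, measure):
    """Build an (n_actors, n_actors) matrix by applying `measure` to every pair."""
    n = len(features)
    return np.array([[measure(features[i], features[j]) for j in range(n)]
                     for i in range(n)])


actors = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [3.0, 2.0, 1.0]])
relation = pairwise_relation(actors, ncc)
```

Note that NCC and dot product are similarities (larger means more alike) while SAD is a distance, so a SAD-based graph would need its scores inverted or negated before being used as edge weights.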
DriveLiDAR4D: Sequential and Controllable LiDAR Scene Generation for Autonomous Driving AAAI2026
The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations, including the lack of sequential generation capabilities and the inability to produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluated DriveLiDAR4D on the nuScenes and KITTI datasets, where we achieved an FRD score of 743.13 and an FVD score of 16.96 on the nuScenes dataset, surpassing the current state-of-the-art (SOTA) method, UniScene, with a performance boost of 37.2% in FRD and 24.1% in FVD, respectively.
comment: AAAI2026
DAP: A Discrete-token Autoregressive Planner for Autonomous Driving
Gaining sustainable performance improvement with scaling data and model budget remains a pivotal yet unresolved challenge in autonomous driving. While autoregressive models have exhibited promising data-scaling efficiency in planning tasks, predicting ego trajectories alone suffers from sparse supervision and only weakly constrains how scene evolution should shape ego motion. Therefore, we introduce DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories, thereby enforcing comprehensive representation learning and allowing predicted dynamics to directly condition ego motion. In addition, we incorporate reinforcement-learning-based fine-tuning, which preserves supervised behavior cloning priors while injecting reward-guided improvements. Despite a compact 160M parameter budget, DAP achieves state-of-the-art performance on open-loop metrics and delivers competitive closed-loop results on the NAVSIM benchmark. Overall, the fully discrete-token autoregressive formulation operating on both rasterized BEV and ego actions provides a compact yet scalable planning paradigm for autonomous driving.
CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving
End-to-end planning methods are the de facto standard in current autonomous driving systems, yet the robustness of these data-driven approaches suffers from the notorious long-tail problem (i.e., rare but safety-critical failure cases). In this work, we explore whether recent diffusion-based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self-correct such failure cases. We first introduce an agent that simulates the role of a product manager, dubbed PM-Agent, which formulates data requirements to collect data similar to the failure cases. Then, we use a generative model that can simulate both data collection and annotation. However, existing generative models struggle to generate high-fidelity data conditioned on 3D layouts. To address this, we propose DriveSora, which can generate spatiotemporally consistent videos aligned with the 3D annotations requested by PM-Agent. We integrate these components into our self-correcting agentic system, CorrectAD. Importantly, our pipeline is end-to-end and model-agnostic, and can be applied to improve any end-to-end planner. Evaluated on both nuScenes and a more challenging in-house dataset across multiple end-to-end planners, CorrectAD corrects 62.5% and 49.8% of failure cases, reducing collision rates by 39% and 27%, respectively.
SkyReels-Text: Fine-grained Font-Controllable Text Editing for Poster Design
Artistic design such as poster design often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present SkyReels-Text, a novel font-controllable framework for precise poster text editing. Our method enables simultaneous editing of multiple text regions, each rendered in distinct typographic styles, while preserving the visual appearance of non-edited regions. Notably, our model requires neither font labels nor fine-tuning during inference: users can simply provide cropped glyph patches corresponding to their desired typography, even if the font is not included in any standard library. In extensive experiments on multiple datasets, including handwritten text benchmarks, SkyReels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design.
TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing AAAI 2026
Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer's capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% fewer FLOPs and 30% less memory than the second-best MLLM.
comment: AAAI 2026 (Main Technical Track)
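The TabFlash abstract above describes a pruning step that discards background tokens but does not say how background tokens are identified. The following is only a minimal sketch of the general idea, assuming each visual token already carries a hypothetical relevance score; the function name `prune_tokens`, the `keep_ratio` parameter, and the scores are illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def prune_tokens(tokens: Sequence[T], scores: Sequence[float], keep_ratio: float) -> List[T]:
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    `scores` is a hypothetical per-token relevance score (higher = more
    informative); how such a score is obtained is not specified in the abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank indices by score, keep the best n_keep, then restore positional order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    patches = ["header", "blank", "cell_a", "blank2"]
    relevance = [0.8, 0.1, 0.9, 0.2]
    print(prune_tokens(patches, relevance, keep_ratio=0.5))
```

Restoring positional order after ranking keeps the surviving tokens in the sequence order a transformer would expect; whether TabFlash does this is not stated in the abstract.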
Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting
Lightweight building surface models are crucial for digital cities, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website: https://lzh282140127-cell.github.io/SF-Recon-project/
Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models
We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.
comment: 1 figure, 1 table
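The abstract above mentions top-k pooling under video-level supervision without giving details. A common reading in weakly supervised settings is that per-segment anomaly scores are reduced to one video-level score by averaging the k highest values; the sketch below assumes exactly that. The function name `topk_pool`, the example scores, and the choice of k are hypothetical.

```python
from typing import List


def topk_pool(segment_scores: List[float], k: int) -> float:
    """Average the k highest segment scores to obtain one video-level score."""
    if not segment_scores:
        raise ValueError("segment_scores must be non-empty")
    k = max(1, min(k, len(segment_scores)))  # clamp k to a valid range
    top = sorted(segment_scores, reverse=True)[:k]
    return sum(top) / k


if __name__ == "__main__":
    # Hypothetical per-segment scores: an anomaly occupies only a few segments,
    # so pooling over the top-k keeps it from being averaged away.
    normal_video = [0.05, 0.10, 0.08, 0.12]
    anomalous_video = [0.05, 0.90, 0.80, 0.10]
    print(topk_pool(normal_video, k=2), topk_pool(anomalous_video, k=2))
```

Because only a video-level label is available, the pooled score is what would be compared against that label during training.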
Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories (Environmental Perception and Scene Understanding) divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.
SymGS : Leveraging Local Symmetries for 3D Gaussian Splatting Compression
3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, SymGS, which introduces learnable mirrors into the scene, thereby eliminating local and global reflective redundancies for compression. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods (e.g., HAC) to achieve further compression. Compared to HAC, we achieve 1.66× compression across benchmark datasets (up to 3× on large-scale scenes). On average, SymGS enables 108× compression of a 3DGS scene while preserving rendering quality. The project page and supplementary material can be found at symgs.github.io
comment: Project Page: https://symgs.github.io/
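The SymGS abstract above relies on mirror symmetries to drop redundant primitives: a primitive on one side of a mirror can stand in for its reflected counterpart. As a rough illustration of the underlying geometry only, the sketch below reflects a point (for instance a Gaussian center) across a plane. The plane parameterization and the function name `reflect_point` are assumptions; a full 3DGS primitive also carries covariance and appearance parameters, which this sketch does not handle.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def reflect_point(p: Sequence[float], normal: Sequence[float], offset: float) -> Vec3:
    """Reflect p across the plane {x : normal . x + offset = 0}.

    `normal` need not be unit length; it is normalized internally.
    """
    norm = math.sqrt(sum(c * c for c in normal))
    if norm == 0.0:
        raise ValueError("plane normal must be non-zero")
    n = [c / norm for c in normal]
    # Signed distance from p to the plane along the unit normal.
    d = sum(pi * ni for pi, ni in zip(p, n)) + offset / norm
    x, y, z = (pi - 2.0 * d * ni for pi, ni in zip(p, n))
    return (x, y, z)


if __name__ == "__main__":
    center = (1.0, 2.0, 3.0)
    mirrored = reflect_point(center, normal=(1.0, 0.0, 0.0), offset=0.0)
    print(mirrored)
```

Reflecting twice across the same plane returns the original point, which is what allows one stored primitive plus the mirror parameters to reproduce both copies.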
Building Egocentric Procedural AI Assistant: Methods, Benchmarks, and Challenges
Driven by recent advances in vision language models (VLMs) and egocentric perception research, we introduce the concept of an egocentric procedural AI assistant (EgoProceAssist) tailored to support daily procedural tasks step by step from a first-person view. In this work, we start by identifying three core tasks: egocentric procedural error detection, egocentric procedural learning, and egocentric procedural question answering. These tasks define the essential functions of EgoProceAssist within a new taxonomy. Specifically, our work encompasses a comprehensive review of current techniques, relevant datasets, and evaluation metrics across these three core areas. To clarify the gap between the proposed EgoProceAssist and existing VLM-based AI assistants, we introduce novel experiments and provide a comprehensive evaluation of representative VLM-based methods. Based on these findings and our technical analysis, we discuss the challenges ahead and suggest future research directions. Furthermore, an exhaustive list for this study is publicly available in an active repository that continuously collects the latest work: https://github.com/z1oong/Building-Egocentric-Procedural-AI-Assistant
comment: 26 pages, 8 figures, 8 tables, Under peer-review
GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models
Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the capabilities gained through instruction-tuning. Our benchmark demonstrates that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and that instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.
Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention
Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.
comment: 12 pages, 7 figures, This work is supported by National Natural Science Foundation of China (Grant No. 62203291)
Uncovering and Mitigating Transient Blindness in Multimodal Model Editing AAAI'26
Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscuring overfitting. We propose a comprehensive locality evaluation framework covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visuals. Token analysis shows that edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
comment: Accepted at AAAI'26
MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection
Multimodal misinformation floods various social media platforms and continues to evolve in the era of AI-generated content (AIGC). The emerging misinformation, with its low creation cost and high deceptiveness, poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow a uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode leads detectors down a suboptimal path for judgment, struggling to keep pace with fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop a tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage a reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, which encompasses more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the realm of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available on GitHub.
MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI
Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic k-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging a volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT improves PSNR by 15.3% over recent GAN and CNN baselines and by 1.78% over the state of the art, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for reliable neonatal brain assessment.
comment: 5 pages, 4 figures
Hybrid-Domain Adaptative Representation Learning for Gaze Estimation AAAI2026
Appearance-based gaze estimation, aiming to predict accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (HARL for short) framework that exploits multi-source hybrid datasets to learn robust gaze representation. More specifically, we propose to disentangle gaze-relevant representation from low-quality facial images by aligning features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which incurs hardly any additional computational or inference cost. Additionally, we analyze the effect of head-pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head-pose, leading to a dense and robust gaze representation. Extensive experiments on the EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of 5.02°, 3.36°, and 9.26°, respectively, and presents competitive performance in cross-dataset evaluation. The code is available at https://github.com/da60266/HARL.
comment: AAAI2026
3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale
Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further exploits Monte Carlo tree search to dynamically calibrate HAF attention weights via a hybrid reward signal and further enhances the alignment between textual descriptions and local 3D geometry. During inference, 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) to leverage efficient hierarchical searching in large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and train 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs, to provide sufficient fine-grained cross-modal annotations. Extensive and comprehensive experiments demonstrate the superior performance of 3DAlign-DAER on diverse benchmarks. We will release our code, models, and datasets.
End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer
Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet
PIGEON: VLM-Driven Object Navigation via Points of Interest Selection
Navigating to a specified object in an unknown environment is a fundamental yet challenging capability of embodied intelligence. However, current methods struggle to balance decision frequency with intelligence, resulting in decisions lacking foresight or discontinuous actions. In this work, we propose PIGEON: Point of Interest Guided Exploration for Object Navigation with VLM, maintaining a lightweight and semantically aligned snapshot memory during exploration as semantic input for the exploration strategy. We use a large Visual-Language Model (VLM), named PIGEON-VL, to select Points of Interest (PoI) formed during exploration and then employ a lower-level planner for action output, increasing the decision frequency. Additionally, this PoI-based decision-making enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments on classic object navigation benchmarks demonstrate that our zero-shot transfer method achieves state-of-the-art performance, while RLVR further enhances the model's semantic guidance capabilities, enabling deep reasoning during real-time navigation.
RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection AAAI 2026
Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on WVAD benchmarks validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.
comment: Accepted to AAAI 2026
Self-Supervised Ultrasound Screen Detection
Ultrasound (US) machines display images on a built-in monitor, but routine transfer to hospital systems relies on DICOM. We propose a self-supervised pipeline to extract the US image from a photograph of the monitor. This removes the DICOM bottleneck and enables rapid testing and prototyping of new algorithms. In a proof-of-concept study, the rectified images retained enough visual fidelity to classify cardiac views with a balanced accuracy of 0.79 with respect to the native DICOMs.
comment: Submitted to ISBI 2026
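The result above is reported as balanced accuracy, which is the mean of per-class recall and therefore does not reward a classifier for favoring the most frequent class. A minimal sketch of that standard definition follows; the cardiac view labels in the example are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes that occur in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # number of correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Hypothetical, imbalanced example with two view labels.
    truth = ["A4C", "A4C", "A4C", "A4C", "PLAX"]
    preds = ["A4C", "A4C", "A4C", "PLAX", "PLAX"]
    print(balanced_accuracy(truth, preds))
```

In this toy example plain accuracy is 4/5, while balanced accuracy averages the two per-class recalls (3/4 and 1/1).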
Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection AAAI 2026
Monocular 3D object detection is a cost-effective solution for applications like autonomous driving and robotics, but remains fundamentally ill-posed due to inherently ambiguous depth cues. Recent DETR-based methods attempt to mitigate this through global attention and auxiliary depth prediction, yet they still struggle with inaccurate depth estimates. Moreover, these methods often overlook instance-level detection difficulty, such as occlusion, distance, and truncation, leading to suboptimal detection performance. We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Specifically, MonoDLGD applies stronger perturbations to easier instances and weaker ones to harder cases, and then reconstructs them to effectively provide explicit geometric supervision. By jointly optimizing label reconstruction and 3D object detection, MonoDLGD encourages geometry-aware representation learning and improves robustness to varying levels of object complexity. Extensive experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.
comment: AAAI 2026 accepted
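The MonoDLGD abstract above states that easier instances receive stronger label perturbations and harder ones weaker perturbations, but gives no formula. The sketch below illustrates one possible reading only: a linear mapping from a hypothetical difficulty value in [0, 1] to a multiplicative noise scale. The function names, the linear schedule, and the scale bounds are all assumptions rather than the paper's design.

```python
import random
from typing import List, Sequence


def perturbation_scale(difficulty: float, max_scale: float = 0.2, min_scale: float = 0.02) -> float:
    """Map difficulty in [0, 1] to a noise scale: easy -> max_scale, hard -> min_scale."""
    d = min(1.0, max(0.0, difficulty))  # clamp to the assumed [0, 1] range
    return max_scale - d * (max_scale - min_scale)


def perturb_label(label: Sequence[float], difficulty: float, rng: random.Random) -> List[float]:
    """Jitter each label value by a relative amount bounded by the difficulty-dependent scale."""
    s = perturbation_scale(difficulty)
    return [v * (1.0 + rng.uniform(-s, s)) for v in label]


if __name__ == "__main__":
    rng = random.Random(0)
    box = [1.6, 1.5, 3.9, 20.0]  # hypothetical (height, width, length, depth) label
    print(perturb_label(box, difficulty=0.1, rng=rng))  # easy instance: larger jitter
    print(perturb_label(box, difficulty=0.9, rng=rng))  # hard instance: smaller jitter
```

A denoising head would then be trained to recover the clean label from the perturbed one; that reconstruction step is not sketched here.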
Birth of a Painting: Differentiable Brushstroke Reconstruction
Painting embodies a unique form of visual storytelling, where the creation process is as significant as the final artwork. Although recent advances in generative models have enabled visually compelling painting synthesis, most existing methods focus solely on final image generation or patch-based process simulation, lacking explicit stroke structure and failing to produce smooth, realistic shading. In this work, we present a differentiable stroke reconstruction framework that unifies painting, stylized texturing, and smudging to faithfully reproduce the human painting-smudging loop. Given an input image, our framework first optimizes single- and dual-color Bezier strokes through a parallel differentiable paint renderer, followed by a style generation module that synthesizes geometry-conditioned textures across diverse painting styles. We further introduce a differentiable smudge operator to enable natural color blending and shading. Coupled with a coarse-to-fine optimization strategy, our method jointly optimizes stroke geometry, color, and texture under geometric and semantic guidance. Extensive experiments on oil, watercolor, ink, and digital paintings demonstrate that our approach produces realistic and expressive stroke reconstructions, smooth tonal transitions, and richly stylized appearances, offering a unified model for expressive digital painting creation. See our project page for more demos: https://yingjiang96.github.io/DiffPaintWebsite/.
comment: 13 pages
Video Spatial Reasoning with Object-Centric 3D Rollout
Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning, i.e., the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes, remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR's superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework AAAI 2026
Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a decoder with a few billion parameters can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoders in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals' effectiveness, surpassing previous state-of-the-art by up to +8.21% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.
comment: To appear at AAAI 2026
GenTract: Generative Global Tractography
Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract's performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1x higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.
HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution
Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.
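The invertibility that the HDW-SR abstract relies on can be illustrated with a minimal sketch. It assumes a one-level, one-dimensional orthonormal Haar transform; the abstract does not name the wavelet family, and the function names `haar_forward` and `haar_inverse` are hypothetical. The point is only that splitting a signal into low- and high-frequency subbands loses no information.

```python
import math


def haar_forward(signal):
    """One level of the orthonormal 1D Haar transform (assumed wavelet).

    Returns (low, high): the low-frequency averages and the
    high-frequency differences, each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Rebuild the original signal from its two subbands."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / s)  # first sample of the pair
        out.append((a - d) / s)  # second sample of the pair
    return out


if __name__ == "__main__":
    x = [4.0, 2.0, 5.0, 5.0]
    low, high = haar_forward(x)
    print(haar_inverse(low, high))
```

A flat pair such as (5.0, 5.0) produces a zero high-frequency coefficient, which is why the high-frequency subbands isolate edges and fine detail.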
THIR: Topological Histopathological Image Retrieval
According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis, specifically Betti numbers derived from persistent homology, to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. Similarity retrieval is then performed by computing the distances between these topological descriptors, efficiently returning the top-K most relevant matches. Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state-of-the-art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training-free solution for clinical image retrieval.
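The retrieval step described for THIR, ranking images by the distance between their descriptors and returning the top-K, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean metric, the toy three-element vectors, and the names `euclidean` and `top_k` are assumptions, and the descriptors are taken as already computed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, database, k):
    """Return the names of the k database entries closest to the query.

    database maps an image name to its precomputed descriptor vector.
    """
    ranked = sorted(database.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical descriptors; real ones would come from persistent homology.
    db = {"a": [1.0, 0.0, 2.0], "b": [5.0, 5.0, 5.0], "c": [1.0, 1.0, 2.0]}
    print(top_k([1.0, 0.0, 2.0], db, k=2))
```

Because nothing is trained, the only cost at query time is the distance computation, which is consistent with the CPU-only setting the abstract reports.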
SOMA: Feature Gradient Enhanced Affine-Flow Matching for SAR-Optical Registration
Achieving pixel-level registration between SAR and optical images remains a challenging task due to their fundamentally different imaging mechanisms and visual characteristics. Although deep learning has achieved great success in many cross-modal tasks, its performance on SAR-Optical registration tasks is still unsatisfactory. Gradient-based information has traditionally played a crucial role in handcrafted descriptors by highlighting structural differences. However, such gradient cues have not been effectively leveraged in deep learning frameworks for SAR-Optical image matching. To address this gap, we propose SOMA, a dense registration framework that integrates structural gradient priors into deep features and refines alignment through a hybrid matching strategy. Specifically, we introduce the Feature Gradient Enhancer (FGE), which embeds multi-scale, multi-directional gradient filters into the feature space using attention and reconstruction mechanisms to boost feature distinctiveness. Furthermore, we propose the Global-Local Affine-Flow Matcher (GLAM), which combines affine transformation and flow-based refinement within a coarse-to-fine architecture to ensure both structural consistency and local accuracy. Experimental results demonstrate that SOMA significantly improves registration precision, increasing the CMR@1px by 12.29% on the SEN1-2 dataset and 18.50% on the GFGE_SO dataset. In addition, SOMA exhibits strong robustness and generalizes well across diverse scenes and resolutions.
Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification
Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) they lack genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.
Automated Road Distress Detection Using Vision Transformers and Generative Adversarial Networks
The American Society of Civil Engineers has graded America's infrastructure condition as a C, with the road system receiving a dismal D. Roads are vital to regional economic viability, yet their management, maintenance, and repair processes remain inefficient, relying on outdated manual or laser-based inspection methods that are both costly and time-consuming. With the increasing availability of real-time visual data from autonomous vehicles, there is an opportunity to apply computer vision (CV) methods for advanced road monitoring, providing insights to guide infrastructure rehabilitation efforts. This project explores the use of state-of-the-art CV techniques for road distress segmentation. It begins by evaluating synthetic data generated with Generative Adversarial Networks (GANs) to assess its usefulness for model training. The study then applies Convolutional Neural Networks (CNNs) for road distress segmentation and subsequently examines the transformer-based model MaskFormer. Results show that GAN-generated data improves model performance and that MaskFormer outperforms the CNN model in two metrics: mAP50 and IoU.
WinMamba: Multi-Scale Shifted Windows in State Space Model for 3D Object Detection
3D object detection is critical for autonomous driving, yet it remains fundamentally challenging to simultaneously maximize computational efficiency and capture long-range spatial dependencies. We observed that Mamba-based models, with their linear state-space design, capture long-range dependencies at lower cost, offering a promising balance between efficiency and accuracy. However, existing methods rely on axis-aligned scanning within a fixed window, inevitably discarding spatial information. To address this problem, we propose WinMamba, a novel Mamba-based 3D feature-encoding backbone composed of stacked WinMamba blocks. To enhance the backbone with robust multi-scale representation, the WinMamba block incorporates a window-scale-adaptive module that compensates voxel features across varying resolutions during sampling. Meanwhile, to obtain rich contextual cues within the linear state space, we equip the WinMamba layer with a learnable positional encoding and a window-shift strategy. Extensive experiments on the KITTI and Waymo datasets demonstrate that WinMamba significantly outperforms the baseline. Ablation studies further validate the individual contributions of the WSF and AWF modules in improving detection accuracy. The code will be made publicly available.
comment: 9 pages, 3 figures
MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation CVPR 2026
As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.
comment: CVPR 2026 Under Review
Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.
MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications
Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces a range of text-based and image-based tasks that address practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas of current state-of-the-art multimodal LLMs, thus guiding further development and research.
VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language
Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate in commercial models.
Region-Point Joint Representation for Effective Trajectory Similarity Learning AAAI2026
Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose RePo, a novel method that jointly encodes Region-wise and Point-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences; spatial context is captured by structural features, and semantic context is enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experimental results show that RePo achieves an average accuracy improvement of 22.2% over SOTA baselines across all evaluation metrics.
comment: This paper is accepted by AAAI2026
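The first region-wise step in the RePo abstract, mapping GPS trajectories to grid sequences, can be sketched as below. This is a hedged illustration only: the uniform square grid, the row-major cell numbering, the merging of consecutive duplicate cells, and the name `to_grid_sequence` are all assumptions not stated in the abstract.

```python
def to_grid_sequence(points, origin, cell_size, n_cols):
    """Map (lon, lat) points to a sequence of grid-cell ids.

    Cells are numbered row-major on a uniform grid anchored at `origin`.
    Consecutive points that fall in the same cell are merged into one
    entry (an assumption made here to keep the sequence compact).
    """
    lon0, lat0 = origin
    sequence = []
    for lon, lat in points:
        col = int((lon - lon0) // cell_size)
        row = int((lat - lat0) // cell_size)
        cell = row * n_cols + col
        if not sequence or sequence[-1] != cell:
            sequence.append(cell)
    return sequence


if __name__ == "__main__":
    trajectory = [(0.5, 0.5), (0.7, 0.6), (1.2, 0.5), (1.5, 1.5)]
    print(to_grid_sequence(trajectory, origin=(0.0, 0.0), cell_size=1.0, n_cols=10))
```

The dense point sequence is kept separately for the point-wise branch; only the region-wise branch sees the coarser cell ids.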
CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model
Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations and struggle to capture fine-grained details in close-up scenarios, where input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, clearly validating the effectiveness of our design.
comment: Project Link: https://zyqz97.github.io/CloseUpShot/
A Lightweight 3D Anomaly Detection Method with Rotationally Invariant Features
3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. Firstly, to remove the adverse effect of variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7%, and also achieves the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.
comment: Submitted to Elsevier
Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining
Rain significantly degrades the performance of computer vision systems, particularly in applications like autonomous driving and video surveillance. While existing deraining methods have made considerable progress, they often struggle with fidelity of semantic and spatial details. To address these limitations, we propose the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. This novel architecture synergistically integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. To alleviate potential conflicts between heterogeneous priors, we devise a progressive Priors Fusion Injection (PFI) that strategically injects complementary cues at different decoder levels. Meanwhile, we equip the backbone network with an elaborate Hierarchical Mamba Module (HMM) to facilitate robust feature representation, featuring a Fourier-enhanced dual-path design that concurrently addresses global context modeling and local detail recovery. Comprehensive experiments demonstrate MPHM's state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset while delivering superior generalization on real-world rainy scenarios.
Learning Implicit Neural Degradation Representation for Unpaired Image Dehazing
Image dehazing is an important task in the field of computer vision, aiming at restoring clear and detail-rich visual content from haze-affected images. However, when dealing with complex scenes, existing methods often struggle to strike a balance between fine-grained feature representation of inhomogeneous haze distribution and global consistency modeling. To address this and to better learn the common degradation representation of haze under spatial variations, we propose an unsupervised dehazing method based on implicit neural degradation representation. First, inspired by the Kolmogorov-Arnold representation theorem, we propose a mechanism that combines channel-independent and channel-dependent components, which efficiently enhances the ability to learn nonlinear dependencies and in turn achieves good visual perception in complex scenes. Moreover, we design an implicit neural representation to model haze degradation as a continuous function to eliminate redundant information and the dependence on explicit feature extraction and physical models. To further learn the implicit representation of haze features, we also design a dense residual enhancement module that eliminates redundant information, achieving high-quality image restoration. Experimental results show that our method achieves competitive dehazing performance on various public and real-world datasets. This project code will be available at https://github.com/Fan-pixel/NeDR-Dehaze.
DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection
The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
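The projection step named in the DGS-Net abstract, moving a task gradient onto the orthogonal complement of a harmful direction, can be sketched for the simplest case. This is an illustration under stated assumptions: a single harmful direction rather than a subspace, plain Python lists in place of tensors, and the hypothetical names `dot` and `project_out`. The alignment with beneficial directions distilled from the frozen encoder is not shown.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, harmful):
    """Remove from `grad` its component along `harmful`.

    Computes g - (<g, h> / <h, h>) h, so the result is orthogonal to h.
    A zero harmful direction leaves the gradient unchanged.
    """
    hh = dot(harmful, harmful)
    if hh == 0.0:
        return list(grad)
    scale = dot(grad, harmful) / hh
    return [g - scale * h for g, h in zip(grad, harmful)]


if __name__ == "__main__":
    g = [3.0, 4.0]
    h = [1.0, 0.0]
    g_safe = project_out(g, h)
    print(g_safe, dot(g_safe, h))
```

After the projection, a descent step along the result cannot move the parameters along the harmful direction, which is the intuition behind gradient surgery.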
Low-Level Dataset Distillation for Medical Image Enhancement
Medical image enhancement is clinically valuable, but existing methods require large-scale datasets to learn complex pixel-level mappings. However, the substantial training and storage costs associated with these datasets hinder their practical deployment. While dataset distillation (DD) can alleviate these burdens, existing methods mainly target high-level tasks, where multiple samples share the same label. This many-to-one mapping allows distilled data to capture shared semantics and achieve information compression. In contrast, low-level tasks involve a many-to-many mapping that requires pixel-level fidelity, making low-level DD an underdetermined problem, as a small distilled dataset cannot fully constrain the dense pixel-level mappings. To address this, we propose the first low-level DD method for medical image enhancement. We first leverage anatomical similarities across patients to construct the shared anatomical prior based on a representative patient, which serves as the initialization for the distilled data of different patients. This prior is then personalized for each patient using a Structure-Preserving Personalized Generation (SPG) module, which integrates patient-specific anatomical information into the distilled dataset while preserving pixel-level fidelity. For different low-level tasks, the distilled data is used to construct task-specific high- and low-quality training pairs. Patient-specific knowledge is injected into the distilled data by aligning the gradients computed from networks trained on the distilled pairs with those from the corresponding patient's raw data. Notably, downstream users cannot access raw patient data. Instead, only a distilled dataset containing abstract training information is shared, which excludes patient-specific details and thus preserves privacy.
PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking AAAI 2026
Multi-object tracking (MOT) predominantly follows the tracking-by-detection paradigm, where Kalman filters serve as the standard motion predictor due to computational efficiency but inherently fail on non-linear motion patterns. Conversely, recent data-driven motion predictors capture complex non-linear dynamics but suffer from limited domain generalization and computational overhead. Through extensive analysis, we reveal that even in datasets dominated by non-linear motion, the Kalman filter outperforms data-driven predictors in up to 34% of cases, demonstrating that real-world tracking scenarios inherently involve both linear and non-linear patterns. To leverage this complementarity, we propose PlugTrack, a novel framework that adaptively fuses Kalman filter and data-driven motion predictors through multi-perceptive motion understanding. Our approach employs multi-perceptive motion analysis to generate adaptive blending factors. PlugTrack achieves significant performance gains on MOT17/MOT20 and state-of-the-art on DanceTrack without modifying existing motion predictors. To the best of our knowledge, PlugTrack is the first framework to bridge classical and modern motion prediction paradigms through adaptive fusion in MOT.
comment: AAAI 2026. Code: https://github.com/VisualScienceLab-KHU/PlugTrack
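The adaptive blending mentioned in the PlugTrack abstract can be pictured with a minimal sketch. It assumes the blending factor acts as a convex combination of the two predicted boxes; the abstract does not give the fusion formula, and the name `blend_predictions` and the box layout are hypothetical. How the factor itself is produced by the motion analysis is not shown.

```python
def blend_predictions(kalman_box, learned_box, alpha):
    """Blend two predicted boxes coordinate by coordinate.

    alpha = 1.0 trusts the Kalman filter fully, alpha = 0.0 trusts the
    data-driven predictor fully (convex combination, an assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * k + (1.0 - alpha) * d for k, d in zip(kalman_box, learned_box)]


if __name__ == "__main__":
    # Hypothetical boxes as [x, y, width, height].
    kf = [10.0, 20.0, 4.0, 8.0]
    dd = [14.0, 24.0, 4.0, 8.0]
    print(blend_predictions(kf, dd, alpha=0.5))
```

Since both predictors are used as black boxes, a fusion of this shape leaves the existing motion models unmodified, in line with the plug-in design the abstract describes.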
CapeNext: Rethinking and refining dynamic support information for category-agnostic pose estimation
Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint descriptions as semantic priors for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by disentangling the dependency of support images, our critical analysis reveals two inherent limitations of static joint embedding: (1) polysemy-induced cross-category ambiguity during the matching process (e.g., the concept "leg" exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual description and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.
MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images WACV2026
Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.
comment: WACV2026 Accepted
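The two-step TCP inference described in the MergeSlide abstract (pick the most relevant task first, then pick a class within that task) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes slide features and prompts are plain embedding vectors compared by cosine similarity, and the names `tcp_predict`, `task_prompts`, and `class_prompts` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tcp_predict(feature, task_prompts, class_prompts):
    """Two-step prediction: task first, then class within that task.

    task_prompts:  {task_name: embedding}                (assumed layout)
    class_prompts: {task_name: {class_name: embedding}}  (assumed layout)
    """
    # Step 1: choose the task whose prompt is most similar to the feature.
    task = max(task_prompts, key=lambda t: cosine(feature, task_prompts[t]))
    # Step 2: choose the best class among that task's class-aware prompts only.
    classes = class_prompts[task]
    label = max(classes, key=lambda c: cosine(feature, classes[c]))
    return task, label

# Toy embeddings, made up for illustration.
task_prompts = {"task_a": [1.0, 0.0, 0.0], "task_b": [0.0, 1.0, 0.0]}
class_prompts = {
    "task_a": {"class_0": [1.0, 0.0, 1.0], "class_1": [1.0, 0.0, -1.0]},
    "task_b": {"class_2": [0.0, 1.0, 1.0], "class_3": [0.0, 1.0, -1.0]},
}
result = tcp_predict([0.1, 0.9, 0.2], task_prompts, class_prompts)
```

Restricting step 2 to one task's prompts means classes from unrelated tasks never compete directly, which is one plausible reason for routing through a task-level decision when task identity is unknown.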
MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements
Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
comment: 26 pages, 7 figures. Code available at https://github.com/samsungsds-research-papers/mega-gui
Real-time prediction of breast cancer sites using deformation-aware graph neural network
Early diagnosis of breast cancer is crucial, enabling the establishment of appropriate treatment plans and markedly enhancing patient prognosis. While direct magnetic resonance imaging-guided biopsy demonstrates promising performance in detecting cancer lesions, its practical application is limited by prolonged procedure times and high costs. To overcome these issues, an indirect MRI-guided biopsy that allows the procedure to be performed outside of the MRI room has been proposed, but it still faces challenges in creating an accurate real-time deformable breast model. In our study, we tackled this issue by developing a graph neural network (GNN)-based model capable of accurately predicting deformed breast cancer sites in real time during biopsy procedures. An individual-specific finite element (FE) model was developed by incorporating magnetic resonance (MR) image-derived structural information of the breast and tumor to simulate deformation behaviors. A GNN model was then employed, designed to process surface displacement and distance-based graph data, enabling accurate prediction of overall tissue displacement, including the deformation of the tumor region. The model was validated using phantom and real patient datasets, achieving an accuracy within 0.2 millimeters (mm) for cancer node displacement (RMSE) and a dice similarity coefficient (DSC) of 0.977 for spatial overlap with actual cancerous regions. Additionally, the model enabled real-time inference and achieved a speed-up of over 4,000 times in computational cost compared to conventional FE simulations. The proposed deformation-aware GNN model offers a promising solution for real-time tumor displacement prediction in breast biopsy, with high accuracy and real-time capability. Its integration with clinical procedures could significantly enhance the precision and efficiency of breast cancer diagnosis.
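The two metrics reported in the abstract above, RMSE of node displacement and the Dice similarity coefficient (DSC), have standard definitions that can be sketched as below. The data layout is an assumption made for illustration: nodes are 3D coordinate tuples and regions are sets of voxel or node identifiers. The paper's exact evaluation protocol is not given in the abstract.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square Euclidean error over paired 3D points."""
    squared = [math.dist(p, a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def dice(region_a, region_b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(region_a), set(region_b)
    return 2 * len(a & b) / (len(a) + len(b))

# Toy values, made up for illustration.
example_rmse = rmse([(0.0, 0.0, 0.0)], [(0.0, 3.0, 4.0)])
example_dice = dice({1, 2, 3}, {2, 3, 4})
```

A DSC of 1.0 means the predicted and actual regions coincide exactly, and 0.0 means they do not overlap at all.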
Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations
Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods. We address this gap by introducing the Reference-Frame $\times$ Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes: (1) Reference-Frame, distinguishing between pointwise ("Why this prediction?") and contrastive ("Why this and not an alternative?") explanations; and (2) Granularity, ranging from fine-grained class-level (e.g., "Why Husky?") to coarse-grained group-level (e.g., "Why Dog?") interpretations. Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets. By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.
Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving
Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.
comment: 11 pages, 8 figures
RobustGait: Robustness Analysis for Appearance Based Gait Recognition WACV'26
Appearance-based gait recognition has achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. Our evaluation yields several insights. First, applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness depends on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.
comment: IEEE WACV'26 Main Conference
FGNet: Leveraging Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation
Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on these challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.
Monocular 3D Lane Detection via Structure Uncertainty-Aware Network with Curve-Point Queries
Monocular 3D lane detection is challenged by aleatoric uncertainty arising from inherent observation noise. Existing methods rely on simplified geometric assumptions, such as independent point predictions or global planar modeling, failing to capture structural variations and aleatoric uncertainty in real-world scenarios. In this paper, we propose MonoUnc, a bird's-eye view (BEV)-free 3D lane detector that explicitly models aleatoric uncertainty informed by local lane structures. Specifically, 3D lanes are projected onto the front-view (FV) space and approximated by parametric curves. Guided by curve predictions, curve-point query embeddings are dynamically generated for lane point predictions in 3D space. Each segment formed by two adjacent points is modeled as a 3D Gaussian, parameterized by the local structure and uncertainty estimations. Accordingly, a novel 3D Gaussian matching loss is designed to constrain these parameters jointly. Experiments on the ONCE-3DLanes and OpenLane datasets demonstrate that MonoUnc outperforms previous state-of-the-art (SoTA) methods across all benchmarks under stricter evaluation criteria. Additionally, we propose two comprehensive evaluation metrics for ONCE-3DLanes, calculating the average and maximum bidirectional Chamfer distances to quantify global and local errors. Codes are released at https://github.com/lrx02/MonoUnc.
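The MonoUnc abstract mentions average and maximum bidirectional Chamfer distances as global and local error measures. The sketch below shows one plausible reading of those two quantities. It is an assumption, not the paper's definition: the average is taken as the mean of the two directional mean nearest-neighbour distances, and the maximum as the largest nearest-neighbour distance in either direction. The function names are hypothetical.

```python
import math

def nearest_distances(source, target):
    """For each source point, the distance to its closest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]

def bidirectional_chamfer(pred, gt):
    """Return (average, maximum) bidirectional Chamfer distance.

    Assumed definitions: the average is the mean of the two directional
    means; the maximum is the worst nearest-neighbour distance in either
    direction.
    """
    pred_to_gt = nearest_distances(pred, gt)
    gt_to_pred = nearest_distances(gt, pred)
    average = 0.5 * (sum(pred_to_gt) / len(pred_to_gt)
                     + sum(gt_to_pred) / len(gt_to_pred))
    maximum = max(max(pred_to_gt), max(gt_to_pred))
    return average, maximum

# Toy lanes: the second ground-truth point sits 3 units away in z.
pred_lane = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_lane = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0)]
avg_cd, max_cd = bidirectional_chamfer(pred_lane, gt_lane)
```

In this toy case the average (1.0) smooths over the single badly placed point while the maximum (3.0) exposes it, which matches the global versus local distinction drawn in the abstract.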
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
comment: Preprint. Work in progress
iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos 3DV 2026
Articulated objects are prevalent in daily life. Interactable digital twins of such objects have numerous applications in embodied AI and robotics. Unfortunately, current methods to digitize articulated real-world objects require carefully captured data, preventing practical, scalable, and generalizable acquisition. We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to obtain at scale using smartphones. However, this setting is challenging due to simultaneous object and camera motion and significant occlusions as the person interacts with the object. To tackle these challenges, we introduce iTACO: a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a dataset of 784 videos containing 284 objects across 11 categories that is 20$\times$ larger than available in prior work. We then compare our approach with existing methods that also take video as input. Our experiments show that iTACO outperforms existing articulated object digital twin methods on both synthetic and real casually captured RGBD videos.
comment: 3DV 2026 camera-ready version. Project website can be found at https://3dlg-hcvc.github.io/video2articulation/
Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-mamba" variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256$\times$256) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
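The difference between resetting each block's state to zero and handing the terminal state to the next block, as described in the Arcee abstract, can be illustrated with a scalar linear recurrence. This is a toy sketch under stated assumptions: the recurrence h_t = a * h_{t-1} + b * x_t stands in for the selective scan, and the handoff is the identity map rather than the differentiable boundary map the abstract mentions. `scan_block` and `run_blocks` are hypothetical names.

```python
def scan_block(xs, a, b, h0=0.0):
    """Causal scan h_t = a * h_{t-1} + b * x_t; returns outputs and final state."""
    h = h0
    outputs = []
    for x in xs:
        h = a * h + b * x
        outputs.append(h)
    return outputs, h

def run_blocks(xs, params, chain_state):
    """Stack blocks; optionally reuse each terminal state as the next h0."""
    terminal = 0.0
    for a, b in params:
        # Conventional formulation: every block starts from zero.
        # Chained formulation: start from the previous block's terminal state.
        h0 = terminal if chain_state else 0.0
        xs, terminal = scan_block(xs, a, b, h0=h0)
    return xs

params = [(0.5, 1.0), (0.5, 1.0)]  # toy (a, b) per block
reset_out = run_blocks([1.0, 0.0], params, chain_state=False)
chained_out = run_blocks([1.0, 0.0], params, chain_state=True)
```

The chained variant adds no parameters and only passes one extra value per block, consistent with the abstract's description of the approach as parameter-free with negligible cost.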
Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers
In this work, we propose Fast Equivariant Imaging (FEI), a novel unsupervised learning framework to rapidly and efficiently train deep imaging networks without ground-truth data. By reformulating the Equivariant Imaging (EI) optimization problem via the method of Lagrange multipliers and utilizing plug-and-play denoisers, this unsupervised scheme shows superior efficiency and performance compared to the vanilla EI paradigm. In particular, our FEI schemes achieve an order-of-magnitude (10x) acceleration over standard EI when training U-Net for X-ray CT reconstruction and image inpainting, with improved generalization performance.
Toward A Better Understanding of Monocular Depth Evaluation
Monocular depth estimation is an important task with rapid progress, but how to evaluate it is not fully resolved, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not fully understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making smooth surfaces bumpy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.
Physics informed Transformer-VAE for biophysical parameter estimation: PROSAIL model inversion in Sentinel-2 imagery
Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supervised training, our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of radiative transfer models (RTMs), opening new prospects for large-scale, physically constrained remote sensing of vegetation traits.
comment: 10 pages, 6 figures, uses fancyhdr.sty
Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks, which limits their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
Enhancing Monocular Height Estimation via Weak Supervision from Imperfect Labels
Monocular height estimation provides an efficient and cost-effective solution for three-dimensional perception in remote sensing. However, training deep neural networks for this task demands abundant annotated data, while high-quality labels are scarce and typically available only in developed regions, which limits model generalization and constrains applicability at large scales. This work addresses the problem by leveraging imperfect labels from out-of-domain regions, which may be incomplete, inexact, or inaccurate compared to high-quality annotations, to train pixel-wise height estimation networks. We introduce an ensemble-based pipeline compatible with any monocular height estimation network, featuring architecture and loss functions specifically designed to leverage information in noisy labels through weak supervision, utilizing balanced soft losses and ordinal constraints. Experiments on two datasets -- DFC23 (0.5--1 m) and GBH (3 m) -- show that our method achieves more consistent cross-domain performance, reducing average RMSE by up to 22.94% on DFC23 and 18.62% on GBH compared with baselines. Ablation studies confirm the contribution of each design component.
Generalizable 7T T1-map Synthesis from 1.5T and 3T T1 MRI with an Efficient Transformer Model
Purpose: Ultra-high-field 7T MRI offers improved resolution and contrast over standard clinical field strengths (1.5T, 3T). However, 7T scanners are costly, scarce, and introduce additional challenges such as susceptibility artifacts. We propose an efficient transformer-based model (7T-Restormer) to synthesize 7T-quality T1-maps from routine 1.5T or 3T T1-weighted (T1W) images. Methods: Our model was validated on 35 1.5T and 108 3T T1W MRI scans paired with corresponding 7T T1 maps of patients with confirmed MS. A total of 141 patient cases (32,128 slices) were randomly divided into 105 (25; 80) training cases (19,204 slices), 19 (5; 14) validation cases (3,476 slices), and 17 (5; 14) test cases (3,145 slices), where (X; Y) denotes the number of patients with 1.5T and 3T T1W scans, respectively. The synthetic 7T T1 maps were compared against the ResViT and ResShift models. Results: The 7T-Restormer model achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and 25.9 +/- 4.9 dB, and 0.866 +/- 0.077 for 3T inputs, respectively. Using 10.5M parameters, our model reduced NMSE by 64% relative to the 56.7M-parameter ResShift (0.019 vs 0.052, p < .001) and by 41% relative to the 70.4M-parameter ResViT (0.019 vs 0.032, p < .001) at 1.5T, with similar advantages at 3T (0.021 vs 0.060 and 0.033; p < .001). Training with a mixed 1.5T + 3T corpus was superior to single-field strategies. Restricting training to 1.5T data increased the 1.5T NMSE from 0.019 to 0.021 (p = 1.1E-3), while training solely on 3T resulted in lower performance on input 1.5T T1W MRI. Conclusion: We propose a novel method for predicting quantitative 7T MP2RAGE maps from 1.5T and 3T T1W scans with higher quality than existing state-of-the-art methods. Our approach makes the benefits of 7T MRI more accessible to standard clinical workflows.
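The relative NMSE reductions quoted in the abstract above can be approximately recomputed from the NMSE values it reports. The snippet is only a sanity check on the arithmetic: because the inputs are the rounded values printed in the abstract, the results can differ from the reported 64% and 41% by a fraction of a point. `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# NMSE values at 1.5T as printed in the abstract (rounded to 3 decimals).
vs_resshift = relative_reduction(0.019, 0.052)  # about 63.5 %
vs_resvit = relative_reduction(0.019, 0.032)    # about 40.6 %
```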
Bench2FreeAD: A Benchmark for Vision-based End-to-end Navigation in Unstructured Robotic Environments
Most current end-to-end (E2E) autonomous driving algorithms are built on standard vehicles in structured transportation scenarios, and robot navigation in unstructured scenarios such as auxiliary roads, campus roads, and indoor settings remains largely unexplored. This paper investigates E2E robot navigation in unstructured road environments. First, we introduce two data collection pipelines, one for real-world robot data and another for synthetic data generated using the Isaac Sim simulator, which together produce an unstructured robotics navigation dataset, the FreeWorld Dataset. Second, we fine-tune an efficient E2E autonomous driving model, VAD, on our datasets to validate the performance and adaptability of E2E autonomous driving models in these environments. Results demonstrate that fine-tuning on our datasets significantly enhances the navigation potential of E2E autonomous driving models in unstructured robotic environments. Thus, this paper presents the first dataset targeting E2E robot navigation tasks in unstructured scenarios, and provides a benchmark based on vision-based E2E autonomous driving algorithms to facilitate the development of E2E navigation technology for logistics and service robots. The project is available on GitHub.
comment: 7 pages, 9 figures
S4M: 4-points to Segment Anything
Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed. Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary "Canvas" pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning. Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation. Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging.
ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin. The source code is available at https://github.com/ds-kiel/ThinkingViT.
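The confidence-gated loop described in the ThinkingViT abstract (run a small subnetwork, stop if the prediction is confident enough, otherwise rerun with a larger one) can be sketched as follows. This is a minimal sketch under stated assumptions, not the released implementation: each stage is a hypothetical callable that takes the input plus the previous stage's embedding (standing in for Token Recycling) and returns logits and a new embedding, and confidence is taken to be the maximum softmax probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def progressive_inference(stages, x, threshold=0.9):
    """Run stages from smallest to largest until confidence >= threshold.

    `stages` must be a non-empty list of callables (x, prev_embedding) ->
    (logits, embedding). Returns (predicted_class, confidence, stages_used).
    """
    prev_embedding = None
    for index, stage in enumerate(stages):
        logits, prev_embedding = stage(x, prev_embedding)
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: terminate early
    return probs.index(confidence), confidence, index + 1

# Toy stand-ins for a small and a large nested subnetwork.
def small_stage(x, prev_embedding):
    return [1.0, 1.2], "small-embedding"

def large_stage(x, prev_embedding):
    return [0.0, 5.0], "large-embedding"

strict = progressive_inference([small_stage, large_stage], x=None, threshold=0.9)
lenient = progressive_inference([small_stage, large_stage], x=None, threshold=0.5)
```

With the strict threshold the small stage's confidence of roughly 0.55 is not enough, so the large stage runs. With the lenient threshold inference stops after the first stage, which is how easy inputs would save compute.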
Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI
As AI systems grow more capable, it becomes increasingly important that their decisions remain understandable and aligned with human expectations. A key challenge is the limited interpretability of deep models. Post-hoc methods like GradCAM offer heatmaps but provide limited conceptual insight, while prototype-based approaches offer example-based explanations but often rely on rigid region selection and lack semantic consistency. To address these limitations, we propose PCMNet, a part-prototypical concept mining network that learns human-comprehensible prototypes from meaningful image regions without additional supervision. By clustering these prototypes into concept groups and extracting concept activation vectors, PCMNet provides structured, concept-level explanations and enhances robustness to occlusion and challenging conditions, which are both critical for building reliable and aligned AI systems. Experiments across multiple image classification benchmarks show that PCMNet outperforms state-of-the-art methods in interpretability, stability, and robustness. This work contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems. Our code is available at: https://github.com/alehdaghi/PCMNet.
Vision Transformers with Self-Distilled Registers NeurIPS 2025
Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of existing large-scale pre-trained ViTs, in this paper we seek to add register tokens to existing models without needing to re-train from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
comment: NeurIPS 2025 Spotlight. Website: https://github.com/0raiser0/PH-Reg
Towards Cross-Domain Multi-Targeted Adversarial Attacks
Multi-targeted adversarial attacks aim to mislead classifiers toward specific target classes using a single perturbation generator with a conditional input specifying the desired target class. Existing methods face two key limitations: (1) a single generator supports only a limited number of predefined target classes, and (2) it requires access to the victim model's training data to learn target class semantics. This dependency raises data leakage concerns in practical black-box scenarios where the training data is typically private. To address these limitations, we propose a novel Cross-Domain Multi-Targeted Attack (CD-MTA) that can generate perturbations toward arbitrary target classes, even those that do not exist in the attacker's training data. CD-MTA is trained on a single public dataset but can perform targeted attacks on black-box models trained on different datasets with disjoint and unknown class sets. Our method requires only a single example image that visually represents the desired target class, without relying on its label, class distribution, or pretrained embeddings. We achieve this through a Feature Injection Module (FIM) and class-agnostic objectives which guide the generator to extract transferable, fine-grained features from the target image without inferring class semantics. Experiments on ImageNet and seven additional datasets show that CD-MTA outperforms existing multi-targeted attack methods on unseen target classes in black-box and cross-domain scenarios. The code is available at https://github.com/tgoncalv/CD-MTA.
comment: Under review
Point2Primitive: CAD Reconstruction from Point Cloud by Direct Primitive Prediction
Recovering CAD models from point clouds requires reconstructing their topology and sketch-based extrusion primitives. A dominant paradigm for representing sketches involves implicit neural representations such as Signed Distance Fields (SDFs). However, this indirect approach inherently struggles with precision, leading to unintended curved edges and models that are difficult to edit. In this paper, we propose Point2Primitive, a framework that learns to directly predict the explicit, parametric primitives of CAD models. Our method treats sketch reconstruction as a set prediction problem, employing an improved transformer-based decoder with explicit position queries to directly detect and predict the fundamental sketch curves (i.e., type and parameter) from the point cloud. Instead of approximating a continuous field, we formulate curve parameters as explicit position queries, which are optimized autoregressively to achieve high accuracy. The overall topology is rebuilt via extrusion segmentation. Extensive experiments demonstrate that this direct prediction paradigm significantly outperforms implicit methods in both primitive accuracy and overall geometric fidelity.
HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models AAAI 2026
State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inference on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with a minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Finally, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.
comment: Accepted at AAAI 2026 (Main Technical Track)
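As a quick sanity check on the example figures quoted in the abstract above, the sketch below computes the relative memory reduction implied by going from 15.8 GB to 3.2 GB. The function name is illustrative and is not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures quoted in the abstract: 15.8 GB before, 3.2 GB after compression.
reduction = relative_reduction(15.8, 3.2)
print(f"{reduction:.1%}")  # prints 79.7%, inside the reported 77.5-80.4% range
```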
ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS NeurIPS 2025
Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their models, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
comment: NeurIPS 2025, Project Page: https://lhmd.top/zpressor, Code: https://github.com/ziplab/ZPressor
Backdooring CLIP through Concept Confusion
Backdoor attacks pose a serious threat to deep learning models by allowing adversaries to implant hidden behaviors that remain dormant on clean inputs but are maliciously triggered at inference. Existing backdoor attack methods typically rely on explicit triggers such as image patches or pixel perturbations, which makes them easier to detect and limits their applicability in complex settings. To address this limitation, we take a different perspective by analyzing backdoor attacks through the lens of concept-level reasoning, drawing on insights from interpretable AI. We show that traditional attacks can be viewed as implicitly manipulating the concepts activated within a model's latent space. This motivates a natural question: can backdoors be built by directly manipulating concepts? To answer this, we propose the Concept Confusion Attack (CCA), a novel framework that designates human-understandable concepts as internal triggers, eliminating the need for explicit input modifications. By relabeling images that strongly exhibit a chosen concept and fine-tuning on this mixed dataset, CCA teaches the model to associate the concept itself with the attacker's target label. Consequently, the presence of the concept alone is sufficient to activate the backdoor, making the attack stealthier and more resistant to existing defenses. Using CLIP as a case study, we show that CCA achieves high attack success rates while preserving clean-task accuracy and evading state-of-the-art defenses.
Tracing and Mitigating Hallucinations in Multimodal LLMs via Dynamic Attention Localization
Multimodal Large Language Models (MLLMs) achieve strong performance on tasks like image captioning and visual question answering, but remain prone to hallucinations, where generated text conflicts with the visual input. Prior work links this partly to insufficient visual attention, but existing attention-based detection and mitigation methods typically apply uniform adjustments across layers and heads, obscuring where errors originate. In this paper, we first show these methods fail to accurately localize problematic layers. Then, we introduce two diagnostics: Layer Image Attention Entropy (LIAE), which flags anomalous layers, and Image Attention Focus (IAF), which scores attention heads within those layers. Analysis shows that LIAE pinpoints faulty layers and IAF reliably ranks heads that warrant correction. Guided by these signals, we propose Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a task-agnostic, attention-guided method that dynamically localizes and corrects errors during inference with negligible overhead. Furthermore, by establishing a connection between D-LEAF and DPO, we provide theoretical justification for the effectiveness of D-LEAF. Results show our D-LEAF delivers a 53% relative improvement on standard captioning benchmarks, and on VQA both accuracy and F1-score improve by approximately 4%, substantially suppressing hallucinations while preserving efficiency.
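The abstract does not define LIAE, so the following is only a minimal sketch of the kind of quantity an entropy-based attention diagnostic might build on: the Shannon entropy of one attention row restricted to image tokens. The function name, the renormalization step, and the toy weights are all assumptions; how the paper aggregates such values per layer or decides what counts as anomalous is not stated here.

```python
import math
from typing import Sequence


def image_attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of attention weights over image tokens.

    The weights are renormalized to sum to 1, so the input may be the
    image-token slice of a longer attention row (an assumption of this sketch).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


focused = [0.97, 0.01, 0.01, 0.01]  # attention concentrated on one image token
diffuse = [0.25, 0.25, 0.25, 0.25]  # attention spread uniformly

# Concentrated attention has lower entropy than uniform attention.
print(image_attention_entropy(focused) < image_attention_entropy(diffuse))  # True
```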
CamSAM2: Segment Anything Accurately in Camouflaged Videos
Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code is available at https://github.com/zhoustan/CamSAM2.
A comprehensive and easy-to-use multi-domain multi-task medical imaging meta-dataset
While the field of medical image analysis has undergone a transformative shift with the integration of machine learning techniques, the main challenge of these techniques is often the scarcity of large, diverse, and well-annotated datasets. Medical images vary in format, size, and other parameters and therefore require extensive preprocessing and standardization for use in machine learning. Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, all of which are standardized to the same format and readily usable in PyTorch or other ML frameworks. We perform a technical validation of MedIMeta, demonstrating its utility through fully supervised and cross-domain few-shot learning baselines.
LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit AAAI 2026
Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at https://github.com/ModelTC/LightCompress.
comment: Accepted by AAAI 2026
vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition
Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which includes the Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision
Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models like the Recurrent Model of Visual Attention (RAM) and Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention patterns that are either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose a Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling the functions of glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent balance between fixational and saccadic movements. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.
Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology
Accurate brain tumor classification is critical for intra-operative decision making in neuro-oncological surgery. However, existing approaches are restricted to a fixed set of predefined classes and are therefore unable to capture patterns of tumor types not available during training. Unsupervised learning can extract general-purpose features, but it lacks the ability to incorporate prior knowledge from labelled data, and semi-supervised methods often assume that all potential classes are represented in the labelled data. Generalized Category Discovery (GCD) aims to bridge this gap by categorizing both known and unknown classes within unlabelled data. To reflect the hierarchical structure of brain tumor taxonomies, in this work, we introduce Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), a novel approach that integrates hierarchical clustering with contrastive learning. Our method extends contrastive learning based GCD by incorporating a novel semi-supervised hierarchical clustering loss. We evaluate HGCD-BT on OpenSRH, a dataset of stimulated Raman histology brain tumor images, achieving a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly in identifying previously unseen tumor categories. Furthermore, we demonstrate the generalizability of HGCD-BT on slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas, confirming its utility across imaging modalities.
Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types
Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient's health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.
comment: Submitting to MIDL 2026
Nearest Neighbor Projection Removal Adversarial Training
Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that our proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering the Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that our method demonstrates strong performance that is competitive with leading adversarial training techniques, highlighting significant achievements in both robust and clean accuracy. Our findings reveal the importance of addressing inter-class feature proximity explicitly to bolster adversarial robustness in DNNs.
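The projection-removal step described above can be illustrated with a small sketch: find the nearest feature from a different class, then subtract the component of the sample's feature along it. This is not the paper's implementation. The helper names, the Euclidean distance used for the neighbor search, the use of a single neighbor, and the toy features are all assumptions.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def nearest_other_class(feature, features, labels, label) -> List[float]:
    """Return the closest feature (squared Euclidean) whose label differs."""
    candidates = [f for f, l in zip(features, labels) if l != label]
    if not candidates:
        raise ValueError("no feature from a different class")
    return min(candidates,
               key=lambda f: sum((x - y) ** 2 for x, y in zip(feature, f)))


def remove_projection(feature, neighbor) -> List[float]:
    """Subtract the component of `feature` that lies along `neighbor`."""
    denom = dot(neighbor, neighbor)
    if denom == 0:
        return list(feature)  # zero neighbor: nothing to project out
    scale = dot(feature, neighbor) / denom
    return [x - scale * n for x, n in zip(feature, neighbor)]


features = [[1.0, 1.0], [2.0, 0.0], [0.0, 5.0]]  # toy 2D features
labels = [0, 1, 1]

neighbor = nearest_other_class(features[0], features, labels, labels[0])
cleaned = remove_projection(features[0], neighbor)
print(neighbor, cleaned)  # [2.0, 0.0] [0.0, 1.0]
```

After the subtraction the cleaned feature is orthogonal to the chosen neighbor, which is the sense in which the inter-class component has been removed in this sketch.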
StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion
In the field of sketch generation, raster-format trained models often produce non-stroke artifacts, while vector-format trained models typically lack a holistic understanding of sketches, leading to compromised recognizability. Moreover, existing methods struggle to extract common features from similar elements (e.g., eyes of animals) appearing at varying positions across sketches. To address these challenges, we propose StrokeFusion, a two-stage framework for vector sketch generation. It contains a dual-modal sketch feature learning network that maps strokes into a high-quality latent space. This network decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps, representing sketches as sets of stroke feature vectors. Building upon this representation, our framework exploits a stroke-level latent diffusion model that simultaneously adjusts stroke position, scale, and trajectory during generation. This enables high-fidelity sketch generation while supporting stroke interpolation editing. Extensive experiments on the QuickDraw dataset demonstrate that our framework outperforms state-of-the-art techniques, validating its effectiveness in preserving structural integrity and semantic features. Code and models will be made publicly available upon publication.
Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency
In surrogate ensemble attacks, using more surrogate models yields higher transferability but lower resource efficiency. This practical trade-off between transferability and efficiency has largely limited existing attacks, even though many pre-trained models are easily accessible online. In this paper, we argue that such a trade-off is caused by an unnecessary common assumption, i.e., all models should be identical across iterations. By lifting this assumption, we can use as many surrogates as we want to unleash transferability without sacrificing efficiency. Concretely, we propose Selective Ensemble Attack (SEA), which dynamically selects diverse models (from easily accessible pre-trained models) across iterations based on our new interpretation of decoupling within-iteration and cross-iteration model diversity. In this way, the number of within-iteration models is fixed for maintaining efficiency, while only cross-iteration model diversity is increased for higher transferability. Experiments on ImageNet demonstrate the superiority of SEA in various scenarios. For example, when dynamically selecting 4 from 20 accessible models, SEA yields 8.5% higher transferability than existing attacks under the same efficiency. The superiority of SEA also generalizes to real-world systems, such as commercial vision APIs and large vision-language models. Overall, SEA opens up the possibility of adaptively balancing transferability and efficiency according to specific resource requirements.
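The decoupling of within-iteration and cross-iteration diversity can be sketched as a selection schedule: each iteration uses a fixed number of surrogates, while the subset changes from one iteration to the next. The abstract does not state SEA's selection criterion, so uniform random sampling is used here purely as a stand-in, and the model names are hypothetical.

```python
import random
from typing import List, Sequence


def select_surrogates(pool: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """Pick k distinct surrogate names for one attack iteration."""
    if not 0 < k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")
    return rng.sample(list(pool), k)


pool = [f"model_{i}" for i in range(20)]  # hypothetical surrogate names
rng = random.Random(0)                    # seeded for reproducibility
schedule = [select_surrogates(pool, 4, rng) for _ in range(10)]

# Per-iteration cost stays fixed at 4 models ...
per_iteration_cost = {len(step) for step in schedule}
# ... while the set of models seen over all iterations can be larger.
models_seen = {name for step in schedule for name in step}
print(per_iteration_cost, len(models_seen))
```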
Deepfake Detection that Generalizes Across Benchmarks
The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method, GenD, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and metric learning on it. We conducted an extensive evaluation on 14 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained foundational image encoder model. The code is at: https://github.com/yermandy/GenD
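The hyperspherical constraint mentioned above relies on L2 normalization, which rescales each feature vector to unit length so that comparisons depend only on direction. The sketch below shows that step in plain Python. The function names and the epsilon guard are assumptions, and the metric-learning loss and LayerNorm-only fine-tuning used by GenD are not reproduced here.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale `vector` to unit Euclidean length (eps guards against zero norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two L2-normalized vectors."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]

# Vectors that differ only in scale coincide on the unit hypersphere.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
```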
JAFAR: Jack up Any Feature at Any Resolution
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
comment: Code available at https://github.com/PaulCouairon/JAFAR
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
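To make the idea of a transition-based importance signal concrete, the sketch below scores each token by how much its representation changes in magnitude and direction between two points in the network, then keeps the highest-scoring tokens. The abstract gives no formula for TTV, so the unweighted sum of the two terms, the keep-largest rule, and all names are assumptions rather than the paper's definition. IGA is not modeled.

```python
import math
from typing import List, Sequence


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def transition_score(before: Sequence[float], after: Sequence[float],
                     eps: float = 1e-12) -> float:
    """Magnitude change plus direction change for one token (assumed form)."""
    nb, na = _norm(before), _norm(after)
    magnitude_change = abs(na - nb)
    cosine = sum(x * y for x, y in zip(before, after)) / max(nb * na, eps)
    direction_change = 1.0 - cosine
    return magnitude_change + direction_change


def keep_top_tokens(before_layer, after_layer, keep: int) -> List[int]:
    """Indices (ascending) of the `keep` tokens with the largest score."""
    scores = [transition_score(b, a) for b, a in zip(before_layer, after_layer)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


before = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
after = [[1.0, 0.0], [0.0, 2.0], [2.0, 0.0]]  # unchanged, rotated+scaled, scaled

print(keep_top_tokens(before, after, keep=2))  # [1, 2]
```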
Efficient SAR Vessel Detection for FPGA-Based On-Satellite Sensing
Rapid analysis of satellite imagery within minutes-to-hours of acquisition is increasingly vital for many remote sensing applications, and is an essential component for developing next-generation autonomous and distributed satellite systems. On-satellite machine learning (ML) has the potential for such rapid analysis, by overcoming latency associated with intermittent satellite connectivity to ground stations or relay satellites, but state-of-the-art models are often too large or power-hungry for on-board deployment. Vessel detection using Synthetic Aperture Radar (SAR) is a critical time-sensitive application in maritime security that exemplifies this challenge. SAR vessel detection has previously been demonstrated only by ML models that either are too large for satellite deployment, have not been developed for sufficiently low-power hardware, or have only been tested on small SAR datasets that do not sufficiently represent the difficulty of the real-world task. Here we systematically explore a suite of architectural adaptations to develop a novel YOLOv8 architecture optimized for this task and FPGA-based processing. We deploy our model on a Kria KV260 MPSoC, and show it can analyze a ~700 megapixel SAR image in less than a minute, within common satellite power constraints (<10W). Our model has detection and classification performance only ~2% and 3% lower than values from state-of-the-art GPU-based models on the largest and most diverse open SAR vessel dataset, xView3-SAR, despite being ~50 and ~2500 times more computationally efficient. This work represents a key contribution towards on-satellite ML for time-critical SAR analysis, and more autonomous, scalable satellites.
comment: 17 pages, 7 figures, 6 tables. To be presented in the 10th ACM/IEEE Symposium on Edge Computing (SEC '25)
Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism (mixing softmax and linear tokens) with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM, and evaluated on VBench, VBench2.0 and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos. Project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.
Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.
Towards Prospective Medical Image Reconstruction via Knowledge-Informed Dynamic Optimal Transport
Medical image reconstruction from measurement data is a vital but challenging inverse problem. Deep learning approaches have achieved promising results, but often require paired measurements and high-quality images, which are typically simulated through a forward model, i.e., retrospective reconstruction. However, training on simulated pairs commonly leads to performance degradation on real prospective data due to the retrospective-to-prospective gap caused by incomplete imaging knowledge in simulation. To address this challenge, this paper introduces imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a novel dynamic optimal transport framework with optimality in the sense of preserving consistency with imaging physics in transport, that conceptualizes reconstruction as finding a dynamic transport path. KIDOT learns from unpaired data by modeling reconstruction as a continuous evolution path from measurements to images, guided by an imaging knowledge-informed cost function and transport equation. This dynamic and knowledge-aware approach enhances robustness and better leverages unpaired data while respecting acquisition physics. Theoretically, we demonstrate that KIDOT naturally generalizes dynamic optimal transport, ensuring its mathematical rationale and solution existence. Extensive experiments on MRI and CT reconstruction demonstrate KIDOT's superior performance.
3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation
Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.
Self-NPO: Data-Free Diffusion Model Enhancement via Truncated Diffusion Fine-Tuning AAAI 2026
Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation. Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences. While existing PO methods primarily concentrate on producing favorable outputs, they often overlook the significance of classifier-free guidance (CFG) in mitigating undesirable results. Diffusion-NPO addresses this gap by introducing negative preference optimization (NPO), training models to generate outputs opposite to human preferences and thereby steering them away from unfavorable outcomes through CFG. However, prior NPO approaches rely on costly and fragile procedures for obtaining explicit preference annotations (e.g., manual pairwise labeling or reward model training), limiting their practicality in domains where such data are scarce or difficult to acquire. In this work, we propose Self-NPO, a data-free approach to negative preference optimization based on truncated diffusion fine-tuning, which learns directly from the model itself and eliminates the need for manual data labeling or reward model training. This approach is highly efficient (less than 1% of the training cost of Diffusion-NPO) and achieves performance comparable to Diffusion-NPO. We demonstrate that Self-NPO integrates seamlessly into widely used diffusion models, including SD1.5, SDXL, and CogVideoX, as well as models already optimized for human preferences, consistently enhancing both their generation quality and alignment with human preferences. Code is available at https://github.com/G-U-N/Diffusion-NPO.
comment: accepted by AAAI 2026
MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming
Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.
SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs AAAI2026
Visual language models (VLMs) have made significant progress in image captioning tasks, yet recent studies have found they are vulnerable to backdoor attacks. Attackers can inject undetectable perturbations into the data during inference, triggering abnormal behavior and generating malicious captions. These attacks are particularly challenging to detect and defend against due to the stealthiness and cross-modal propagation of the trigger signals. In this paper, we identify two key vulnerabilities by analyzing existing attack patterns: (1) the model exhibits abnormal attention concentration on certain regions of the input image, and (2) backdoor attacks often induce semantic drift and sentence incoherence. Based on these insights, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without requiring any prior knowledge of trigger patterns. SRD learns to apply discrete perturbations to sensitive contextual regions of image inputs via a deep Q-network policy, aiming to confuse attention and disrupt the activation of malicious paths. To guide policy optimization, we design a reward signal named semantic fidelity score, which jointly assesses the semantic consistency and linguistic fluency of the generated captions, encouraging the agent to achieve a robust yet faithful output. SRD offers a trigger-agnostic, policy-interpretable defense paradigm that effectively mitigates local (TrojVLM) and global (Shadowcast) backdoor attacks, reducing ASR to 3.6% and 5.6% respectively, with less than 15% average CIDEr drop on the clean inputs. Our codes can be found at https://github.com/Ciconey/SRD.git.
comment: AAAI2026
Towards Methane Detection Onboard Satellites
Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS, along with code at https://github.com/spaceml-org/plume-hunter.
SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries AAAI2026
Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their "in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios. In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with a regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, we devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency.
comment: Accepted by AAAI2026 Code: https://github.com/MSunDYY/SparseWorld
Segmentation and Smoothing Affect Explanation Quality More Than the Choice of Perturbation-based XAI Method for Image Explanations IJCNN 2025
Perturbation-based post-hoc image explanation methods are commonly used to explain image prediction models. These methods perturb parts of the input to measure how those parts affect the output. Since the methods only require the input and output, they can be applied to any model, making them a popular choice to explain black-box models. While many different methods exist and have been compared with one another, it remains poorly understood which parameters of the different methods are responsible for their varying performance. This work uses the Randomized Input Sampling for Explanations (RISE) method as a baseline to evaluate many combinations of mask sampling, segmentation techniques, smoothing, attribution calculation, and per-segment or per-pixel attribution, using a proxy metric. The results show that attribution calculation, which is frequently the focus of other works, has little impact on the results. Conversely, segmentation and per-pixel attribution, rarely examined parameters, have a significant impact. The implementation of and data gathered in this work are available online: https://github.com/guspih/post-hoc-image-perturbation and https://bit.ly/smooth-mask-perturbation.
comment: This manuscript has been published in IJCNN 2025
Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model AAAI 2026
Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions by utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.
comment: AAAI 2026 accepted
MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multimodal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy. This pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
Dereflection Any Image with Diffusion Priors and Diversified Data
Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.
Towards Collective Intelligence: Uncertainty-aware SAM Adaptation for Ambiguous Medical Image Segmentation
Collective intelligence from multiple medical experts consistently surpasses individual expertise in clinical diagnosis, particularly for ambiguous medical image segmentation tasks involving unclear tissue boundaries or pathological variations. The Segment Anything Model (SAM), a powerful vision foundation model originally designed for natural image segmentation, has shown remarkable potential when adapted to medical image segmentation tasks. However, existing SAM adaptation methods follow a single-expert paradigm, developing models based on individual expert annotations to predict deterministic masks. These methods systematically ignore the inherent uncertainty and variability in expert annotations, which fundamentally contradicts clinical practice, where multiple specialists provide different yet equally valid interpretations that collectively enhance diagnostic confidence. We propose an Uncertainty-aware Adapter, the first SAM adaptation framework designed to transition from single expert mindset to collective intelligence representation. Our approach integrates stochastic uncertainty sampling from a Conditional Variational Autoencoder into the adapters, enabling diverse prediction generation that captures expert knowledge distributions rather than individual expert annotations. We employ a novel position-conditioned control mechanism to integrate multi-expert knowledge, ensuring that the output distribution closely aligns with the multi-annotation distribution. Comprehensive evaluations across seven medical segmentation benchmarks have demonstrated that our collective intelligence-based adaptation achieves superior performance while maintaining computational efficiency, establishing a new adaptation framework for reliable clinical implementation.
Spatially-Aware Mixture of Experts with Log-Logistic Survival Modeling for Whole-Slide Images
Accurate survival prediction from histopathology whole-slide images (WSIs) remains challenging due to their gigapixel resolution, strong spatial heterogeneity, and complex survival distributions. We introduce a comprehensive computational pathology framework that addresses these limitations through four complementary innovations: (1) Quantile-Gated Patch Selection for dynamically identifying prognostically relevant regions, (2) Graph-Guided Clustering to group patches by spatial and morphological similarity, (3) Hierarchical Context Attention to model both local tissue interactions and global slide-level context, and (4) an Expert-Driven Mixture of Log-Logistics module that flexibly models complex survival distributions. Across large TCGA cohorts, our method achieves state-of-the-art performance, yielding time-dependent concordance indices of 0.644 on LUAD, 0.751 on KIRC, and 0.752 on BRCA, consistently outperforming both histology-only and multimodal baselines. The framework further provides improved calibration and interpretability, advancing the use of WSIs for personalized cancer prognosis.
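A mixture of log-logistic distributions, as named in the abstract above, can be illustrated with a minimal sketch. It assumes the standard log-logistic survival function S(t) = 1 / (1 + (t / scale)^shape) and a convex combination of components; the function names, the example weights, and the way weights are normalized are hypothetical and are not taken from the paper, whose gating and expert design are not described here.

```python
def log_logistic_survival(t, scale, shape):
    """Standard log-logistic survival function S(t) = 1 / (1 + (t/scale)**shape)."""
    if t < 0 or scale <= 0 or shape <= 0:
        raise ValueError("t must be >= 0; scale and shape must be > 0")
    return 1.0 / (1.0 + (t / scale) ** shape)


def mixture_survival(t, weights, scales, shapes):
    """Survival function of a mixture: weighted average of component survivals.

    The weights are normalized here for convenience (an assumption of this
    sketch); in a learned model they would come from some gating mechanism.
    """
    total = sum(weights)
    return sum(
        (w / total) * log_logistic_survival(t, a, b)
        for w, a, b in zip(weights, scales, shapes)
    )


# Hypothetical two-component mixture evaluated at a few time points.
for t in (0.0, 1.0, 2.0, 5.0):
    print(t, mixture_survival(t, weights=[0.6, 0.4], scales=[2.0, 4.0], shapes=[1.5, 3.0]))
```

Each component has its median at `scale`, so mixing components with different scales and shapes gives a survival curve that a single log-logistic cannot represent.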
edgeVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer
Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between Large and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLM inference. Based on this paradigm, we design edgeVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision-language reasoning tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems.
Image and Video Processing
Smooth Total variation Regularization for Interference Detection and Elimination (STRIDE) for MRI
MRI is increasingly expected to operate near electronic devices that emit potentially dynamic electromagnetic interference (EMI). To accommodate this, we propose the STRIDE method, which improves on previous external-sensor-based EMI removal methods by exploiting inherent MR image smoothness in its total variation. STRIDE measures data from both EMI detectors and primary MR imaging coils, transforms this data into the image domain, and for each column of the resulting image array, combines and subtracts data from the EMI detectors in a way that optimizes for total-variation smoothness. Performance was tested on phantom and in-vivo datasets with a 0.5T scanner. STRIDE resulted in visually better EMI removal, higher temporal SNR, larger EMI removal percentage, and lower RMSE than standard implementations. STRIDE is a robust technique that leverages inherent MR image properties to provide improved EMI removal performance over standard algorithms, particularly for time-varying noise sources.
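The per-column idea in the abstract above (subtract a combination of detector data chosen to make the result smooth in a total-variation sense) can be sketched in a heavily simplified form. This is not the STRIDE implementation: it assumes a single real-valued detector, one scalar coefficient per column, a smoothed total variation of the form sum(sqrt(diff^2 + eps)), and a ternary search over a fixed interval. All names and parameters are hypothetical.

```python
import math


def smooth_tv(column, eps=1e-6):
    """Smoothed total variation of a 1-D column: sum of sqrt(diff**2 + eps)."""
    return sum(math.sqrt((b - a) ** 2 + eps) for a, b in zip(column, column[1:]))


def remove_interference(primary, detector, lo=-10.0, hi=10.0, iters=100):
    """Find the scalar c that minimizes smooth_tv(primary - c * detector).

    The cost is convex in c, so a ternary search on [lo, hi] converges.
    Returns the coefficient and the cleaned column.
    """
    def cost(c):
        return smooth_tv([p - c * d for p, d in zip(primary, detector)])

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    c = 0.5 * (lo + hi)
    return c, [p - c * d for p, d in zip(primary, detector)]


# Toy column: a piecewise-constant signal contaminated by 2.5x the detector trace.
clean = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0]
detector = [0.3, -0.5, 0.8, -0.2, 0.6, -0.9, 0.1, 0.4]
primary = [s + 2.5 * d for s, d in zip(clean, detector)]

coefficient, cleaned = remove_interference(primary, detector)
print(round(coefficient, 2))
```

In this toy case the smoothest result is obtained when the detector trace is subtracted with a coefficient close to the one used to contaminate the column, which is why a smoothness criterion can recover the coupling without knowing it in advance.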
PyPeT: A Python Perfusion Tool for Automated Quantitative Brain CT and MR Perfusion Analysis
Computed tomography perfusion (CTP) and magnetic resonance perfusion (MRP) are widely used in acute ischemic stroke assessment and other cerebrovascular conditions to generate quantitative maps of cerebral hemodynamics. While commercial perfusion analysis software exists, it is often costly, closed source, and lacks customizability. This work introduces PyPeT, an openly available Python Perfusion Tool for head CTP and MRP processing. PyPeT is capable of producing cerebral blood flow (CBF), cerebral blood volume (CBV), mean transit time (MTT), time-to-peak (TTP), and time-to-maximum (Tmax) maps from raw four-dimensional perfusion data. PyPeT aims to make perfusion research as accessible and customizable as possible. This is achieved through a unified framework in which both CTP and MRP data can be processed, with a strong focus on modularity, low computational burden, and significant inline documentation. PyPeT's outputs can be validated through an extensive debug mode in which every step of the process is visualized. Additional validation was performed via visual and quantitative comparison with reference perfusion maps generated by three FDA-approved commercial perfusion tools and a research tool. These comparisons show a mean SSIM around 0.8 for all comparisons, indicating a good and stable correlation with FDA-approved tools. The code for PyPeT is openly available at our GitHub https://github.com/Marijn311/CT-and-MR-Perfusion-Tool
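Two of the simpler quantities behind the maps listed above can be sketched for a single voxel. This is an illustration under stated assumptions, not PyPeT's code: time-to-peak is taken as the acquisition time at which the voxel's curve is maximal, and the area under the curve is computed with the trapezoidal rule (in standard indicator-dilution analysis this area is an ingredient of CBV, which also requires normalization by an arterial input function that is omitted here). The function names and the toy curve are hypothetical.

```python
def time_to_peak(curve, times):
    """Return the acquisition time at which the curve reaches its maximum."""
    peak_index = max(range(len(curve)), key=lambda i: curve[i])
    return times[peak_index]


def area_under_curve(curve, times):
    """Trapezoidal integral of the curve over the acquisition times."""
    return sum(
        0.5 * (curve[i] + curve[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(curve) - 1)
    )


# Toy contrast curve for one voxel, sampled every 2 seconds.
times = [0, 2, 4, 6, 8]
curve = [0, 1, 4, 2, 0]

print(time_to_peak(curve, times))      # 4
print(area_under_curve(curve, times))  # 14.0
```

Applying the same two functions to every voxel of a four-dimensional volume yields a TTP map and an unnormalized area map.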
A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning
Emergency Medical Technicians (EMTs) operate in high-pressure environments, making rapid, life-critical decisions under heavy cognitive and operational loads. We present EMSGlass, a smart-glasses system powered by EMSNet, the first multimodal multitask model for Emergency Medical Services (EMS), and EMSServe, a low-latency multimodal serving framework tailored to EMS scenarios. EMSNet integrates text, vital signs, and scene images to construct a unified real-time understanding of EMS incidents. Trained on real-world multimodal EMS datasets, EMSNet simultaneously supports up to five critical EMS tasks with superior accuracy compared to state-of-the-art unimodal baselines. Built on top of PyTorch, EMSServe introduces a modality-aware model splitter and a feature caching mechanism, achieving adaptive and efficient inference across heterogeneous hardware while addressing the challenge of asynchronous modality arrival in the field. By optimizing multimodal inference execution in EMS scenarios, EMSServe achieves 1.9x to 11.7x speedup over direct PyTorch multimodal inference. A user study evaluation with six professional EMTs demonstrates that EMSGlass enhances real-time situational awareness, decision-making speed, and operational efficiency through intuitive on-glass interaction. In addition, qualitative insights from the user study provide actionable directions for extending EMSGlass toward next-generation AI-enabled EMS systems, bridging multimodal intelligence with real-world emergency response workflows.
Inertia-Informed Orientation Priors for Event-Based Optical Flow Estimation
Event cameras, by virtue of their working principle, directly encode motion within a scene. Many learning-based and model-based methods exist that estimate event-based optical flow; however, the temporally dense yet spatially sparse nature of events poses significant challenges. To address these issues, contrast maximization (CM) is a prominent model-based optimization methodology that estimates the motion trajectories of events within an event volume by optimally warping them. Since its introduction, the CM framework has undergone a series of refinements by the computer vision community. Nonetheless, it remains a highly non-convex optimization problem. In this paper, we introduce a novel biologically-inspired hybrid CM method for event-based optical flow estimation that couples visual and inertial motion cues. Concretely, we propose the use of orientation maps, derived from camera 3D velocities, as priors to guide the CM process. The orientation maps provide directional guidance and constrain the space of estimated motion trajectories. We show that this orientation-guided formulation leads to improved robustness and convergence in event-based optical flow estimation. The evaluation of our approach on the MVSEC, DSEC, and ECD datasets yields superior accuracy scores over the state of the art.
comment: 13 pages, 9 figures, and 3 tables
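The basic contrast maximization step that the entry above builds on can be sketched in one spatial dimension. This is plain CM with an exhaustive search, not the paper's orientation-guided method: events are (x, t) pairs, each candidate velocity warps events back to a reference time, and the candidate whose image of warped events has the highest variance is selected. The function names, the 1-D setting, and the candidate grid are assumptions made for illustration.

```python
def warp_image(events, velocity, width, t_ref=0.0):
    """Warp events to t_ref along a candidate velocity and count them per pixel."""
    image = [0] * width
    for x, t in events:
        x_warped = round(x - velocity * (t - t_ref))
        if 0 <= x_warped < width:
            image[x_warped] += 1
    return image


def contrast(image):
    """Variance of the image of warped events (the CM objective used here)."""
    mean = sum(image) / len(image)
    return sum((p - mean) ** 2 for p in image) / len(image)


def best_velocity(events, candidates, width):
    """Pick the candidate velocity that maximizes the contrast."""
    return max(candidates, key=lambda v: contrast(warp_image(events, v, width)))


# Toy data: one edge starting at x = 2 and moving at 3 pixels per time unit.
events = [(2 + 3 * t, t) for t in range(5)]
candidates = [0, 1, 2, 3, 4]

print(best_velocity(events, candidates, width=20))  # 3
```

With the correct velocity all events collapse onto one pixel, giving the sharpest image. In this toy setting, a directional prior of the kind described in the abstract would correspond to narrowing `candidates` before the search.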
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5x while retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling. Project website: https://cryosense.github.io.
BrainNormalizer: Anatomy-Informed Pseudo-Healthy Brain Reconstruction from Tumor MRI via Edge-Guided ControlNet
Brain tumors are among the most clinically significant neurological diseases and remain a major cause of morbidity and mortality due to their aggressive growth and structural heterogeneity. As tumors expand, they induce substantial anatomical deformation that disrupts both local tissue organization and global brain architecture, complicating diagnosis, treatment planning, and surgical navigation. Yet a subject-specific reference of how the brain would appear without tumor-induced changes is fundamentally unobtainable in clinical practice. We present BrainNormalizer, an anatomy-informed diffusion framework that reconstructs pseudo-healthy MRIs directly from tumorous scans by conditioning the generative process on boundary cues extracted from the subject's own anatomy. This boundary-guided conditioning enables anatomically plausible pseudo-healthy reconstruction without requiring paired non-tumorous and tumorous scans. BrainNormalizer employs a two-stage training strategy. The pretrained diffusion model is first adapted through inpainting-based fine-tuning on tumorous and non-tumorous scans. Next, an edge-map-guided ControlNet branch is trained to inject fine-grained anatomical contours into the frozen decoder while preserving learned priors. During inference, a deliberate misalignment strategy pairs tumorous inputs with non-tumorous prompts and mirrored contralateral edge maps, leveraging hemispheric correspondence to guide reconstruction. On the BraTS2020 dataset, BrainNormalizer achieves strong quantitative performance and qualitatively produces anatomically plausible reconstructions in tumor-affected regions while retaining overall structural coherence. BrainNormalizer provides clinically reliable anatomical references for treatment planning and supports new research directions in counterfactual modeling and tumor-induced deformation analysis.
PoCGM: Poisson-Conditioned Generative Model for Sparse-View CT Reconstruction
In computed tomography (CT), reducing the number of projection views is an effective strategy to lower radiation exposure and/or improve temporal resolution. However, this often results in severe aliasing artifacts and loss of structural details in reconstructed images, posing significant challenges for clinical applications. Inspired by the success of the Poisson Flow Generative Model (PFGM++) in natural image generation, we propose a PoCGM (Poisson-Conditioned Generative Model) to address the challenges of sparse-view CT reconstruction. Since PFGM++ was originally designed for unconditional generation, it lacks direct applicability to medical imaging tasks that require integrating conditional inputs. To overcome this limitation, the PoCGM reformulates PFGM++ into a conditional generative framework by incorporating sparse-view data as guidance during both training and sampling phases. By modeling the posterior distribution of full-view reconstructions conditioned on sparse observations, PoCGM effectively suppresses artifacts while preserving fine structural details. Qualitative and quantitative evaluations demonstrate that PoCGM outperforms the baselines, achieving improved artifact suppression, enhanced detail preservation, and reliable performance in dose-sensitive and time-critical imaging scenarios.
comment: 18th International Meeting on Fully 3D Image Reconstruction in Radiology and Nuclear Medicine, Shanghai, CHINA, 2025
Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar WACV 2026
Real-time imaging sonar has become an important tool for underwater monitoring in environments where optical sensing is unreliable. Its broader use is constrained by two coupled challenges: highly limited uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) that affect up to 98% of frames. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions. SCOPE combines (i) Adaptive Codebook Compression (ACC), which learns frequency-encoded latent representations tailored to sonar, with (ii) Frequency-Aware Multiscale Segmentation (FAMS), which decomposes frames into low-frequency structure and sparse high-frequency dynamics while suppressing rapidly fluctuating artifacts. A hedging training strategy further guides frequency-aware learning using low-pass proxy pairs generated without labels. Evaluated on months of in-situ ARIS sonar data, SCOPE achieves a structural similarity index (SSIM) of 0.77, representing a 40% improvement over prior self-supervised denoising baselines, at bitrates of 0.0118 bpp or lower. It reduces uplink bandwidth by more than 80% while improving downstream detection. The system runs in real time, with 3.1 ms encoding on an embedded GPU and 97 ms full multi-layer decoding on the server end. SCOPE has been deployed for months in three Pacific Northwest rivers to support real-time salmon enumeration and environmental monitoring in the wild. Results demonstrate that learning frequency-structured latents enables practical, low-bitrate sonar streaming with preserved signal details under real-world deployment conditions.
comment: Accepted to WACV 2026
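To put the quoted bitrate in perspective, the sketch below converts bits per pixel (bpp) into bytes per frame and a sustained link rate. The frame size (512 x 1024) and frame rate (10 fps) are hypothetical values chosen only for illustration; they are not taken from the paper, and the helper names are likewise made up.

```python
def bytes_per_frame(bpp: float, width: int, height: int) -> float:
    """Compressed size of one frame in bytes at a given bits-per-pixel rate."""
    return bpp * width * height / 8.0


def link_rate_kbps(bpp: float, width: int, height: int, fps: float) -> float:
    """Sustained uplink rate in kilobits per second (1 kbit = 1000 bits)."""
    return bpp * width * height * fps / 1000.0


# Hypothetical sonar frame geometry and frame rate (assumptions, not from the paper).
WIDTH, HEIGHT, FPS = 512, 1024, 10
BPP = 0.0118  # bitrate quoted in the abstract

frame_bytes = bytes_per_frame(BPP, WIDTH, HEIGHT)
rate_kbps = link_rate_kbps(BPP, WIDTH, HEIGHT, FPS)
print(f"{frame_bytes:.1f} bytes/frame, {rate_kbps:.1f} kbit/s")
```

Under these assumed dimensions a frame compresses to well under a kilobyte, which is the regime where a constrained uplink becomes workable.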
Fully Differentiable dMRI Streamline Propagation in PyTorch SP
Diffusion MRI (dMRI) provides a distinctive means to probe the microstructural architecture of living tissue, facilitating applications such as brain connectivity analysis, modeling across multiple conditions, and the estimation of macrostructural features. Tractography, which emerged in the final years of the 20th century and accelerated in the early 21st century, is a technique for visualizing white matter pathways in the brain using dMRI. Most diffusion tractography methods rely on procedural streamline propagators or global energy minimization methods. Although recent advancements in deep learning have enabled tasks that were previously challenging, existing tractography approaches are often non-differentiable, limiting their integration in end-to-end learning frameworks. While progress has been made in representing streamlines in differentiable frameworks, no existing method offers fully differentiable propagation. In this work, we propose a fully differentiable solution that retains numerical fidelity with a leading streamline algorithm. The key is that our PyTorch-engineered streamline propagator has no components that block gradient flow, making it fully differentiable. We show that our method matches standard propagators while remaining differentiable. By translating streamline propagation into a differentiable PyTorch framework, we enable deeper integration of tractography into deep learning workflows, laying the foundation for a new category of macrostructural reasoning that is not only computationally robust but also scientifically rigorous.
comment: 9 pages, 4 figures. Accepted to SPIE Medical Imaging 2026: Image Processing
Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers
In this work, we propose Fast Equivariant Imaging (FEI), a novel unsupervised learning framework to rapidly and efficiently train deep imaging networks without ground-truth data. By reformulating the Equivariant Imaging optimization problem via the method of Lagrange multipliers and utilizing plug-and-play denoisers, this unsupervised scheme achieves superior efficiency and performance compared to the vanilla Equivariant Imaging paradigm. In particular, our FEI schemes achieve an order-of-magnitude (10x) acceleration over standard EI when training U-Net for X-ray CT reconstruction and image inpainting, with improved generalization performance.
JAFAR: Jack up Any Feature at Any Resolution
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
comment: Code available at https://github.com/PaulCouairon/JAFAR
Towards Prospective Medical Image Reconstruction via Knowledge-Informed Dynamic Optimal Transport
Medical image reconstruction from measurement data is a vital but challenging inverse problem. Deep learning approaches have achieved promising results, but often require paired measurements and high-quality images, which are typically simulated through a forward model, i.e., retrospective reconstruction. However, training on simulated pairs commonly leads to performance degradation on real prospective data due to the retrospective-to-prospective gap caused by incomplete imaging knowledge in simulation. To address this challenge, this paper introduces imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a novel dynamic optimal transport framework that conceptualizes reconstruction as finding a dynamic transport path, with optimality defined in the sense of preserving consistency with imaging physics during transport. KIDOT learns from unpaired data by modeling reconstruction as a continuous evolution path from measurements to images, guided by an imaging knowledge-informed cost function and transport equation. This dynamic and knowledge-aware approach enhances robustness and better leverages unpaired data while respecting acquisition physics. Theoretically, we demonstrate that KIDOT naturally generalizes dynamic optimal transport, ensuring its mathematical rationale and solution existence. Extensive experiments on MRI and CT reconstruction demonstrate KIDOT's superior performance.
Systematic validation of time-resolved diffuse optical simulators via non-contact SPAD-based measurements
Objective: Time-domain diffuse optical imaging (DOI) requires accurate forward models for photon propagation in scattering media. However, existing simulators lack comprehensive experimental validation, especially for non-contact configurations with oblique illumination. This study rigorously evaluates three widely used open-source simulators, namely MMC, NIRFASTer, and Toast++, using time-resolved experimental data. Approach: All simulations employed a unified mesh and point-source illumination. Virtual source correction was applied to FEM solvers for oblique incidence. A time-resolved DOI system with a 32 $\times$ 32 single-photon avalanche diode (SPAD) array acquired transmission-mode data from 16 standardized phantoms with varying absorption coefficient $\mu_a$ and reduced scattering coefficient $\mu_s'$. The simulation results were quantified across five metrics: spatial-domain (SD) precision, time-domain (TD) precision, oblique beam accuracy, computational speed, and mesh-density independence. Results: Among the three simulators, MMC achieves superior accuracy in SD and TD metrics and is robust across all optical properties. NIRFASTer and Toast++ demonstrate comparable overall performance. In general, MMC is optimal for accuracy-critical TD-DOI applications, while NIRFASTer and Toast++ suit scenarios that prioritize speed and have sufficiently large $\mu_s'$. In addition, virtual source correction is essential for non-contact FEM modeling, reducing average errors by > 34% in large-angle scenarios. Significance: This work provides benchmarked guidelines for simulator selection during the development phase of next-generation TD-DOI systems. Our work represents the first study to systematically validate TD simulators against SPAD array-based data under clinically relevant non-contact conditions, bridging a critical gap in biomedical optical simulation standards.
Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model AAAI 2026
Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions by utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.
comment: AAAI 2026 accepted
Towards Collective Intelligence: Uncertainty-aware SAM Adaptation for Ambiguous Medical Image Segmentation
Collective intelligence from multiple medical experts consistently surpasses individual expertise in clinical diagnosis, particularly for ambiguous medical image segmentation tasks involving unclear tissue boundaries or pathological variations. The Segment Anything Model (SAM), a powerful vision foundation model originally designed for natural image segmentation, has shown remarkable potential when adapted to medical image segmentation tasks. However, existing SAM adaptation methods follow a single-expert paradigm, developing models based on individual expert annotations to predict deterministic masks. These methods systematically ignore the inherent uncertainty and variability in expert annotations, which fundamentally contradicts clinical practice, where multiple specialists provide different yet equally valid interpretations that collectively enhance diagnostic confidence. We propose an Uncertainty-aware Adapter, the first SAM adaptation framework designed to transition from a single-expert mindset to a collective-intelligence representation. Our approach integrates stochastic uncertainty sampling from a Conditional Variational Autoencoder into the adapters, enabling diverse prediction generation that captures expert knowledge distributions rather than individual expert annotations. We employ a novel position-conditioned control mechanism to integrate multi-expert knowledge, ensuring that the output distribution closely aligns with the multi-annotation distribution. Comprehensive evaluations across seven medical segmentation benchmarks demonstrate that our collective intelligence-based adaptation achieves superior performance while maintaining computational efficiency, establishing a new adaptation framework for reliable clinical implementation.
Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach
Automated interpretation of CT images, particularly localizing and describing abnormal findings across multi-plane and whole-body scans, remains a significant challenge in clinical radiology. This work aims to address this challenge through four key contributions: (i) On taxonomy, we collaborate with senior radiologists to propose a comprehensive hierarchical classification system, with 404 representative abnormal findings across all body regions; (ii) On data, we contribute a dataset containing over 14.5K CT images from multiple planes and all human body regions, and meticulously provide grounding annotations for over 19K abnormalities, each linked to a detailed description and cast into the taxonomy; (iii) On model development, we propose OmniAbnorm-CT, which can automatically ground and describe abnormal findings on multi-plane and whole-body CT images based on text queries, while also allowing flexible interaction through visual prompts; (iv) On evaluation, we establish three representative tasks based on real clinical scenarios, and introduce a clinically grounded metric to assess abnormality descriptions. Through extensive experiments, we show that OmniAbnorm-CT can significantly outperform existing methods in both internal and external validations, and across all the tasks.
comment: 40 pages
Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation AAAI-26
Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled k-t space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. In this work, we propose MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Furthermore, we introduce a new INR network architecture tailored to the CMR problem, which significantly stabilizes model optimization. Experiments on retrospective (simulated) datasets demonstrate the superiority of MoCo-INR over state-of-the-art methods, achieving fast convergence and fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x in VISTA sampling). Additionally, evaluations on prospective (real-acquired) free-breathing CMR scans highlight the clinical practicality of MoCo-INR for real-time imaging. Several ablation studies further confirm the effectiveness of the critical components of MoCo-INR.
comment: Accepted by AAAI-26
Unsupervised patch-based dynamic MRI reconstruction using learnable tensor function with implicit neural representation
Dynamic MRI suffers from limited spatiotemporal resolution due to long acquisition times. Undersampling k-space accelerates imaging but makes accurate reconstruction challenging. Supervised deep learning methods achieve impressive results but rely on large fully sampled datasets, which are difficult to obtain. Recently, implicit neural representations (INR) have emerged as a powerful unsupervised paradigm that reconstructs images from a single undersampled dataset without external training data. However, existing INR-based methods still face challenges when applied to highly undersampled dynamic MRI, mainly due to their inefficient representation capacity and high computational cost. To address these issues, we propose TenF-INR, a novel unsupervised framework that integrates low-rank tensor modeling with INR, where each factor matrix in the tensor decomposition is modeled as a learnable factor function. Specifically, we employ INR to model learnable tensor functions within a low-rank decomposition, reducing the parameter space and computational burden. A patch-based nonlocal tensor modeling strategy further exploits temporal correlations and inter-patch similarities, enhancing the recovery of fine spatiotemporal details. Experiments on dynamic cardiac and abdominal datasets demonstrate that TenF-INR achieves up to 21-fold acceleration, outperforming both supervised and unsupervised state-of-the-art methods in image quality, temporal fidelity, and quantitative accuracy.
Fine-grained Image Quality Assessment for Perceptual Image Restoration AAAI2026
Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, existing IQA metrics exhibit inherent weaknesses on IR tasks, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore is also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, we conduct a comprehensive benchmark of existing IQA metrics, which reveals significant inconsistencies between score-based IQA evaluations and fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released at https://sxfly99.github.io/FGResQ-Homepage.
comment: Accepted by AAAI2026
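One common way to quantify agreement between a scalar IQA metric and pairwise preference annotations is pairwise ranking accuracy: the fraction of annotated pairs for which the metric scores the preferred image higher. The sketch below illustrates that idea only; the function name, the data layout, and the toy scores are hypothetical and are not drawn from FGRestore or FGResQ.

```python
from typing import Dict, List, Tuple


def pairwise_accuracy(scores: Dict[str, float],
                      prefs: List[Tuple[str, str]]) -> float:
    """Fraction of (preferred, other) pairs ranked correctly by the metric.

    A pair counts as correct only if the preferred image gets a strictly
    higher score; ties count as errors in this simplified version.
    """
    if not prefs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for preferred, other in prefs
                  if scores[preferred] > scores[other])
    return correct / len(prefs)


# Toy data (made up): images "a" and "b" have nearly identical scores,
# yet annotators prefer "b", a fine-grained difference the metric misses.
toy_scores = {"a": 0.71, "b": 0.70, "c": 0.40}
toy_prefs = [("b", "a"), ("a", "c"), ("b", "c")]

accuracy = pairwise_accuracy(toy_scores, toy_prefs)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In this toy case the metric orders the coarse pairs correctly but fails on the closely matched pair, which is the kind of inconsistency a score-only evaluation can hide.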
Subjective and Objective Quality Evaluation of Super-Resolution Enhanced Broadcast Images on a Novel SR-IQA Dataset
Super-Resolution (SR) is essential for displaying low-quality broadcast content on high-resolution screens. Recently, SR methods have been developed that not only increase resolution while preserving the original image information but also enhance the perceived quality. However, evaluating the quality of SR images generated from low-quality sources, such as SR-enhanced broadcast content, is challenging due to the need to consider both distortions and improvements. Additionally, assessing SR image quality without original high-quality sources presents another significant challenge. Unfortunately, there has been a dearth of research specifically addressing the Image Quality Assessment (IQA) of SR images under these conditions. In this work, we introduce a new IQA dataset for SR broadcast images in both 2K and 4K resolutions. We conducted a subjective quality evaluation to obtain Mean Opinion Score (MOS) for these SR images and performed a comprehensive human study to identify key factors influencing perceived quality. Finally, we evaluated the performance of existing IQA metrics on our dataset. This study reveals the limitations of current metrics, highlighting the need for a more robust IQA metric that better correlates with the perceived quality of SR images. The proposed dataset and the subjective evaluation platform are publicly available at https://sites.google.com/hanyang.ac.kr/ivml/datasets/sreb.
comment: Accepted for publication in IEEE Access
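How well a metric correlates with perceived quality is often summarized by Spearman's rank correlation coefficient (SRCC) between metric outputs and MOS. The sketch below computes MOS as the mean of per-image ratings and SRCC as the Pearson correlation of tie-aware average ranks. All ratings and metric values are made-up toy data, not taken from the dataset, and the helper names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: three subjective ratings per image, and one metric value per image.
ratings = [[4, 5, 4], [2, 3, 2], [3, 3, 4], [1, 2, 1]]
mos = [mean(r) for r in ratings]
metric = [0.80, 0.55, 0.50, 0.20]

rho = srcc(metric, mos)
print(f"SRCC between metric and MOS: {rho:.2f}")
```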
Multimedia
Scaling Spatial Intelligence with Multimodal Foundation Models
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
comment: Model: https://huggingface.co/collections/sensenova/sensenova-si; Code: https://github.com/OpenSenseNova/SenseNova-SI
Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirce's Manuscripts with Vision-Language Models
Diagrams are crucial yet underexplored tools in many disciplines, demonstrating the close connection between visual representation and scholarly reasoning. However, their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. In particular, Charles S. Peirce consistently advocated the use of diagrams as essential for reasoning and explanation. His manuscripts, often combining textual content with complex visual artifacts, provide a challenging case for studying documents involving heterogeneous materials. In this preliminary study, we investigate whether Visual Language Models (VLMs) can effectively help us identify and interpret such hybrid pages in context. First, we propose a workflow that (i) segments manuscript page layouts, (ii) reconnects each segment to IIIF-compliant annotations, and (iii) submits fragments containing diagrams to a VLM. In addition, by adopting Peirce's semiotic framework, we designed prompts to extract key knowledge about diagrams and produce concise captions. Finally, we integrated these captions into knowledge graphs, enabling structured representations of diagrammatic content within composite sources.
Towards Practical Real-Time Low-Latency Music Source Separation
In recent years, significant progress has been made in the field of deep learning for music demixing. However, limited attention has been paid to real-time, low-latency music demixing, which holds potential for various applications, such as hearing aids, audio stream remixing, and live performances. Additionally, a notable tendency has emerged towards the development of larger models, limiting their applicability in certain scenarios. In this paper, we introduce a lightweight real-time low-latency model called Real-Time Single-Path TFC-TDF UNET (RT-STT), which is based on the Dual-Path TFC-TDF UNET (DTTNet). In RT-STT, we propose a feature fusion technique based on channel expansion. We also demonstrate the superiority of single-path modeling over dual-path modeling in real-time models. Moreover, we investigate quantization to further reduce inference time. RT-STT exhibits superior performance with significantly fewer parameters and shorter inference times compared to state-of-the-art models.
Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar WACV 2026
Real-time imaging sonar has become an important tool for underwater monitoring in environments where optical sensing is unreliable. Its broader use is constrained by two coupled challenges: highly limited uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) that affect up to 98% of frames. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions. SCOPE combines (i) Adaptive Codebook Compression (ACC), which learns frequency-encoded latent representations tailored to sonar, with (ii) Frequency-Aware Multiscale Segmentation (FAMS), which decomposes frames into low-frequency structure and sparse high-frequency dynamics while suppressing rapidly fluctuating artifacts. A hedging training strategy further guides frequency-aware learning using low-pass proxy pairs generated without labels. Evaluated on months of in-situ ARIS sonar data, SCOPE achieves a structural similarity index (SSIM) of 0.77, representing a 40% improvement over prior self-supervised denoising baselines, at bitrates down to <= 0.0118 bpp. It reduces uplink bandwidth by more than 80% while improving downstream detection. The system runs in real time, with 3.1 ms encoding on an embedded GPU and 97 ms full multi-layer decoding on the server end. SCOPE has been deployed for months in three Pacific Northwest rivers to support real-time salmon enumeration and environmental monitoring in the wild. Results demonstrate that learning frequency-structured latents enables practical, low-bitrate sonar streaming with preserved signal details under real-world deployment conditions.
comment: Accepted to WACV 2026
Fine-grained Image Quality Assessment for Perceptual Image Restoration AAAI2026
Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, existing IQA metrics exhibit inherent weaknesses on IR tasks, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore is also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, we conduct a comprehensive benchmark of existing IQA metrics, which reveals significant inconsistencies between score-based IQA evaluations and fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released at https://sxfly99.github.io/FGResQ-Homepage.
comment: Accepted by AAAI2026
Computation and Language
Generalist Foundation Models Are Not Clinical Enough for Hospital Operations
Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
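AUROC, the metric reported above, equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal pure-Python sketch of that definition follows; the labels and scores are toy values, not ReMedE data, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = event occurred (e.g. readmission), 0 = it did not.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]

value = auroc(labels, scores)
print(f"AUROC: {value:.3f}")
```

A value of 0.5 corresponds to chance-level ranking, which helps in reading the zero-shot range quoted in the abstract: part of that range lies below 0.5.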
Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues
Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of training examples to effectively distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret. In this work, we explore using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that can differentiate deceptive reviews from genuine ones. We show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena either in LLMs' prior knowledge or obtained through in-context learning. These language phenomena have the potential to aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
P1: Mastering Physics Olympiads with Reinforcement Learning
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning, the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift: it binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies. In this work, we advance physics research by developing large language models with exceptional physics reasoning capabilities that especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, earning a silver medal. Further equipped with the agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions ranks No. 1 overall on IPhO 2025 and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also perform strongly on other reasoning tasks such as math and coding, showing the strong generalizability of the P1 series.
Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.76% on the public LoCoMo benchmark, a nearly 3% improvement over LangMem, the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement over A-Mem, the previous state-of-the-art. O-Mem also improves token efficiency and interaction response time compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation
Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models
The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches, e.g., AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across character-, word-, and sentence-level operations to enhance attack diversity; then, it employs interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates that ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets LREC 2026
The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets -- BEA-Large and BEA-Dialogue -- constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18% on spontaneous and 4.8% on repeated speech. Diarization experiments yield diarization error rates between 13.05% and 18.26%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.
comment: Submitted to LREC 2026
Applying Large Language Models to Characterize Public Narratives
Public Narratives (PNs) are key tools for leadership development and civic mobilization, yet their systematic analysis remains challenging due to their subjective interpretation and the high cost of expert annotation. In this work, we propose a novel computational framework that leverages large language models (LLMs) to automate the qualitative annotation of public narratives. Using a codebook we co-developed with subject-matter experts, we evaluate LLM performance against that of expert annotators. Our work reveals that LLMs can reach near-human-expert performance, with an average F1 score of 0.80 across 8 narratives and 14 codes. We then extend our analysis to empirically explore how PN framework elements manifest across a larger dataset of 22 stories. Lastly, we extrapolate our analysis to a set of political speeches, establishing a novel lens through which to analyze political rhetoric in civic spaces. This study demonstrates the potential of LLM-assisted annotation for scalable narrative analysis and highlights key limitations and directions for future research in computational civic storytelling.
Aspect-Level Obfuscated Sentiment in Thai Financial Disclosures and Its Impact on Abnormal Returns
Understanding sentiment in financial documents is crucial for gaining insights into market behavior. These reports often contain obfuscated language designed to present a positive or neutral outlook, even when underlying conditions may be less favorable. This paper presents a novel approach using Aspect-Based Sentiment Analysis (ABSA) to decode obfuscated sentiment in Thai financial annual reports. We develop specific guidelines for annotating obfuscated sentiment in these texts and annotate more than one hundred financial reports. We then benchmark various text classification models on this annotated dataset, demonstrating strong performance in sentiment classification. Additionally, we conduct an event study to evaluate the real-world implications of our sentiment analysis on stock prices. Our results suggest that market reactions are selectively influenced by specific aspects within the reports. Our findings underscore the complexity of sentiment analysis in financial texts and highlight the importance of addressing obfuscated language to accurately assess market sentiment.
Non-Linear Scoring Model for Translation Quality Evaluation
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
comment: ongoing work, 38 pages
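As a rough illustration of the calibration step described in the abstract above, the sketch below fits E(x) = a * ln(1 + b * x) through two tolerance points. The abstract only mentions "a one-dimensional root-finding step"; the use of bisection, the function names, and the example numbers are assumptions made here, not the authors' implementation. Dividing the two point equations eliminates a, leaving a single equation in b whose left-hand side falls monotonically from x2/x1 to 1 as b grows.

```python
import math


def calibrate_log_tolerance(x1, e1, x2, e2, iters=200):
    """Fit E(x) = a * ln(1 + b * x) through (x1, e1) and (x2, e2).

    Hypothetical helper: solves ln(1 + b*x2) / ln(1 + b*x1) = e2 / e1
    for b by bisection, then recovers a from the first point.
    """
    if not (0 < x1 < x2 and 0 < e1 < e2):
        raise ValueError("need 0 < x1 < x2 and 0 < e1 < e2")
    target = e2 / e1
    if not (1.0 < target < x2 / x1):
        raise ValueError("tolerance must grow sublinearly: 1 < e2/e1 < x2/x1")

    def ratio(b):
        # Decreases monotonically from x2/x1 (b -> 0) to 1 (b -> infinity).
        return math.log1p(b * x2) / math.log1p(b * x1)

    lo, hi = 1e-12, 1.0
    while ratio(hi) > target:  # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:
            lo = mid  # b too small: the curve is still too close to linear
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = e1 / math.log1p(b * x1)
    return a, b


def tolerance(x, a, b):
    """Dynamic error tolerance for a sample of x words."""
    return a * math.log1p(b * x)


if __name__ == "__main__":
    # Illustrative tolerance points only; not taken from the paper.
    a, b = calibrate_log_tolerance(1000, 10.0, 5000, 25.0)
    print(a, b, tolerance(2000, a, b))
```

Bisection is chosen here only because the monotone ratio guarantees a unique root inside the bracket; any other one-dimensional solver would serve equally well.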
Exploring Multi-Table Retrieval Through Iterative Search
Open-domain question answering over datalakes requires retrieving and composing information from multiple tables, a challenging subtask that demands semantic relevance and structural coherence (e.g., joinability). While exact optimization methods like Mixed-Integer Programming (MIP) can ensure coherence, their computational complexity is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find these coherent, joinable sets. This paper frames multi-table retrieval as an iterative search process, arguing this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across 5 NL2SQL benchmarks demonstrate that our iterative method achieves competitive retrieval performance compared to the MIP-based approach while being 4-400x faster depending on the benchmark and search space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
comment: Accepted @ the AI for Tabular Data Workshop, EurIPS 2025
Attention Grounded Enhancement for Visual Document Retrieval
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.
Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction AAAI 2026
With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users' subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.
comment: Accepted by AAAI 2026 (Oral)
Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts
With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.
Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning
Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages and their combinations remains poorly understood. We conduct a controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, treating task and language as transfer axes while conditioning on model family and size; we fine-tune each model on a single task-language source and measure transfer as the percentage-point change versus its baseline score when evaluated on all other task-language target pairs. We decompose transfer into (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language) regimes. We uncover two consistent general patterns. First, a pronounced on-task vs. off-task asymmetry: Matched-Task (Cross-Language) transfer is reliably positive, whereas off-task transfer often incurs collateral degradation. Second, a stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients). We outline implications for risk-aware fine-tuning and model specialisation.
AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects
The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.
AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research AAAI 2026
Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K) datasets, along with our methodology and evaluation framework.
comment: Accepted at AAAI 2026 (oral)
RegionMarker: A Region-Triggered Semantic Watermarking Framework for Embedding-as-a-Service Copyright Protection AAAI 2026
Embedding-as-a-Service (EaaS) is an effective and convenient deployment solution for addressing various NLP tasks. Nevertheless, recent research has shown that EaaS is vulnerable to model extraction attacks, which could lead to significant economic losses for model providers. For copyright protection, existing methods inject watermark embeddings into text embeddings and use them to detect copyright infringement. However, current watermarking methods often resist only a subset of attacks and fail to provide comprehensive protection. To this end, we present the region-triggered semantic watermarking framework called RegionMarker, which defines trigger regions within a low-dimensional space and injects watermarks into text embeddings associated with these regions. By utilizing a secret dimensionality reduction matrix to project onto this subspace and randomly selecting trigger regions, RegionMarker makes it difficult for watermark removal attacks to evade detection. Furthermore, by embedding watermarks across the entire trigger region and using the text embedding as the watermark, RegionMarker is resilient to both paraphrasing and dimension-perturbation attacks. Extensive experiments on various datasets show that RegionMarker is effective in resisting different attack methods, thereby protecting the copyright of EaaS.
comment: AAAI 2026
Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment AAAI 2026
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
comment: Accepted to AAAI 2026
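To make the uncertainty quantities named in the abstract above concrete, the sketch below computes total entropy, conditional entropy, and mutual information for a binary moral choice from repeated stochastic forward passes. It assumes the standard decomposition total = conditional + mutual information; the abstract does not spell out the exact form used, and the function names and sample probabilities are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-way choice made with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def decompose_uncertainty(probs):
    """Split uncertainty over dropout samples into its components.

    probs: P(choose option A) from each stochastic forward pass
    (hypothetical input format).
    """
    if not probs:
        raise ValueError("need at least one sampled probability")
    mean_p = sum(probs) / len(probs)
    total = binary_entropy(mean_p)  # entropy of the averaged prediction
    conditional = sum(binary_entropy(p) for p in probs) / len(probs)
    return {
        "total": total,
        "conditional": conditional,
        "mutual_information": total - conditional,  # disagreement between passes
    }


if __name__ == "__main__":
    # Passes that disagree with each other raise mutual information.
    print(decompose_uncertainty([0.9, 0.2, 0.7, 0.4]))
```

Under this reading, passes that are individually confident but disagree with one another raise mutual information while leaving conditional entropy low, which matches the pattern the abstract reports for dropout.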
Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
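The non-uniform weighted averaging at the core of model souping can be sketched in a few lines. This is a minimal illustration only: checkpoints are represented as plain dictionaries of flat float lists, the function name is hypothetical, and the abstract does not describe how SoCE selects experts or optimizes the weights, so the weights are simply passed in.

```python
def weighted_soup(checkpoints, weights):
    """Average same-architecture checkpoints with non-uniform weights.

    checkpoints: list of {parameter_name: list of floats} (assumed format)
    weights: one non-negative weight per checkpoint, normalized to sum to 1
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    soup = {}
    for name in names:
        size = len(checkpoints[0][name])
        soup[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(size)
        ]
    return soup


if __name__ == "__main__":
    math_expert = {"layer.weight": [1.0, 3.0]}
    tool_expert = {"layer.weight": [3.0, 5.0]}
    print(weighted_soup([math_expert, tool_expert], [1.0, 3.0]))
```

Uniform souping is the special case where all weights are equal.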
Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms
This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends -- beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families -- word-frequency, topic modeling, word embedding, and LLM-based approaches -- and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.
comment: 46 pages, 8 figures, 2 tables, accepted for publication in Quality & Quantity
Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms AACL 2025
With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
comment: Accepted to IJCNLP-AACL 2025
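The distractor selection described in the abstract above relies on phonemic edit distance. The sketch below computes Levenshtein distance over phoneme sequences and picks the three nearest candidates. Choosing the closest transcriptions is an assumption made here (the abstract only says distractors are selected "based on" the distance), and the toy lexicon and its phoneme symbols are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if phonemes match
            ))
        prev = curr
    return prev[-1]


def pick_distractors(target, lexicon, k=3):
    """Return the k words whose transcriptions are closest to target.

    lexicon: {word: phoneme list} (hypothetical format). Ties are broken
    alphabetically so the selection is deterministic.
    """
    scored = [
        (edit_distance(target, phones), word)
        for word, phones in lexicon.items()
        if phones != target
    ]
    scored.sort()
    return [word for _, word in scored[:k]]


if __name__ == "__main__":
    lexicon = {
        "cat": ["k", "ae", "t"],
        "bat": ["b", "ae", "t"],
        "cut": ["k", "ah", "t"],
        "cap": ["k", "ae", "p"],
        "dog": ["d", "ao", "g"],
    }
    print(pick_distractors(lexicon["cat"], lexicon))
```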
Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study
Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
Translation Entropy: A Statistical Framework for Evaluating Translation Systems
The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators based on deep encoder-decoder architectures exist; nevertheless, no quantitative, objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ from a given pivot sentence in only one selected token yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, each built around a selected pivot token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities determine the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and an objective basis for benchmarking artificial translators. Results are based on the MarianMT, T5-Base, and NLLB-200 translators.
comment: 23 pages, 6 figures and 8 tables
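The averaging step described in this abstract can be illustrated with a minimal sketch. It assumes the per-token entropy is the Shannon entropy (in bits) of the replacement probabilities; the abstract does not state the exact definition, and the function names and sample distributions below are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy in bits of one token's replacement probabilities.

    Assumption: the abstract only says the probabilities "constitute the
    entropy"; Shannon entropy is used here as the most common reading.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

def translation_entropy(per_token_probs):
    """Average the per-token entropies over all selected pivot tokens."""
    entropies = [token_entropy(probs) for probs in per_token_probs]
    return sum(entropies) / len(entropies)

# Hypothetical replacement distributions for three pivot tokens.
pivot_distributions = [
    [0.5, 0.5],                # two equally likely replacements -> 1 bit
    [0.25, 0.25, 0.25, 0.25],  # four equally likely replacements -> 2 bits
    [1.0],                     # no replacement preserves the translation -> 0 bits
]
estimate = translation_entropy(pivot_distributions)
```

With these made-up inputs the per-token entropies are 1, 2 and 0 bits, so the averaged estimate is 1 bit.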
TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine
Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the "In-depth Challenge for Comprehensive TCM Abilities" special track.
comment: 17 pages, 8 figures
Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis
Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels AACL
Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.
comment: Accepted in AACL-IJCNLP 2025
A Comparative Analysis of Recurrent and Attention Architectures for Isolated Sign Language Recognition
This study presents a systematic comparative analysis of recurrent and attention-based neural architectures for isolated sign language recognition. We implement and evaluate two representative models, ConvLSTM and Vanilla Transformer, on the Azerbaijani Sign Language Dataset (AzSLD) and the Word-Level American Sign Language (WLASL) dataset. Our results demonstrate that the attention-based Vanilla Transformer consistently outperforms the recurrent ConvLSTM in both Top-1 and Top-5 accuracy across datasets, achieving up to 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL. The ConvLSTM, while more computationally efficient, lags in recognition accuracy, particularly on smaller datasets. These findings highlight the complementary strengths of each paradigm: the Transformer excels in overall accuracy and signer independence, whereas the ConvLSTM offers advantages in computational efficiency and temporal modeling. The study provides a nuanced analysis of these trade-offs, offering guidance for architecture selection in sign language recognition systems depending on application requirements and resource constraints.
Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction AAAI 2026
Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs--such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks--retrieval, planning, coding, and verification--each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC-GQJ/Agent-Event-Coder.
comment: 11 pages, 5 figures, accepted by AAAI 2026 (Oral)
Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study
The Consolidated Standards of Reporting Trials statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials. Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a golden standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen's Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score > 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
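The two headline metrics in this abstract, macro-averaged F1 and Cohen's Kappa, follow standard definitions and can be sketched in plain Python. The label names and the small set of annotations below are hypothetical and are not taken from the study's dataset.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cohen_kappa(y_true, y_pred):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical expert labels and model predictions for eight checklist items.
labels = ["compliant", "non-compliant", "not applicable"]
expert = ["compliant"] * 4 + ["non-compliant"] * 2 + ["not applicable"] * 2
model = ["compliant"] * 4 + ["compliant", "non-compliant",
                             "compliant", "not applicable"]
f1 = macro_f1(expert, model, labels)
kappa = cohen_kappa(expert, model)
```

In this toy example the model over-predicts the majority class, which keeps the "compliant" F1 high while pulling the two minority classes and the kappa value down, the same qualitative pattern the abstract reports.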
BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models
We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., "just"), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in the arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.
STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization
Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
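The idea of a smoothed success-rate record guiding resampling can be illustrated with a small sketch. Everything specific here is an assumption rather than the paper's method: the exponential moving average, the smoothing factor `alpha`, the allocation rule proportional to one minus the success rate, and the function names are all hypothetical.

```python
def update_success_rate(previous, outcome, alpha=0.1):
    """Exponential moving average of a task's success rate.

    Assumption: the abstract says only that the record is "smoothed";
    an EMA is one common way to do that. `outcome` is 1.0 for a
    successful trajectory and 0.0 for a failed one.
    """
    return (1 - alpha) * previous + alpha * outcome

def allocate_samples(success_rates, budget):
    """Split a sampling budget so that harder tasks get more trajectories.

    Assumption: weights are proportional to (1 - success rate). Rounding
    means the allocations may not sum exactly to `budget`.
    """
    weights = {task: 1.0 - rate for task, rate in success_rates.items()}
    total = sum(weights.values())
    if total == 0:  # every task is already solved; fall back to an even split
        return {task: budget // len(weights) for task in weights}
    return {task: round(budget * w / total) for task, w in weights.items()}

# Hypothetical tasks with very different smoothed success rates.
rates = {"easy_task": 0.9, "hard_task": 0.1}
allocation = allocate_samples(rates, budget=10)
```

Under these assumptions the low-success task receives most of the budget, which matches the abstract's description of allocating more effort to harder tasks.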
Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training
Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provide an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
How Good is BLI as an Alignment Measure: A Study in Word Embedding Paradigm
Sans a dwindling number of monolingual embedding studies originating predominantly from the low-resource domains, it is evident that multilingual embedding has become the de facto choice due to its adaptability to the usage of code-mixed languages, granting the ability to process multilingual documents in a language-agnostic manner, as well as removing the difficult task of aligning monolingual embeddings. But is this victory complete? Are the multilingual models better than aligned monolingual models in every aspect? Can the higher computational cost of multilingual models always be justified? Or is there a compromise between the two extremes? Bilingual Lexicon Induction is one of the most widely used metrics in terms of evaluating the degree of alignment between two embedding spaces. In this study, we explore the strengths and limitations of BLI as a measure to evaluate the degree of alignment of two embedding spaces. Further, we evaluate how well traditional embedding alignment techniques, novel multilingual models, and combined alignment techniques perform BLI tasks in the contexts of both high-resource and low-resource languages. In addition to that, we investigate the impact of the language families to which the pairs of languages belong. We identify that BLI does not measure the true degree of alignment in some cases and we propose solutions for them. We propose a novel stem-based BLI approach to evaluate two aligned embedding spaces that take into account the inflected nature of languages as opposed to the prevalent word-based BLI techniques. Further, we introduce a vocabulary pruning technique that is more informative in showing the degree of the alignment, especially performing BLI on multilingual embedding models. Often, combined embedding alignment techniques perform better while in certain cases multilingual embeddings perform better (mainly low-resource language cases).
comment: 15 pages, 2 figures, 6 tables
AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models
Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
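A bounded index of this kind can be sketched as follows. The abstract does not give the formula, so the one used here is an assumption chosen to match the stated properties: it ranges from -100 to 100, it is 0 when correct and incorrect answers balance, and abstentions are not penalized. The function name and counts are hypothetical.

```python
def omniscience_index(correct, incorrect, abstained):
    """Assumed form: 100 * (correct - incorrect) / total questions.

    Abstentions count toward the total but add nothing to the score, so
    abstaining is better than answering incorrectly and worse than
    answering correctly.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total

# Hypothetical outcomes over 6,000 questions.
balanced = omniscience_index(correct=3000, incorrect=3000, abstained=0)
cautious = omniscience_index(correct=2000, incorrect=1000, abstained=3000)
```

Under this assumed formula, a model that guesses on everything and is right half the time scores 0, while a model that answers fewer questions but is wrong less often scores above zero.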
PragWorld: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations and Conversational Dynamics AAAI 2026
Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is a requirement for successful natural communication, and often requires building a local world model which encodes such elements and captures the dynamics of their evolving states. However, it is not well-understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular datasets and construct two benchmarks comprising yes-no questions. We evaluate a wide range of open and closed source LMs and observe that they struggle to maintain robust accuracy. Our analysis unveils that LMs struggle to memorize crucial details, such as tracking entities under linguistic alterations to conversations. We then propose a dual-perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights linguistic alterations most influenced by harmful layers, typically due to encoding spurious signals or relying on shortcuts. Inspired by these insights, we propose two layer-regularization based fine-tuning strategies that suppress the effect of the harmful layers.
comment: 23 pages, 15 tables, 10 figures; AAAI 2026 Conference Main Track (oral)
WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance
Multimodal LLM-powered agents have recently demonstrated impressive capabilities in web navigation, enabling agents to complete complex browsing tasks across diverse domains. However, current agents struggle with repetitive errors and lack the ability to learn from past experiences across sessions, limiting their long-term robustness and sample efficiency. We introduce WebCoach, a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining. WebCoach consists of three key components: (1) a WebCondenser, which standardizes raw navigation logs into concise summaries; (2) an External Memory Store, which organizes complete trajectories as episodic experiences; and (3) a Coach, which retrieves relevant experiences based on similarity and recency, and decides whether to inject task-specific advice into the agent via runtime hooks. This design empowers web agents to access long-term memory beyond their native context window, improving robustness in complex browsing tasks. Moreover, WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories, enabling agents to improve over time without retraining. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents across three different LLM backbones. With a 38B model, it increases task success rates from 47% to 61% while reducing or maintaining the average number of steps. Notably, smaller base models with WebCoach achieve performance comparable to the same web agent using GPT-4o.
comment: 18 pages; work in progress
Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty AAAI 2026
The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models' ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.
comment: Accepted by AAAI 2026 Main Track
Visual Room 2.0: Seeing is Not Understanding for MLLMs
Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle's Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition. Evaluating 10 state-of-the-art (SoTA) MLLMs, we highlight three key findings: (1) MLLMs exhibit stronger perceptual competence than cognitive ability (8.0%↑); (2) cognition appears not causally dependent on perception-based reasoning; and (3) cognition scales with model size, but perception does not consistently improve with larger variants. This work operationalizes Seeing ≠ Understanding as a testable hypothesis, offering a new paradigm from perceptual processing to cognitive reasoning in MLLMs. Our dataset is available at https://huggingface.co/datasets/LHK2003/PCBench.
Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both AIO and FS, FS also often links to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
comment: 18 pages, 10 figures; to appear in AAAI ICWSM 2026
Classification of Hope in Textual Data using Transformer-Based Models
This paper presents a transformer-based approach for classifying hope expressions in text. We developed and compared three architectures (BERT, GPT-2, and DeBERTa) for both binary classification (Hope vs. Not Hope) and multiclass categorization (five hope-related categories). Our initial BERT implementation achieved 83.65% binary and 74.87% multiclass accuracy. In the extended comparison, BERT demonstrated superior performance (84.49% binary, 72.03% multiclass accuracy) while requiring significantly fewer computational resources (443s vs. 704s training time) than newer architectures. GPT-2 showed lowest overall accuracy (79.34% binary, 71.29% multiclass), while DeBERTa achieved moderate results (80.70% binary, 71.56% multiclass) but at substantially higher computational cost (947s for multiclass training). Error analysis revealed architecture-specific strengths in detecting nuanced hope expressions, with GPT-2 excelling at sarcasm detection (92.46% recall). This study provides a framework for computational analysis of hope, with applications in mental health and social media analysis, while demonstrating that architectural suitability may outweigh model size for specialized emotion detection tasks.
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
comment: Survey; 7 figures, 3 tables, 44 pages
NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation
Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.
Quantifying consistency and accuracy of Latent Dirichlet Allocation
Topic modelling in Natural Language Processing uncovers hidden topics in large, unlabelled text datasets. It is widely applied in fields such as information retrieval, content summarisation, and trend analysis across various disciplines. However, probabilistic topic models can produce different results when rerun due to their stochastic nature, leading to inconsistencies in latent topics. Factors like corpus shuffling, rare text removal, and document elimination contribute to these variations. This instability affects replicability, reliability, and interpretation, raising concerns about whether topic models capture meaningful topics or just noise. To address these problems, we defined a new stability measure that incorporates accuracy and consistency and uses the generative properties of LDA to generate a new corpus with ground truth. These generated corpora are run through LDA 50 times to determine the variability in the output. We show that LDA can correctly determine the underlying number of topics in the documents. We also find that LDA is more internally consistent, as the multiple reruns return similar topics; however, these topics are not the true topics.
comment: 8 pages, 3 figures, to be submitted
Hint-Augmented Re-ranking: Efficient Product Search using LLM-Based Query Decomposition AACL 2025
Search queries with superlatives (e.g., best, most popular) require comparing candidates across multiple dimensions, demanding linguistic understanding and domain knowledge. We show that LLMs can uncover latent intent behind these expressions in e-commerce queries through a framework that extracts structured interpretations or hints. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval, enabling efficient integration into the ranking pipeline. Our method improves search performance by 10.9 points in MAP and ranking by 5.9 points in MRR over baselines. Since direct LLM-based reranking faces prohibitive latency, we develop an efficient approach transferring superlative interpretations to lightweight models. Our findings provide insights into how superlative semantics can be represented and transferred between models, advancing linguistic interpretation in retrieval systems while addressing practical deployment constraints.
comment: AACL 2025
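The two ranking metrics reported in this abstract, MAP and MRR, have standard definitions that can be sketched in a few lines. The query results and relevance sets below are hypothetical, and the helper names are illustrative only.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    if not relevant_ids:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranking, relevant set) pairs."""
    return sum(metric(ranking, relevant) for ranking, relevant in runs) / len(runs)

# Hypothetical rankings for two product-search queries.
runs = [
    (["a", "b", "c"], {"a", "c"}),  # relevant items at ranks 1 and 3
    (["x", "y", "z"], {"y"}),       # relevant item at rank 2
]
mrr = mean_over_queries(reciprocal_rank, runs)
map_score = mean_over_queries(average_precision, runs)
```

Both functions return values between 0 and 1; gains quoted in "points", as in the abstract, are usually read on the corresponding 0 to 100 scale.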
Show and Tell: Prompt Strategies for Style Control in Multi-Turn LLM Code Generation
Language models generate functionally correct code that tends toward excessive verbosity, with elaborate documentation and defensive patterns that diverge from human baselines. Two prompting mechanisms have emerged for stylistic control: instruction based prompts that articulate abstract directives, and example based prompts that provide concrete code demonstrations. The core problem is whether stylistic constraints persist when models enhance initial implementations with additional features while maintaining high functional accuracy. Here we show that instruction-based, example-based, and combined prompts produce distinct patterns of initial control and expansion discipline over one enhancement turn. We manipulated system prompts across four conditions in a paired two-turn protocol where models first generated solutions to an intermediate Python task, then revised their code under general improvement directives, holding the user task fixed (N = 160 paired programs). Combined prompts produced the strongest initial compression and greatest expansion discipline. Instructions showed large initial effects and moderate expansion discipline. Examples showed modest initial effects with no expansion discipline. These results show that initial prompt effectiveness and expansion discipline are separate aspects of prompt design, and that combined approaches provide the most stable stylistic control in this two-turn workflow.
comment: 23 pages, 2 figures, 3 tables. Under review
EchoAgent: Guideline-Centric Reasoning Agent for Echocardiography Measurement and Interpretation
Purpose: Echocardiographic interpretation requires video-level reasoning and guideline-based measurement analysis, which current deep learning models for cardiac ultrasound do not support. We present EchoAgent, a framework that enables structured, interpretable automation for this domain. Methods: EchoAgent orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. A key contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, enabling autonomous tool selection. We curated a benchmark of diverse, clinically validated video-query pairs for evaluation. Results: EchoAgent achieves accurate, interpretable results despite added complexity of spatiotemporal video analysis. Outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability. Conclusion: This work demonstrates the feasibility of agentic, guideline-aligned reasoning for echocardiographic video analysis, enabled by task-specific tools and full video-level automation. EchoAgent sets a new direction for trustworthy AI in cardiac ultrasound.
comment: 12 pages, Under Review
What Works for 'Lost-in-the-Middle' in LLMs? A Study on GM-Extract and Mitigations
The diminishing ability of large language models (LLMs) to effectively utilize long-range context, known as the "lost-in-the-middle" phenomenon, poses a significant challenge in retrieval-based LLM applications. To study the impact of this phenomenon in a real-world application setting, we introduce GM-Extract, a novel benchmark dataset meticulously designed to evaluate LLM performance on retrieval of control variables. To accurately diagnose failure modes, we propose a simple yet elegant evaluation system using two distinct metrics: one for spatial retrieval capability (Document Metric) and the other for semantic retrieval capability (Variable Extraction Metric). We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering), demonstrating a significant change in retrieval performance simply by altering how the data is represented in the context window. While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models, which we further correlate with perplexity scores. Furthermore, we perform a literature survey of mitigation methods, which we categorize into two distinct approaches: black-box and white-box methods. We then apply these techniques to our benchmark, finding that their efficacy is highly nuanced. Our evaluation highlights scenarios where these strategies successfully improve performance, as well as surprising cases where they lead to a negative impact, providing a comprehensive understanding of their utility in a practical context.
comment: To be submitted for publication
Can QE-informed (Re)Translation lead to Error Correction? EMNLP 2025
The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3 - Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach - QE-informed Retranslation, and compare it with another within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved a Delta COMET score of 0.0201 and -0.0108, respectively, leading the first approach to achieve the winning position on the subtask leaderboard.
comment: 10 pages, 3 figures, WMT25 Shared Task in EMNLP 2025 Conference
When AI Does Science: Evaluating the Autonomous AI Scientist KOSMOS in Radiation Biology
Agentic AI "scientists" now use language models to search the literature, run analyses, and generate hypotheses. We evaluate KOSMOS, an autonomous AI scientist, on three problems in radiation biology using simple random-gene null benchmarks. Hypothesis 1: baseline DNA damage response (DDR) capacity across cell lines predicts the p53 transcriptional response after irradiation (GSE30240). Hypothesis 2: baseline expression of OGT and CDO1 predicts the strength of repressed and induced radiation-response modules in breast cancer cells (GSE59732). Hypothesis 3: a 12-gene expression signature predicts biochemical recurrence-free survival after prostate radiotherapy plus androgen deprivation therapy (GSE116918). The DDR-p53 hypothesis was not supported: DDR score and p53 response were weakly negatively correlated (Spearman rho = -0.40, p = 0.76), indistinguishable from random five-gene scores. OGT showed only a weak association (r = 0.23, p = 0.34), whereas CDO1 was a clear outlier (r = 0.70, empirical p = 0.0039). The 12-gene signature achieved a concordance index of 0.61 (p = 0.017) but a non-unique effect size. Overall, KOSMOS produced one well-supported discovery, one plausible but uncertain result, and one false hypothesis, illustrating that AI scientists can generate useful ideas but require rigorous auditing against appropriate null models.
comment: 13 pages, 3 figures, preprint
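The "empirical p" values in the abstract above come from comparing an observed statistic with the same statistic computed on random gene sets. A minimal sketch of that idea follows. The two-sided comparison on absolute values and the add-one correction are common conventions assumed here, not details confirmed by the abstract, and the null values are made-up toy numbers.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null statistics at least as extreme as the observed one.

    Assumptions: two-sided test on absolute values, with the usual
    add-one correction so the estimate is never exactly zero.
    """
    at_least_as_extreme = sum(1 for v in null_values if abs(v) >= abs(observed))
    return (at_least_as_extreme + 1) / (len(null_values) + 1)


# Hypothetical correlations obtained from four random gene sets.
toy_null = [0.1, -0.2, 0.8, 0.3]
p = empirical_p_value(0.7, toy_null)
```

With only four null draws the smallest reachable value is 1/5, which is why studies of this kind typically draw many random gene sets before quoting a small empirical p.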
Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles AACL
Detecting media bias is crucial, particularly in the South Asian region. Despite this, annotated datasets and computational studies for Bangla political bias research remain scarce. Crucially, political stance detection in Bangla news requires an understanding of linguistic cues, cultural context, subtle biases, rhetorical strategies, code-switching, implicit sentiment, and socio-political background. To address this, we introduce the first benchmark dataset of 200 politically significant and highly debated Bangla news articles, labeled for government-leaning, government-critique, and neutral stances, alongside diagnostic analyses for evaluating large language models (LLMs). Our comprehensive evaluation of 28 proprietary and open-source LLMs shows strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00). Models also tend to over-predict government-leaning stances, often misinterpreting ambiguous narratives. This dataset and its associated diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.
comment: Accepted to BLP at AACL-IJCNLP 2025
DataGen: Unified Synthetic Dataset Generation via Large Language Models
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents DataGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. DataGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by DataGen, and each module within DataGen plays a critical role in this enhancement. Additionally, DataGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that DataGen effectively supports dynamic and evolving benchmarking and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning AAAI-26
We study an efficient implementation of Multi-Armed Bandit (MAB)-based Monte-Carlo Tree Search (MCTS) for classical planning. One weakness of MCTS is that it spends significant time deciding which node to expand next. While selecting a node from an OPEN list with $N$ nodes has $O(1)$ runtime complexity with traditional array-based priority-queues for dense integer keys, the tree-based OPEN list used by MCTS requires $O(\log N)$, which roughly corresponds to the search depth $d$. In classical planning, $d$ is arbitrarily large (e.g., $2^k-1$ in $k$-disk Tower-of-Hanoi) and the runtime for node selection is significant, unlike in game tree search, where the cost is negligible compared to the node evaluation (rollouts) because $d$ is inherently limited by the game (e.g., $d\leq 361$ in Go). To address this bottleneck, we propose a bilevel modification to MCTS that runs a best-first search from each selected leaf node with an expansion budget proportional to $d$, which achieves amortized $O(1)$ runtime for node selection, equivalent to the traditional queue-based OPEN list. In addition, we introduce Tree Collapsing, an enhancement that reduces action selection steps and further improves the performance.
comment: Accepted in AAAI-26
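The abstract above contrasts tree-based node selection with the "traditional array-based priority-queues for dense integer keys". The sketch below shows one classic structure of that kind, a bucket queue, purely as background. It is not the paper's bilevel MCTS. The class name, the `max_key` bound, and the LIFO tie-breaking are assumptions made for illustration.

```python
class BucketQueue:
    """Array-based priority queue for integer keys in the range [0, max_key]."""

    def __init__(self, max_key):
        self.buckets = [[] for _ in range(max_key + 1)]
        self.min_key = max_key + 1  # invariant: no non-empty bucket below min_key
        self.size = 0

    def push(self, key, item):
        self.buckets[key].append(item)
        if key < self.min_key:
            self.min_key = key
        self.size += 1

    def pop_min(self):
        if self.size == 0:
            raise IndexError("pop from empty BucketQueue")
        # Skip empty buckets; when popped keys never decrease, this scan
        # costs O(max_key) in total across all pops.
        while not self.buckets[self.min_key]:
            self.min_key += 1
        item = self.buckets[self.min_key].pop()  # LIFO among equal keys
        self.size -= 1
        return self.min_key, item


queue = BucketQueue(max_key=10)
for key, name in [(3, "c"), (1, "a"), (2, "b")]:
    queue.push(key, name)
first = queue.pop_min()
```

Insertion is a constant-time list append, and no comparison-based reordering is ever needed, which is what makes this family of queues attractive when heuristic values are small integers.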
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation AAAI 2026
Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
comment: Accepted at AAAI 2026 AI Alignment Track, Source code: https://github.com/HahmDY/agentic-ft-safety
RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models
Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches. We open-sourced our Pallas kernels along with model codes to facilitate further research effort.
comment: 9 pages
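The limitation of local attention named in the abstract above, its disregard for tokens outside the window, can be made concrete with a causal sliding-window mask. This is a minimal sketch under the assumption that the window counts the query token itself. It does not implement the linear attention component of RATTENTION, and the function name is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k.

    Assumption: causal attention where the window includes the query token,
    so position q sees keys q - window + 1 through q.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=2)
# Keys that the last query position cannot see under pure local attention.
ignored = [k for k, visible in enumerate(mask[4]) if not visible]
```

Every key position listed in `ignored` is invisible to a pure local layer, and the smaller the window, the longer that list becomes for late positions.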
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders NeurIPS 2025
Sparse Autoencoders (SAEs) aim to decompose the activation space of large language models (LLMs) into human-interpretable latent directions or features. As we increase the number of features in the SAE, hierarchical features tend to split into finer features ("math" may split into "algebra", "geometry", etc.), a phenomenon referred to as feature splitting. However, we show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get "absorbed" into their children features. We coin this phenomenon feature absorption, and show that it is caused by optimizing for sparsity in SAEs whenever the underlying features form a hierarchy. We introduce a metric to detect absorption in SAEs, and validate our findings empirically on hundreds of LLM SAEs. Our investigation suggests that varying SAE sizes or sparsity is insufficient to solve this issue. We discuss the implications of feature absorption in SAEs and some potential approaches to solve the fundamental theoretical issues before SAEs can be used for interpreting LLMs robustly and at scale.
comment: Accepted at NeurIPS 2025 (Oral)
REIC: RAG-Enhanced Intent Classification at Scale EMNLP 2025
Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.
comment: Accepted by EMNLP 2025 (Industry Track)
QuanTaxo: A Quantum Approach to Self-Supervised Taxonomy Expansion
A taxonomy is a hierarchical graph containing knowledge to provide valuable insights for various web applications. However, the manual construction of taxonomies requires significant human effort. As web content continues to expand at an unprecedented pace, existing taxonomies risk becoming outdated, struggling to incorporate new and emerging information effectively. As a consequence, there is a growing need for dynamic taxonomy expansion to keep them relevant and up-to-date. Existing taxonomy expansion methods often rely on classical word embeddings to represent entities. However, these embeddings fall short of capturing hierarchical polysemy, where an entity's meaning can vary based on its position in the hierarchy and its surrounding context. To address this challenge, we introduce QuanTaxo, a quantum-inspired framework for taxonomy expansion that encodes entities in a Hilbert space and models interference effects between them, yielding richer, context-sensitive representations. Comprehensive experiments on five real-world benchmark datasets show that QuanTaxo significantly outperforms classical embedding models, achieving substantial improvements of 12.3% in accuracy, 11.2% in Mean Reciprocal Rank (MRR), and 6.9% in Wu & Palmer (Wu&P) metrics across nine classical embedding-based baselines.
Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis
Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.
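The two co-occurrence measures named in the abstract above have standard definitions: Lift is p(a, b) / (p(a) p(b)) and Pointwise Mutual Information is its base-2 logarithm. The sketch below computes both from a list of recipes. The ingredient names are invented toy data and the function name is an assumption, not the authors' code.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(recipes):
    """Return Lift and PMI for every ingredient pair that co-occurs at least once."""
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # deduplicate, fix a canonical pair order
        single.update(items)
        pair.update(combinations(items, 2))
    scores = {}
    for (a, b), count in pair.items():
        lift = (count / n) / ((single[a] / n) * (single[b] / n))
        scores[(a, b)] = {"lift": lift, "pmi": math.log2(lift)}
    return scores


# Hypothetical toy recipes, not drawn from the dataset.
toy_recipes = [
    ["feta", "pepper"],
    ["feta", "pepper"],
    ["feta", "rice"],
    ["bean", "rice"],
]
scores = cooccurrence_scores(toy_recipes)
```

A Lift above 1 (PMI above 0) means the pair appears together more often than independence would predict. Both measures are known to inflate for rare items, so a minimum frequency cutoff is often applied in practice.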
SciAgent: A Unified Multi-Agent System for Generalistic Scientific Reasoning
Recent advances in large language models have enabled AI systems to achieve expert-level performance on domain-specific scientific tasks, yet these systems remain narrow and handcrafted. We introduce SciAgent, a unified multi-agent system designed for generalistic scientific reasoning, that is, the ability to adapt reasoning strategies across disciplines and difficulty levels. SciAgent organizes problem solving as a hierarchical process: a Coordinator Agent interprets each problem's domain and complexity, dynamically orchestrating specialized Worker Systems, each composed of interacting reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification. These agents collaboratively assemble and refine reasoning pipelines tailored to each task. Across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), SciAgent consistently attains or surpasses human gold-medalist performance, demonstrating both domain generality and reasoning adaptability. Additionally, SciAgent has been tested on the International Chemistry Olympiad (IChO) and selected problems from the Humanity's Last Exam (HLE) benchmark, further confirming the system's ability to generalize across diverse scientific domains. This work establishes SciAgent as a concrete step toward generalistic scientific intelligence: AI systems capable of coherent, cross-disciplinary reasoning at expert levels.
comment: 1. To ensure result rigor, the model outputs require further evaluation by human experts. 2. The results may affect our conclusions and methods, thus necessitating a more detailed review. 3. We anticipate subsequent revisions may be substantial, potentially involving major adjustments to the methodology. Given the uncertainty surrounding the revision process, we decide to request a withdrawal
Simultaneous Machine Translation with Large Language Models ALT
Real-world simultaneous machine translation (SimulMT) systems face more challenges than just the quality-latency trade-off. They also need to address issues related to robustness with noisy input, processing long contexts, and flexibility for knowledge injection. These challenges demand models with strong language understanding and generation capabilities, which dedicated MT models may not possess. In this paper, we investigate the possibility of applying Large Language Models (LLMs) to SimulMT tasks by using existing incremental-decoding methods with a newly proposed RALCP algorithm for latency reduction. We conducted experiments using the Llama2-7b-chat model on nine different languages from the MUST-C dataset. The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics. Further analysis indicates that the LLM has advantages in terms of tuning efficiency and robustness. However, it is important to note that the computational cost of LLMs remains a significant obstacle to their application in SimulMT.
comment: Accepted to ALTA 2024
NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty ECAI 2025
Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.
comment: 10 pages, 2 figures, presented at ECAI 2025 at the 2nd International Workshop on AI in Society, Education and Educational Research (AISEER)
Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models
Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework to enhance the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of LLM in translation quality while achieving comparable computational latency to specialized SimulMT models.
comment: Accepted to IWSLT 2025
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query EMNLP 2025
Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 $\sim$ 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
comment: Accepted by EMNLP 2025 Main
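The eviction step described in the abstract above scores cached tokens by the attention they receive from an observation window of queries and keeps only the highest-scoring ones. The sketch below illustrates that generic selection step under stated assumptions: importance is taken to be the sum of attention weights over the observation queries, and the attention rows are hand-written toy values. How LAQ generates its pseudo lookahead queries is not shown.

```python
def select_kv_to_keep(attn_rows, budget):
    """Return the sorted indices of cached tokens to retain.

    attn_rows: one list of attention weights per observation query,
               each of length equal to the number of cached tokens.
    budget:    number of KV entries to keep.
    Assumption: a token's importance is the sum of the attention it
    receives across all observation queries.
    """
    num_keys = len(attn_rows[0])
    importance = [sum(row[j] for row in attn_rows) for j in range(num_keys)]
    ranked = sorted(range(num_keys), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical attention weights from two observation queries over four cached tokens.
toy_attention = [
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.5, 0.0, 0.3],
]
kept = select_kv_to_keep(toy_attention, budget=2)
```

The quality of the result depends entirely on which queries populate `attn_rows`, which is the gap between prefilling-stage and decoding-stage queries that the abstract highlights.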
The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations
Existing datasets available for crosslinguistic investigations have tended to focus on large amounts of data for a small group of languages or a small amount of data for a large number of languages. This means that claims based on these datasets are limited in what they reveal about universal properties of the human language faculty. While this has begun to change through the efforts of projects seeking to develop tagged corpora for a large number of languages, such efforts are still constrained by limits on resources. The current paper reports on a large tagged parallel dataset which has been developed to partially address this issue. The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates, dwarfing previously available resources. The accuracy of particular tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) as well as hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of intransitive word order in three typological databases (WALS, Grambank, Autotyp) such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic intransitive word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step to enable corpus-based crosslinguistic investigations, and is made available for research and collaboration via GitHub.
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers NeurIPS 2025
As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and uses an early-exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 52% and 34% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench, RepoEval, and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
comment: NeurIPS 2025
RAG-R1: Incentivizing the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Large Language Models (LLMs), despite their remarkable capabilities, are prone to generating hallucinated or outdated content due to their static internal knowledge. While Retrieval-Augmented Generation (RAG) integrated with Reinforcement Learning (RL) offers a solution, these methods are fundamentally constrained by a single-query mode, leading to prohibitive latency and inherent brittleness. To overcome these limitations, we introduce RAG-R1, a novel two-stage training framework centered around multi-query parallelism. Our framework enables LLMs to adaptively leverage internal and external knowledge during the reasoning process while transitioning from the single-query mode to multi-query parallelism. This architectural shift bolsters reasoning robustness while significantly reducing inference latency. Extensive experiments on seven question-answering benchmarks confirm the superiority of our method, which outperforms the strongest baseline by up to 13.7% and decreases inference time by 11.1%.
Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks. However, they remain exposed to jailbreak attacks, eliciting harmful responses. The nested scenario strategy has been increasingly adopted across various methods, demonstrating immense potential. Nevertheless, these methods are easily detectable due to their prominent malicious intentions. In this work, we are the first to find and systematically verify that LLMs' alignment defenses are not sensitive to nested scenarios, where these scenarios are highly semantically relevant to the queries and incorporate targeted toxic knowledge. This is a crucial yet insufficiently explored direction. Based on this, we propose RTS-Attack (Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge), an adaptive and automated framework to examine LLMs' alignment. By building scenarios highly relevant to the queries and integrating targeted toxic knowledge, RTS-Attack bypasses the alignment defenses of LLMs. Moreover, the jailbreak prompts generated by RTS-Attack are free from harmful queries, leading to outstanding concealment. Extensive experiments demonstrate that RTS-Attack exhibits superior performance in both efficiency and universality compared to the baselines across diverse advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro. Our complete code is available at https://github.com/nercode/Work. WARNING: THIS PAPER CONTAINS POTENTIALLY HARMFUL CONTENT.
Accelerated Test-Time Scaling with Model-Free Speculative Sampling EMNLP 2025
Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
comment: EMNLP 2025 Oral
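The STAND abstract above mentions Gumbel-Top-K sampling as one ingredient of stochastic drafting. As a rough illustration of that general trick only (not the authors' optimized implementation), the sketch below draws k distinct token indices without replacement by perturbing each logit with Gumbel noise and keeping the k largest. The function name `gumbel_top_k` and its parameters are hypothetical.

```python
import math
import random


def gumbel_top_k(logits, k, rng=None):
    """Sample k distinct indices without replacement, weighted by softmax(logits).

    Illustrative sketch of the generic Gumbel-Top-K trick; the name and
    signature are assumptions, not taken from the STAND paper.
    """
    if rng is None:
        rng = random.Random()
    if not 0 <= k <= len(logits):
        raise ValueError("k must be between 0 and len(logits)")
    perturbed = []
    for index, logit in enumerate(logits):
        # Gumbel(0, 1) noise is -log(-log(U)) with U uniform in (0, 1).
        u = rng.random()
        while u == 0.0:  # random() may return 0.0; avoid log(0)
            u = rng.random()
        perturbed.append((logit - math.log(-math.log(u)), index))
    # The indices of the k largest perturbed logits form the sample.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]
```

Calling `gumbel_top_k([2.0, 1.0, 0.5, -1.0], 2)` returns two distinct indices, with higher-logit tokens more likely to be picked.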
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention NeurIPS 2025
Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: by dividing the problem into sub-tasks, exploring different strategies concurrently, etc. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate. Our approach allows the LLM instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's memory in the concurrent KV cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's memory. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with shared Key-Value cache out of the box, without additional fine-tuning.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
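The Hogwild! Inference abstract above credits RoPE with avoiding recomputation. One property that plausibly makes this possible is that RoPE rotations compose additively, so a cached vector can be moved to a new position by one extra rotation instead of being re-encoded. The sketch below checks that property on a single 2-D feature pair; the single frequency `theta` and the function name `rotate` are assumptions for illustration, not the paper's engine.

```python
import math


def rotate(pair, position, theta=0.5):
    """Apply a RoPE-style rotation to one 2-D feature pair at a given position.

    theta is a hypothetical single rotation frequency; real models use
    one frequency per feature pair.
    """
    x, y = pair
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


# Encoding a key directly at position 7 ...
direct = rotate((1.0, 2.0), 7)
# ... matches taking the key already encoded at position 3 and shifting it by 4.
shifted = rotate(rotate((1.0, 2.0), 3), 4)
```

Up to floating-point error, `direct` and `shifted` are the same vector, which is why a shift can replace a full re-encoding in this toy setting.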
On the Limitations of Language Targeted Pruning: Investigating the Calibration Language Impact in Multilingual LLM Pruning ACL
Recent advances in large language model (LLM) pruning have shown state-of-the-art (SotA) compression results in post-training and retraining-free settings while maintaining high predictive performance. However, previous research mainly considered calibrating based on English text, despite the multilingual nature of modern LLMs and their frequent use in non-English languages. This analysis paper conducts an in-depth investigation of the performance and internal representation changes associated with pruning multilingual language models for monolingual applications. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques. We further analyze the latent subspaces, pruning masks, and individual neurons within pruned models. Our results reveal that while calibration on the target language effectively retains perplexity and yields high signal-to-noise ratios, it does not consistently improve downstream task performance. Further analysis of internal representations at three different levels highlights broader limitations of current pruning approaches: While they effectively preserve dominant information like language-specific features, this is insufficient to counteract the loss of nuanced, language-agnostic features that are crucial for knowledge retention and reasoning.
comment: Accepted for publication in TACL
SoK: Large Language Model Copyright Auditing via Fingerprinting
The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that compares the distinctive features (i.e., fingerprint) of LLMs to identify whether an LLM is derived from another, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of the emerging field of LLM fingerprinting. We introduce a unified framework and taxonomy that structures the field: white-box methods are classified based on their feature source as static, forward-pass, or backward-pass fingerprinting, while black-box methods are distinguished by their query strategy as either untargeted or targeted. Furthermore, we propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon 7 mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent techniques (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at https://github.com/shaoshuo-ss/LeaFBench.
Aligning Extraction and Generation for Robust Retrieval-Augmented Generation WSDM
Retrieval-augmented generation (RAG) enhances LLMs with external knowledge, yet generation remains vulnerable to retrieval-induced noise and uncertain placement of relevant chunks, often causing hallucinations. We present Ext2Gen, an extract-then-generate framework that strengthens LLMs via joint evidence selection and answer generation, dynamically identifying query-relevant content while suppressing noise, thereby removing the need for any independent pre-generation compression module. Optimized through preference alignment with well-curated pairwise feedback, Ext2Gen produces accurate and faithful answers even under noisy or imprecise retrieval. Experiments demonstrate that it substantially enhances the robustness of the generation backbone and yields greater performance gains than methods relying on independent compression models (e.g., Recomp, CompAct, EXIT). It further benefits from improved retrieval techniques such as query rewriting, underscoring that generation-side enhancements address limitations that retrieval alone cannot overcome.
comment: Accepted at ACM International Conference on Web Search and Data Mining (WSDM) 2026
Is Our Chatbot Telling Lies? Assessing Correctness of an LLM-based Dutch Support Chatbot
Companies support their customers using live chats and chatbots to gain their loyalty. AFAS is a Dutch company aiming to leverage the opportunity large language models (LLMs) offer to answer customer queries with minimal to no input from its customer support team. Adding to the complexity, it is unclear what makes a response correct, particularly in Dutch. Further, with minimal data available for training, the challenge is to identify whether an answer generated by a large language model is correct and to do so on the fly. This study is the first to define the correctness of a response based on how the support team at AFAS makes decisions. It leverages literature on natural language generation and automated answer grading systems to automate the decision-making of the customer support team. We investigated questions requiring a binary response (e.g., Would it be possible to adjust tax rates manually?) or instructions (e.g., How would I adjust tax rate manually?) to test how closely our automated approach matches the support team's ratings. Our approach can identify wrong messages in 55% of the cases. This work demonstrates the potential for automatically assessing when our chatbot may provide incorrect or misleading answers. Specifically, we contribute (1) a definition and metrics for assessing correctness, and (2) suggestions to improve correctness with respect to regional language and question type.
comment: 10 pages + 2 pages references, 4 figures
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression AAAI 2026
Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., "Wait" and "Alternatively") to enhance performance. However, these reflection behaviors can lead to the overthinking problem, where the model generates redundant reasoning steps that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model's generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS's effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS's practical value for efficient reasoning.
comment: Accepted by AAAI 2026
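The CGRS abstract above describes suppressing reflection triggers whenever the model is confident in its current response. A minimal sketch of that idea at the logit level might look as follows. The abstract does not say how certainty is measured, so the `certainty` score, the `threshold` value, and the dictionary-based logit layout are all assumptions made for illustration.

```python
import math


def suppress_reflection(logits, trigger_ids, certainty, threshold=0.9):
    """Return a copy of the logits with trigger tokens masked when certainty is high.

    logits: dict mapping token id -> logit for the next token.
    trigger_ids: ids of reflection triggers such as "Wait" (hypothetical ids).
    certainty, threshold: assumed confidence score and cutoff; not from the paper.
    """
    adjusted = dict(logits)  # leave the caller's logits untouched
    if certainty >= threshold:
        for token_id in trigger_ids:
            if token_id in adjusted:
                # A logit of -inf gives the token zero probability after softmax.
                adjusted[token_id] = -math.inf
    return adjusted
```

Because the function only edits next-token logits, a hook of this kind could sit in front of any sampler without touching model weights, in line with the training-free claim in the abstract.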
Unveiling the Influence of Amplifying Language-Specific Neurons AACL 2025
Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior when deactivated. However, the effect of amplifying them remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate the intervention on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.
comment: Accepted to AACL 2025. Our code and dataset are made available at https://github.com/tauimbz/lang-task-neuron
Multi-Personality Generation of LLMs at Decoding-time WSDM 2026
Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a "free lunch" to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and validates them in parallel via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at https://github.com/Libra117/MPG.
comment: Accepted by WSDM 2026
Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation AAAI 2026
REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.
comment: Accepted by AAAI 2026
PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths
Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known as graph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: https://github.com/BUPT-GAMMA/PathRAG
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory compliance. We introduce MedFact, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show models often determine if text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals the "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which can be exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.
Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System AAAI 2026
State-of-the-art (SOTA) fact-checking systems combat misinformation by employing autonomous LLM-based agents to decompose complex claims into smaller sub-claims, verify each sub-claim individually, and aggregate the partial results to produce verdicts with justifications (explanations for the verdicts). The security of these systems is crucial, as compromised fact-checkers can amplify misinformation, but remains largely underexplored. To bridge this gap, this work introduces a novel threat model against such fact-checking systems and presents Fact2Fiction, the first poisoning attack framework targeting SOTA agentic fact-checking systems. Fact2Fiction employs LLMs to mimic the decomposition strategy and exploit system-generated justifications to craft tailored malicious evidence that compromises sub-claim verification. Extensive experiments demonstrate that Fact2Fiction achieves 8.9%-21.2% higher attack success rates than SOTA attacks across various poisoning budgets and exposes security weaknesses in existing fact-checking systems, highlighting the need for defensive countermeasures.
comment: Accepted by AAAI 2026 (Oral). Code available at: https://trustworthycomp.github.io/Fact2Fiction/
Chain-of-Conceptual-Thought Elicits Daily Conversation in Large Language Models PRICAI 2025
Chain-of-Thought (CoT) is widely applied to enhance LLM capabilities in math, coding, and reasoning tasks. However, its performance is limited for open-domain tasks, where there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose a new prompt-based paradigm called Chain of Conceptual Thoughts (CoCT), which prompts the LLM to first produce a concept tag and then complete the detailed content following that concept. To encourage this hierarchical way of thinking, we implement the concepts with emotions, strategies and topics. We experiment with this paradigm in daily and emotional support conversations, covering tasks with both in-domain and out-of-domain concept settings. Automatic, human, and LLM-based evaluations reveal that CoCT surpasses several prompt-based baselines such as self-refine, ECoT, SoT and RAG, suggesting a promising LLM prompting paradigm for a wider scope of tasks.
comment: PRICAI 2025
Self-Correction Distillation for Structured Data Question Answering AAAI 2026
Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs since small-scale LLMs are prone to errors in generating structured queries. To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs' query-generation and error-correction capabilities to small-scale LLMs. Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on a small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.
comment: Accepted to AAAI 2026
A Survey on Unlearning in Large Language Models
Large Language Models (LLMs) demonstrate remarkable capabilities, but their training on massive corpora poses significant risks from memorized sensitive information. To mitigate these issues and align with legal standards, unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021. First, it introduces a novel taxonomy that categorizes unlearning methods based on the phase of the LLM pipeline at which the intervention occurs. This framework further distinguishes between parameter modification and parameter selection strategies, thus enabling deeper insights and more informed comparative analysis. Second, it offers a multidimensional analysis of evaluation paradigms. For datasets, we compare 18 existing benchmarks from the perspectives of task format, content, and experimental paradigms to offer actionable guidance. For metrics, we move beyond mere enumeration by dividing knowledge memorization metrics into 10 categories to analyze their advantages and applicability, while also reviewing metrics for model utility, robustness, and efficiency. By discussing current challenges and future directions, this survey aims to advance the field of LLM unlearning and the development of secure AI systems.
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks against harmful queries and adversarial attacks. While recent mainstream safety efforts on LRMs, supervised fine-tuning (SFT), improve safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model's attention to its query understanding, which contains important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.
Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping AAAI 2026
The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
comment: Accepted to AAAI 2026 AI Alignment Track
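The test-time policy shaping abstract above combines a reward-trained agent with scenario-action attribute classifiers, trading off reward against alignment without retraining. One simple way to picture such a trade-off is a convex blend of two action distributions, sketched below. The mixing rule, the weight `alpha`, and the `violation_probs` input are assumptions for illustration; the abstract does not specify the exact shaping formula.

```python
def shape_policy(agent_probs, violation_probs, alpha=0.5):
    """Blend an agent's action distribution with an attribute-based one.

    agent_probs: probabilities the reward-trained agent assigns to each action.
    violation_probs: hypothetical classifier estimates, per action, that the
        action violates the chosen ethical attribute.
    alpha: 0 keeps the agent's policy unchanged, 1 follows the classifier only.
    """
    if len(agent_probs) != len(violation_probs):
        raise ValueError("one violation estimate is needed per action")
    # Turn "probability of violation" into a preference for compliant actions.
    safe = [1.0 - v for v in violation_probs]
    total = sum(safe)
    if total == 0:
        # Every action is flagged with certainty: fall back to uniform.
        attribute_policy = [1.0 / len(safe)] * len(safe)
    else:
        attribute_policy = [s / total for s in safe]
    return [(1 - alpha) * p + alpha * q
            for p, q in zip(agent_probs, attribute_policy)]
```

With `agent_probs=[0.7, 0.2, 0.1]` and `violation_probs=[0.9, 0.1, 0.1]`, raising `alpha` moves probability mass away from the first action, which the agent prefers but the classifier flags.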
VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes, and highlights the need for new insights into next-generation speech interactive systems. The evaluation codes and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
comment: This article will serve as an extension of the preceding work, "VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models" (arXiv:2505.15727). Therefore, we have chosen to withdraw to avoid potential duplicate publication. We will update the previously open-sourced paper of VocalBench in several weeks to include the content of VocalBench-zh
T^2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search AAAI 2026
Real-world multimodal misinformation often arises from mixed forgery sources, requiring dynamic reasoning and adaptive verification. However, existing methods mainly rely on static pipelines and limited tool usage, limiting their ability to handle such complexity and diversity. To address this challenge, we propose T^2Agent, a novel misinformation detection agent that incorporates an extensible toolkit with Monte Carlo Tree Search (MCTS). The toolkit consists of modular tools such as web search, forgery detection, and consistency analysis. Each tool is described using standardized templates, enabling seamless integration and future expansion. To avoid inefficiency from using all tools simultaneously, a greedy search-based selector is proposed to identify a task-relevant subset. This subset then serves as the action space for MCTS to dynamically collect evidence and perform multi-source verification. To better align MCTS with the multi-source nature of misinformation detection, T^2Agent extends traditional MCTS with multi-source verification, which decomposes the task into coordinated subtasks targeting different forgery sources. A dual reward mechanism containing a reasoning trajectory score and a confidence score is further proposed to encourage a balance between exploration across mixed forgery sources and exploitation for more reliable evidence. We conduct ablation studies to confirm the effectiveness of the tree search mechanism and tool usage. Extensive experiments further show that T^2Agent consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks, demonstrating its strong potential as a training-free detector.
comment: accepted by AAAI 2026 (Oral)
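The T^2Agent abstract above builds on Monte Carlo Tree Search and speaks of balancing exploration and exploitation over a tool action space. For orientation, the sketch below shows the textbook UCT selection rule that standard MCTS uses for this balance. It is not the paper's extended multi-source variant or its dual reward; the exploration constant `c`, the function names, and the tuple layout are assumptions, and the tool names are borrowed from the abstract's toolkit examples.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Textbook UCT value: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited actions are always tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Pick the child with the highest UCT score.

    children: list of (name, total_reward, visits) tuples (hypothetical layout).
    parent_visits: number of visits to the parent node, at least 1.
    """
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits, c))
    return best[0]
```

For example, given `[("web_search", 3.0, 5), ("forgery_detection", 1.0, 1)]` and `parent_visits=6`, the rule prefers the less-visited `forgery_detection` because its exploration bonus is larger.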
Beyond Chains: Bridging Large Language Models and Knowledge Bases in Complex Question Answering AAAI2026
Knowledge Base Question Answering (KBQA) aims to answer natural language questions using structured knowledge from KBs. While LLM-only approaches offer generalization, they suffer from outdated knowledge, hallucinations, and lack of transparency. Chain-based KG-RAG methods address these issues by incorporating external KBs, but are limited to simple chain-structured questions due to the absence of planning and logical structuring. Inspired by semantic parsing methods, we propose PDRR: a four-stage framework consisting of Predict, Decompose, Retrieve, and Reason. Our method first predicts the question type and decomposes the question into structured triples. It then retrieves relevant information from KBs and guides the LLM as an agent to reason over and complete the decomposed triples. Experimental results demonstrate that PDRR consistently outperforms existing methods across various LLM backbones and achieves superior performance on both chain-structured and non-chain complex questions.
comment: AAAI2026 Main Track
Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.
comment: Upon further review, we realized that the version submitted to arXiv was not the final draft and omits crucial results and discussion. To avoid confusion and ensure the integrity of the record, we request withdrawal and will resubmit once the complete work is ready
Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE
The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. Prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or search stability, and therefore cannot remedy this brittleness in practice. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model's parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. Evaluated on diverse tasks, our methods, which are designed to minimize the textual sharpness gap, yield prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical.
comment: We have identified a critical methodological error in Section 3 of the manuscript, which invalidates the main results; therefore, we request withdrawal for further revision
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning NeurIPS 2025
Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms, such as GRPO and variants like DAPO, suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE) - a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.
comment: NeurIPS 2025 Poster
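The coarse-granularity issue described in the KTAE abstract can be illustrated with a minimal sketch. It assumes a common GRPO-style formulation in which each rollout's reward is normalized by the group mean and standard deviation; the function names and toy numbers are hypothetical, and the sketch does not implement KTAE itself, whose token-importance statistic is not specified in the abstract.

```python
from statistics import mean, pstdev

def rollout_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: one scalar per sampled rollout (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages, token_counts):
    """Every token in a rollout inherits that rollout's single advantage value."""
    return [[a] * n for a, n in zip(advantages, token_counts)]

# Toy group of four rollouts with rule-based 0/1 correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 2, 4, 1]

adv = rollout_advantages(rewards)
token_adv = broadcast_to_tokens(adv, token_counts)
```

In this sketch all tokens of a rollout share one value, which is the limitation a token-level estimator would aim to refine.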
A Human Behavioral Baseline for Collective Governance in Software Projects NeurIPS 2025
We study how open source communities describe participation and control through version-controlled governance documents. Using a corpus of 710 projects with paired snapshots, we parse text into actors, rules, actions, and objects, then group them and measure change with entropy for evenness, richness for diversity, and Jensen-Shannon divergence for drift. Projects define more roles and more actions over time, and these are distributed more evenly, while the composition of rules remains stable. These findings indicate that governance grows by expanding and balancing categories of participation without major shifts in prescriptive force. The analysis provides a reproducible baseline for evaluating whether future AI-mediated workflows concentrate or redistribute authority.
comment: Algorithmic Collective Action Workshop @ NeurIPS 2025. arXiv admin note: text overlap with arXiv:2509.16295
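The three measures named in the governance abstract (entropy for evenness, richness for diversity, Jensen-Shannon divergence for drift) can be sketched as follows. This is an illustrative reading, not the authors' pipeline: normalizing entropy by the log of the category count is an assumption, and the role labels are hypothetical.

```python
import math
from collections import Counter

def distribution(labels):
    """Relative frequency of each category label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def richness(p):
    """Number of distinct categories with non-zero mass."""
    return sum(1 for v in p.values() if v > 0)

def shannon_entropy(p):
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def evenness(p):
    """Entropy normalized to [0, 1] by log2(richness); an assumed normalization."""
    k = richness(p)
    return shannon_entropy(p) / math.log2(k) if k > 1 else 0.0

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two snapshots."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

# Hypothetical role labels parsed from two snapshots of one project.
before = distribution(["maintainer", "maintainer", "contributor"])
after = distribution(["maintainer", "contributor", "reviewer"])
drift = js_divergence(before, after)
```

With base-2 logarithms the divergence lies between 0 (identical snapshots) and 1 (no shared categories).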
You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures AAAI'26
Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval-Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
comment: This work has been accepted to AAAI'26
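The linearization step described in the LogicRAG abstract (a DAG of subproblems ordered by topological sort) can be sketched with Python's standard `graphlib` module. The subproblem strings and the `linearize` helper are hypothetical illustrations; how LogicRAG itself decomposes a query is not shown here.

```python
from graphlib import TopologicalSorter

def linearize(dependencies):
    """Return subproblems in an order where every prerequisite comes first.

    `dependencies` maps each subproblem to the set of subproblems it depends on.
    graphlib raises CycleError if the dependencies do not form a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical decomposition of a multi-hop question.
dependencies = {
    "who directed film X": set(),
    "where was that director born": {"who directed film X"},
    "what is the population of that city": {"where was that director born"},
}

order = linearize(dependencies)
```

Each subproblem can then be answered in sequence, with earlier answers available as context for later retrieval.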
Magellan: Guided MCTS for Latent Space Exploration and Novelty Generation
Large Language Models (LLMs) often struggle with generating truly innovative ideas, typically defaulting to high-probability, familiar concepts within their training data's "gravity wells." While advanced search-based methods like Tree of Thoughts (ToT) attempt to mitigate this, they are fundamentally limited by their reliance on unprincipled, inconsistent self-evaluation heuristics to guide exploration. To address this gap, we introduce Magellan, a novel framework that reframes creative generation as a principled, guided exploration of an LLM's latent conceptual space. At its core, Magellan employs Monte Carlo Tree Search (MCTS) governed by a hierarchical guidance system. For long-range direction, a "semantic compass" vector, formulated via orthogonal projection, steers the search towards relevant novelty. For local, step-by-step decisions, a landscape-aware value function replaces flawed self-evaluation with an explicit reward structure that balances intrinsic coherence, extrinsic novelty, and narrative progress. Extensive experiments demonstrate that Magellan significantly outperforms strong baselines, including ReAct and ToT, in generating scientific ideas with superior plausibility and innovation. Our work shows that for creative discovery, a principled, guided search is more effective than unconstrained agency, paving the way for LLMs to become more capable partners in innovation.
comment: Accepted to 1st Open Conference on AI Agents for Science (agents4science 2025)
Theories of "Sexuality" in Natural Language Processing Bias Research
In recent years, significant advancements in the field of Natural Language Processing (NLP) have positioned commercialized language models as wide-reaching, highly useful tools. In tandem, there has been an explosion of multidisciplinary research examining how NLP tasks reflect, perpetuate, and amplify social biases such as gender and racial bias. A significant gap in this scholarship is a detailed analysis of how queer sexualities are encoded and (mis)represented by both NLP systems and practitioners. Following previous work in the field of AI fairness, we document how sexuality is defined and operationalized via a survey and analysis of 55 articles that quantify sexuality-based NLP bias. We find that sexuality is not clearly defined in a majority of the literature surveyed, indicating a reliance on assumed or normative conceptions of sexual/romantic practices and identities. Further, we find that methods for extracting biased outputs from NLP technologies often conflate gender and sexual identities, leading to monolithic conceptions of queerness and thus improper quantifications of bias. With the goal of improving sexuality-based NLP bias analyses, we conclude with recommendations that encourage more thorough engagement with both queer communities and interdisciplinary literature.
comment: 17 pages, 6 tables, 1 figure, undergraduate senior thesis, submitted to The Spectra: The Virginia Engineering and Science Research Journal
FinVet: A Collaborative Framework of RAG and External Fact-Checking Agents for Financial Misinformation Detection
Financial markets face growing threats from misinformation that can trigger billions in losses in minutes. Most existing approaches lack transparency in their decision-making and provide limited attribution to credible sources. We introduce FinVet, a novel multi-agent framework that integrates two Retrieval-Augmented Generation (RAG) pipelines with external fact-checking through a confidence-weighted voting mechanism. FinVet employs adaptive three-tier processing that dynamically adjusts verification strategies based on retrieval confidence, from direct metadata extraction to hybrid reasoning to full model-based analysis. Unlike existing methods, FinVet provides evidence-backed verdicts, source attribution, confidence scores, and explicit uncertainty flags when evidence is insufficient. Experimental evaluation on the FinFact dataset shows that FinVet achieves an F1 score of 0.85, which is a 10.4% improvement over the best individual pipeline (fact-check pipeline) and 37% improvement over standalone RAG approaches.
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion
Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs fall behind the pace. Existing benchmarks often focus on a narrow range of tasks or those that do not demand complex reasoning. To address this gap and enable a more comprehensive evaluation of the long-context reasoning capabilities of current LLMs, we propose a new synthetic benchmark, LongReason, which is constructed by synthesizing long-context reasoning questions from a varied set of short-context reasoning questions through context expansion. LongReason consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories: reading comprehension, logical inference, and mathematical word problems. We evaluate 21 LLMs on LongReason, revealing that most models experience significant performance drops as context length increases. Our further analysis shows that even state-of-the-art LLMs still have significant room for improvement in providing robust reasoning across different tasks. We have open-sourced LongReason under https://huggingface.co/datasets/lz1bytedance/LongReason to support the comprehensive evaluation of LLMs' long-context reasoning capabilities.
Can Machines Think Like Humans? A Behavioral Evaluation of LLM Agents in Dictator Games
As Large Language Model (LLM)-based agents increasingly engage with human society, how well do we understand their prosocial behaviors? We (1) investigate how LLM agents' prosocial behaviors can be induced by different personas and benchmarked against human behaviors; and (2) introduce a social science approach to evaluate LLM agents' decision-making. We explored how different personas and experimental framings affect these AI agents' altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. The findings reveal that merely assigning a human-like identity to LLMs does not produce human-like behaviors. These findings suggest that LLM agents' reasoning does not consistently exhibit textual markers of human decision-making in dictator games and that their alignment with human behavior varies substantially across model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern. As society increasingly integrates machine intelligence, "Prosocial AI" emerges as a promising and urgent research direction in philanthropic studies.
Scaling Latent Reasoning via Looped Language Models
Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. The Ouro 1.4B and 2.6B models match the results of SOTA LLMs of up to 12B parameters across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
Linguistic Structure from a Bottleneck on Sequential Information Processing
Human language has a distinct systematic structure, where utterances break into individually meaningful words which are combined to form phrases. We show that natural-language-like systematicity arises in codes that are constrained by a statistical measure of complexity called predictive information, also known as excess entropy. Predictive information is the mutual information between the past and future of a stochastic process. In simulations, we find that such codes break messages into groups of approximately independent features which are expressed systematically and locally, corresponding to words and phrases. Next, drawing on crosslinguistic text corpora, we find that actual human languages are structured in a way that reduces predictive information compared to baselines at the levels of phonology, morphology, syntax, and lexical semantics. Our results establish a link between the statistical and algebraic structure of language and reinforce the idea that these structures are shaped by communication under general cognitive constraints.
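For reference, the definition of predictive information stated in the abstract above can be written out as follows. The notation is an assumption introduced here: $X_t$ denotes the symbol at position $t$ of a stationary process and $H(n)$ the entropy of a block of $n$ consecutive symbols.

```latex
E \;=\; I\big(X_{\text{past}};\, X_{\text{future}}\big)
  \;=\; \lim_{n \to \infty} I\big(X_{-n:-1};\, X_{0:n-1}\big)
  \;=\; \lim_{n \to \infty} \big[\, 2H(n) - H(2n) \,\big]
```

The last equality uses $I(A;B) = H(A) + H(B) - H(A,B)$ together with stationarity, so that both length-$n$ blocks have entropy $H(n)$ and their concatenation has entropy $H(2n)$.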
LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance; for example, search improves Gemini's accuracy by 13.6% but reduces GPT-series performance by 11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning
A hallmark of advanced artificial intelligence is the capacity to progress from passive visual perception to the strategic modification of visual information to facilitate complex reasoning. This advanced capability, however, remains critically underdeveloped in current Large Multi-modal Models (LMMs). The deficiency is often masked by evaluation metrics that prioritize final-answer accuracy, creating an illusion of competence where genuine reasoning is absent. Using the domain of geometric problem-solving as a precise instrument, we probe this issue through tasks that require constructing visual aids. To this end, we introduce VisAidMath, a challenging benchmark, and our novel Three-Layered Funnel Evaluation Framework. This framework moves beyond simple accuracy (ACCU) to scrutinize the generation of valid visual aids (PVA) and the soundness of subsequent reasoning steps (SPRS). Our extensive experiments on state-of-the-art models, including Doubao-Seed-1.6 and o4, reveal a profound "Reasoning Illusion". We observe that high surface-level accuracy conceals a catastrophic failure in the models' ability to produce valid visual aids or to reason from them. Our findings expose a fundamental schism between visual perception and logical deduction in modern LMMs. We host an evaluation platform at CodaBench for testing publicly. Homepage: https://nlp2ct.github.io/VisAidMathHomepage/ Evaluation: https://www.codabench.org/competitions/7634/
comment: 58 pages, 28 figures
Who Gets the Reward, Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents
Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent-level and message-level learning. We propose a theoretical framework that unifies cooperative game-theoretic attribution with process reward modeling to transform system evaluation into agent credit and then into response-level signals. Unlike prior approaches that rely only on attribution (e.g., Shapley) or step-level labels (e.g., PRM), our method produces local, signed, and credit-conserving signals. In success cases, Shapley-based credit assignment fairly allocates outcomes across agents and is refined into per-message rewards that promote cooperation while discouraging redundancy or sabotage. In failure cases, first-error localization yields repair-aware preferences that penalize harmful steps while rewarding corrective attempts. The resulting signals are bounded, cooperative, and directly compatible with reinforcement-based or preference-based post-training, providing a unified and auditable pathway from global evaluation to local supervision in LLM multi-agent training. Our contribution is conceptual: we present a theoretical foundation and training signals, leaving empirical validation for future work.
comment: Withdrawing temporarily to coordinate revisions with co-authors. A revised version will be resubmitted
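The Shapley-based credit assignment mentioned in the multi-agent abstract above can be illustrated with a minimal sketch of exact Shapley values over a small team. The agent names and the toy coalition scores are hypothetical, and this shows only the textbook attribution step, not the paper's refinement into per-message rewards.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all join orders."""
    totals = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for agent in order:
            extended = coalition | {agent}
            totals[agent] += value(extended) - value(coalition)
            coalition = extended
    return {a: t / len(orders) for a, t in totals.items()}

# Hypothetical system-level scores for coalitions of planner and coder.
BASE_SCORES = {
    frozenset(): 0.0,
    frozenset({"planner"}): 0.2,
    frozenset({"coder"}): 0.4,
    frozenset({"planner", "coder"}): 1.0,
}

def team_score(coalition):
    """Toy evaluation in which the critic never changes the outcome."""
    return BASE_SCORES[frozenset(coalition) - {"critic"}]

agents = ["planner", "coder", "critic"]
credit = shapley_values(agents, team_score)
```

The credits sum to the full-team score, which is the credit-conserving property the abstract refers to, and an agent that never changes the outcome receives zero. Exact enumeration grows factorially with the number of agents, so larger systems would need sampling-based approximations.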
Human-Computer Interaction
Person-AI Bidirectional Fit - A Proof-Of-Concept Case Study Of Augmented Human-AI Symbiosis In Management Decision-Making Process
This article develops the concept of Person-AI bidirectional fit, defined as the continuously evolving, context-sensitive alignment (primarily cognitive, but also emotional and behavioral) between a human decision-maker and an artificial intelligence system. Grounded in contingency theory and quality theory, the study examines the role of P-AI fit in managerial decision-making through a proof-of-concept case study involving a real hiring process for a Senior AI Lead. Three decision pathways are compared: (1) independent evaluations by a CEO, CTO, and CSO; (2) an evaluation produced by an augmented human-AI symbiotic intelligence system (H3LIX-LAIZA); and (3) an assessment generated by a general-purpose large language model. The results reveal substantial role-based divergence in human judgments, high alignment between H3LIX-LAIZA and the CEO's implicit decision model (including ethical disqualification of a high-risk candidate), and a critical false-positive recommendation from the LLM. The findings demonstrate that higher P-AI fit, exemplified by the CEO-H3LIX-LAIZA relationship, functions as a mechanism linking augmented symbiotic intelligence to accurate, trustworthy, and context-sensitive decisions. The study provides an initial verification of the P-AI fit construct and a proof-of-concept for H3LIX-LAIZA as an augmented human-AI symbiotic intelligence system.
comment: 30 pages, 2 figures
Exploring the Effectiveness of Google Play Store's Privacy Transparency Channels
With the requirements and emphases on privacy transparency placed by regulations such as GDPR and CCPA, the Google Play Store requires Android developers to more responsibly communicate their apps' privacy practices to potential users by providing the proper information via the data safety, privacy policy, and permission manifest privacy transparency channels. However, it is unclear how effective those channels are in helping users make informed decisions in the app selection and installation process. In this article, we conducted a study in which 190 participants interacted with our simulated privacy transparency channels of mobile apps. We quantitatively analyzed (supplemented by qualitative analysis) participants' responses to five sets of questions. We found that data safety provides the most intuitive user interfaces, privacy policy is the most informative and effective, while permission manifest excels at raising participants' concerns about an app's overall privacy risks. These channels complement each other and should all be improved.
FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI
As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks. To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into an interaction-enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work underscores that interaction itself serves as an additional information modality.
comment: 9 pages, 4 figures
A Lexical Analysis of Online Reviews on Human-AI Interactions
This study focuses on understanding the complex dynamics between humans and AI systems by analyzing user reviews. While previous research has explored various aspects of human-AI interaction, such as user perceptions and ethical considerations, there remains a gap in understanding the specific concerns and challenges users face. By using a lexical approach to analyze 55,968 online reviews from G2.com, Producthunt.com, and Trustpilot.com, this preliminary research aims to analyze human-AI interaction. Initial results from factor analysis reveal key factors influencing these interactions. The study aims to provide deeper insights into these factors through content analysis, contributing to the development of more user-centric AI systems. The findings are expected to enhance our understanding of human-AI interaction and inform future AI technology and user experience improvements.
comment: 10 pages, 1 table
The Quick Red Fox gets the best Data Driven Classroom Interviews: A manual for an interview app and its associated methodology
Data Driven Classroom Interviews (DDCIs) are an interviewing technique that is facilitated by recent technological developments in the learning analytics community. DDCIs are short, targeted interviews that allow researchers to contextualize students' interactions with a digital learning environment (e.g., intelligent tutoring systems or educational games) while minimizing the amount of time that the researcher interrupts that learning experience, and focusing researcher time on the events they most want to examine. DDCIs are facilitated by a research tool called the Quick Red Fox (QRF), an open-source server-client Android app that optimizes researcher time by directing interviewers to users that have just displayed an interesting behavior (previously defined by the research team). QRF integrates with existing student modeling technologies (e.g., behavior-sensing, affect-sensing, detection of self-regulated learning) to alert researchers to key moments in a learner's experience. This manual documents the tech while providing training on the processes involved in developing triggers and interview techniques; it also suggests methods of analyses.
Trust in Vision-Language Models: Insights from a Participatory User Workshop
With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants' engagement to fit the case of user-VLM interaction.
Local Collaborative Filtering: A Collaborative Filtering Method that Utilizes Local Similarities among Users
To leverage user behavior data from the Internet more effectively in recommender systems, this paper proposes a novel collaborative filtering (CF) method called Local Collaborative Filtering (LCF). LCF utilizes local similarities among users and integrates their data using the law of large numbers (LLN), thereby improving the utilization of user behavior data. Experiments are conducted on the Steam game dataset, and the results of LCF align with real-world needs.
comment: 4 pages, 2 figures
Agent-Oriented Visual Programming for the Web of Things
In this paper we introduce and discuss an approach for multi-agent-oriented visual programming. This aims at enabling individuals without programming experience but with knowledge in specific target domains to design and (re)configure autonomous software. We argue that, compared to procedural programming, it should be simpler for users to create programs when agent abstractions are employed. The underlying rationale is that these abstractions, and specifically the belief-desire-intention architecture that is aligned with human practical reasoning, match more closely with people's everyday experience in interacting with other agents and artifacts in the real world. On top of this, we designed and implemented a visual programming system for agents that hides the technicalities of agent-oriented programming using a blocks-based visual development environment that is built on the JaCaMo platform. To further validate the proposed solution, we integrate the Web of Things (WoT) to let users create autonomous behaviour on top of physical mashups of devices, following the trends in industrial end-user programming. Finally, we report on a pilot user study where we verified that novice users are indeed able to make use of this development environment to create multi-agent systems to solve simple automation tasks.
comment: Accepted and presented at the 10th International Workshop on Engineering Multi-Agent Systems (EMAS 2022), 9-10 May 2022, Auckland, New Zealand
F.A.C.U.L.: Language-Based Interaction with AI Companions in Gaming
In cooperative video games, traditional AI companions are deployed to assist players, who control them using hotkeys or command wheels to issue predefined commands such as "attack", "defend", or "retreat". Despite their simplicity, these methods, which lack target specificity, limit players' ability to give complex tactical instructions and hinder immersive gameplay experiences. To address this problem, we propose the FPS AI Companion who Understands Language (F.A.C.U.L.), the first real-time AI system that enables players to communicate and collaborate with AI companions using natural language. By integrating natural language processing with a confidence-based framework, F.A.C.U.L. efficiently decomposes complex commands and interprets player intent. It also employs a dynamic entity retrieval method for environmental awareness, aligning human intentions with decision-making. Unlike traditional rule-based systems, our method supports real-time language interactions, enabling players to issue complex commands such as "clear the second floor", "take cover behind that tree", or "retreat to the river". The system provides real-time behavioral responses and vocal feedback, ensuring seamless tactical collaboration. Using the popular FPS game Arena Breakout: Infinite as a case study, we present comparisons demonstrating the efficacy of our approach and discuss the advantages and limitations of AI companions based on real-world user feedback.
comment: 14 pages, 11 figures,
Knowing Ourselves Through Others: Reflecting with AI in Digital Human Debates
LLMs can act as an impartial other, drawing on vast knowledge, or as personalized self-reflecting user prompts. These personalized LLMs, or Digital Humans, occupy an intermediate position between self and other. This research explores the dynamic of self and other mediated by these Digital Humans. Using a Research Through Design approach, nine junior and senior high school students, working in teams, designed Digital Humans and had them debate. Each team built a unique Digital Human using prompt engineering and RAG, then observed their autonomous debates. Findings from generative AI literacy tests, interviews, and log analysis revealed that participants deepened their understanding of AI's capabilities. Furthermore, experiencing their own creations as others prompted a reflective attitude, enabling them to objectively view their own cognition and values. We propose "Reflecting with AI" - using AI to re-examine the self - as a new generative AI literacy, complementing the conventional understanding, applying, criticism and ethics.
Design and Evaluation of an AI-Driven Personalized Mobile App to Provide Multifaceted Health Support for Type 2 Diabetes Patients in China
Type 2 diabetes patients in China face many significant challenges in patient-provider communication and self-management. In light of this, this work designed, implemented, and evaluated an AI-driven, personalized, multi-functional mobile app system named T2MD Health. The app integrates real-time patient-provider conversation transcription, medical terminology interpretation, daily health tracking, and a data-driven feedback loop. We conducted qualitative interviews with 40 participants to study key user needs before system development and a mixed-method controlled experiment with 60 participants after to evaluate the effectiveness and usability of the app. Evaluation results showed that the app was effective in improving patient-provider communication efficiency, patient understanding and knowledge retention, and patient self-management. Patient feedback also revealed that the app has the potential to address the urban-rural gap in access to medical consultation services to some extent. Findings of this study could inform future studies that seek to utilize mobile apps and artificial intelligence to support patients with chronic diseases.
Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
comment: 18 pages, 10 figures; to appear in AAAI ICWSM 2026
Rethinking the filter bubble? Developing a research agenda for the protective filter bubble
Filter bubbles and echo chambers have received global attention from scholars, media organizations, and the general public. Filter bubbles have primarily been regarded as intrinsically negative, and many studies have sought to minimize their influence. The detrimental influence of filter bubbles is well-studied. Filter bubbles may, for example, create information silos, amplify misinformation, and promote hatred and extremism. However, comparatively few studies have considered the other side of the filter bubble: its protective benefits, particularly to marginalized communities and those living in countries with low levels of press freedom. Through a review of the literature on digital safe spaces and protective filter bubbles, this commentary suggests that there may be a need to rethink the filter bubble, and it proposes several areas for future research.
comment: This work has been published in Big Data & Society. Please cite the journal version
SoK: Synthesizing Smart Home Privacy Protection Mechanisms Across Academic Proposals and Commercial Documentations
Pervasive data collection by Smart Home Devices (SHDs) demands robust Privacy Protection Mechanisms (PPMs). The effectiveness of many PPMs, particularly user-facing controls, depends on user awareness and adoption, which are shaped by manufacturers' public documentations. However, the landscape of academic proposals and commercial disclosures remains underexplored. To address this gap, we investigate: (1) What PPMs have academics proposed, and how are these PPMs evaluated? (2) What PPMs do manufacturers document, and what factors affect this documentation? To address these questions, we conduct a two-phase study, synthesizing a systematic review of 117 academic papers with an empirical analysis of 86 SHDs' publicly disclosed documentations. Our review of academic literature reveals a strong focus on novel system- and algorithm-based PPMs. However, these proposals neglect deployment barriers (e.g., cost, interoperability), and lack real-world field validation and legal analysis. Concurrently, our analysis of commercial SHDs finds that advanced academic proposals are absent from public discourse. Industry postures are fundamentally reactive, prioritizing compliance via post-hoc data management (e.g., deletion options), rather than the preventative controls favored by academia. The documented protections correspondingly converge on a small set of practical mechanisms, such as physical buttons and localized processing. By synthesizing these findings, we advocate for research to analyze challenges, provide deployable frameworks, real-world field validation, and interoperability solutions to advance practical PPMs.
Exploring the Use of ChatGPT by Computer Science Students in Software Development: Applications, Ethical Considerations, and Insights for Engineering Education
ChatGPT has been increasingly used in computer science, offering efficient support across software development tasks. While it helps students navigate programming challenges, its use also raises concerns about academic integrity and overreliance. Despite growing interest in this topic, prior research has largely relied on surveys, emphasizing trends over in-depth analysis of students' strategies and ethical awareness. This study complements existing work through a qualitative investigation of how computer science students in one UK institution strategically and ethically engage with ChatGPT in software development projects. Drawing on semi-structured interviews, it explores two key questions: How do computer science students ethically and strategically report using ChatGPT in software development projects? How do students understand and perceive the ethical issues associated with using ChatGPT in academic and professional contexts? Findings reveal a shift in students' learning models, moving from traditional "independent thinking-manual coding-iterative debugging" to "AI-assisted ideation-interactive programming-collaborative optimization." Importantly, many use ChatGPT conversationally to deepen understanding, while consciously reserving creative and high-level decision-making tasks for themselves. Students tend to cap ChatGPT's contribution to roughly 30%, and evaluate its output to mitigate overreliance. However, only a minority thoroughly analyze AI-generated code, raising concerns about reduced critical engagement. Meanwhile, students reject uncredited use, highlight risks such as privacy breaches and skill degradation, and call for clear usage guidelines set by their teachers. This research offers novel insights into the evolving learner-AI dynamic and highlights the need for explicit guidance to support responsible and pedagogically sound use of such tools.
comment: Full paper oral presentation at the European Society for Engineering Education (SEFI) 2025 Annual Conference (September 2025)
Personality Pairing Improves Human-AI Collaboration
Here we ask how AI agent "personalities" interact with human personalities, and other traits, to shape human-AI collaboration, productivity and performance. To estimate these relationships, we conducted a large-scale preregistered randomized experiment that paired 1,258 participants with AI agents that were prompted to exhibit varying levels of the Big Five personality traits. These human-AI teams produced 7,266 display ads for a real think tank, and the quality of these ads was evaluated by 1,995 independent human raters as well as in a field experiment conducted on X, which generated nearly 5 million impressions. We found, first, that personality pairing impacted teamwork quality. For example, neurotic AI improved teamwork for agreeable humans but impaired it for conscientious humans. Second, we found productivity effects of personality pairing and a "productivity-performance trade-off" in which certain pairings (e.g., agreeable human with neurotic AI) produced fewer ads but of higher quality. Third, personality pairing influenced ad quality and performance. For example, quality improved when open humans were paired with conscientious AI and when conscientious humans were paired with disagreeable AI. Some of these pairing effects were "jagged" in that they varied across text and visual tasks. For example, open humans produced higher quality images but lower quality text when paired with agreeable AI. Pairing effects were also present in other human traits, like country of origin. For example, extroverted AI improved quality for Latin American workers, but degraded quality for East Asian workers. These findings demonstrate that human-AI personality alignment significantly improves collaboration, productivity, and performance and lay a foundation for future research on improving human-AI collaboration through AI personalization.
comment: 28 pages, 7 figures
Human-centric Maintenance Process Through Integration of AI, Speech, and AR
The adoption of Augmented Reality (AR) is increasing to enhance Human-System Interaction (HSI) by creating immersive experiences that improve efficiency and safety in various industries. In industrial maintenance, traditional practices involve physical documentation and device interactions, which might disrupt the task, affect efficiency, and increase the cognitive load for the maintenance personnel. AR has the potential to support and enhance industrial maintenance processes in these aspects. Therefore, the purpose of this research is to study and explore how advanced technologies like Artificial Intelligence (AI), AR and speech processing can be integrated to support hands-free, real-time task logging and interaction in maintenance environments. This is done by developing a demonstrator for Microsoft HoloLens 2 using Unity, C#, Azure Cognitive Services, and Azure Functions, which enables speech-to-text transcription for hands-free maintenance support. Using Azure's speech recognition, the demonstrator can achieve high transcription accuracy in an AR environment, facilitating natural interactions between users and the augmented environment. The study aims to explore the potential of AR to reduce cognitive load, streamline workflows, and improve safety by enhancing HSI for maintenance personnel in high-stakes environments.
Enhancing Decision Support in Construction through Industrial AI
The construction industry is presently going through a transformation led by adopting digital technologies that leverage Artificial Intelligence (AI). These industrial AI solutions assist in various phases of the construction process, including planning, design, production and management. In particular, the production phase offers unique potential for the integration of such AI-based solutions. These AI-based solutions assist site managers, project engineers, coordinators and other key roles in making final decisions. To facilitate the decision-making process in the production phase of construction through a human-centric AI-based solution, it is important to understand the needs and challenges faced by the end users who interact with these AI-based solutions to enhance the effectiveness and usability of these systems. Without this understanding, the potential usage of these AI-based solutions may be limited. Hence, the purpose of this research study is to explore, identify and describe the key factors crucial for developing AI solutions in the construction industry. This study further identifies the correlation between these key factors. This was done by developing a demonstrator and collecting quantifiable feedback through a questionnaire targeting the end users, such as site managers and construction professionals. This research study will offer insights into developing and improving these industrial AI solutions, focusing on Human-System Interaction aspects to enhance decision support, usability, and overall AI solution adoption.
Personalizing Prostate Cancer Education for Patients Using an EHR-Integrated LLM Agent
Cancer patients often lack timely education and personalized support due to clinician workload. This quality improvement study develops and evaluates a Large Language Model (LLM) agent, MedEduChat, which is integrated with the clinic's electronic health records (EHR) and designed to enhance prostate cancer patient education. Fifteen non-metastatic prostate cancer patients and three clinicians recruited from the Mayo Clinic interacted with the agent between May 2024 and April 2025. Findings showed that MedEduChat has a high usability score (UMUX 83.7 out of 100) and improves patients' health confidence (Health Confidence Score rose from 9.9 to 13.9). Clinicians evaluated the patient-chat interaction history and rated MedEduChat as highly correct (2.9 out of 3), complete (2.7 out of 3), and safe (2.7 out of 3), with moderate personalization (2.3 out of 3). This study highlights the potential of LLM agents to improve patient engagement and health education.
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents. The collection of papers reviewed in this survey will be hosted and regularly updated on the GitHub repository: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
comment: Paper accepted to TMLR 2025, Project Homepage: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
An LLM-based Simulation Framework for Embodied Conversational Agents in Psychological Counseling AAAI 2026
Due to privacy concerns, open dialogue datasets for mental health are primarily generated through human or AI synthesis methods. However, the inherent implicit nature of psychological processes, particularly those of clients, poses challenges to the authenticity and diversity of synthetic data. In this paper, we propose ECAs (short for Embodied Conversational Agents), a framework for embodied agent simulation based on Large Language Models (LLMs) that incorporates multiple psychological theoretical principles. Using simulation, we expand real counseling case data into a nuanced embodied cognitive memory space and generate dialogue data based on high-frequency counseling questions. We validated our framework using the D4 dataset. First, we created a public ECAs dataset through batch simulations based on D4. Licensed counselors evaluated our method, demonstrating that it significantly outperforms baselines in simulation authenticity and necessity. Additionally, two LLM-based automated evaluation methods were employed to confirm the higher quality of the generated dialogues compared to the baselines. The source code and dataset are available at https://github.com/AIR-DISCOVER/ECAs-Dataset.
comment: Accepted to AAAI 2026
Identify As A Human Does: A Pathfinder of Next-Generation Anti-Cheat Framework for First-Person Shooter Games
The gaming industry has experienced substantial growth, but cheating in online games poses a significant threat to the integrity of the gaming experience. Cheating, particularly in first-person shooter (FPS) games, can lead to substantial losses for the game industry. Existing anti-cheat solutions have limitations, such as client-side hardware constraints, security risks, and unreliable server-side methods, and both client-side and server-side approaches suffer from a lack of comprehensive real-world datasets. To address these limitations, the paper proposes HAWK, a server-side FPS anti-cheat framework for the popular game CS:GO. HAWK utilizes machine learning techniques to mimic human experts' identification process, leverages novel multi-view features, and is equipped with a well-defined workflow. The authors evaluate HAWK with the first large and real-world datasets containing multiple cheat types and cheating sophistication, and it exhibits promising efficiency and acceptable overheads, shorter ban times compared to the in-use anti-cheat, a significant reduction in manual labor, and the ability to capture cheaters who evaded official inspections.
Image and Video Processing
MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection
Camouflaged object detection is an emerging and challenging computer vision task that requires identifying and segmenting objects that blend seamlessly into their environments due to high similarity in color, texture, and size. This task is further complicated by low-light conditions, partial occlusion, small object size, intricate background patterns, and multiple objects. While many sophisticated methods have been proposed for this task, current methods still struggle to precisely detect camouflaged objects in complex scenarios, especially with small and multiple objects, indicating room for improvement. We propose a Multi-Scale Recursive Network that extracts multi-scale features via a Pyramid Vision Transformer backbone and combines them via specialized Attention-Based Scale Integration Units, enabling selective feature merging. For more precise object detection, our decoder recursively refines features by incorporating Multi-Granularity Fusion Units. A novel recursive-feedback decoding strategy is developed to enhance global context understanding, helping the model overcome the challenges in this task. By jointly leveraging multi-scale learning and recursive feature optimization, our proposed method achieves performance gains, successfully detecting small and multiple camouflaged objects. Our model achieves state-of-the-art results on two benchmark datasets for camouflaged object detection and ranks second on the remaining two. Our codes, model weights, and results are available at https://github.com/linaagh98/MSRNet.
Improving the Generalisation of Learned Reconstruction Frameworks
Ensuring proper generalization is a critical challenge in applying data-driven methods for solving inverse problems in imaging, as neural networks reconstructing an image must perform well across varied datasets and acquisition geometries. In X-ray Computed Tomography (CT), convolutional neural networks (CNNs) are widely used to filter the projection data but are ill-suited for this task as they apply grid-based convolutions to the sinogram, which inherently lies on a line manifold, not a regular grid. The CNNs, unaware of the geometry, are implicitly tied to it and require an excessive amount of parameters as they must infer the relations between measurements from the data rather than from prior information. The contribution of this paper is twofold. First, we introduce a graph data structure to represent CT acquisition geometries and tomographic data, providing a detailed explanation of the graph's structure for circular, cone-beam geometries. Second, we propose GLM, a hybrid neural network architecture that leverages both graph and grid convolutions to process tomographic data. We demonstrate that GLM outperforms CNNs when performance is quantified in terms of structural similarity and peak signal-to-noise ratio, despite the fact that GLM uses only a fraction of the trainable parameters. Compared to CNNs, GLM also requires significantly less training time and memory, and its memory requirements scale better. Crucially, GLM demonstrates robust generalization to unseen variations in the acquisition geometry, like when training only on fully sampled CT data and then testing on sparse-view CT data.
comment: 11 pages, 8 figures
Diffusion Algorithm for Metalens Optical Aberration Correction
Metalenses offer a path toward creating ultra-thin optical systems, but they inherently suffer from severe, spatially varying optical aberrations, especially chromatic aberration, which makes image reconstruction a significant challenge. This paper presents a novel algorithmic solution to this problem, designed to reconstruct a sharp, full-color image from two inputs: a sharp, bandpass-filtered grayscale "structure image" and a heavily distorted "color cue" image, both captured by the metalens system. Our method utilizes a dual-branch diffusion model, built upon a pre-trained Stable Diffusion XL framework, to fuse information from the two inputs. We demonstrate through quantitative and qualitative comparisons that our approach significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing the image.
comment: 5 pages, 4 figures
C3Net: Context-Contrast Network for Camouflaged Object Detection
Camouflaged object detection identifies objects that blend seamlessly with their surroundings through similar colors, textures, and patterns. This task challenges both traditional segmentation methods and modern foundation models, which fail dramatically on camouflaged objects. We identify six fundamental challenges in COD: Intrinsic Similarity, Edge Disruption, Extreme Scale Variation, Environmental Complexities, Contextual Dependencies, and Salient-Camouflaged Object Disambiguation. These challenges frequently co-occur and compound the difficulty of detection, requiring comprehensive architectural solutions. We propose C3Net, which addresses all challenges through a specialized dual-pathway decoder architecture. The Edge Refinement Pathway employs gradient-initialized Edge Enhancement Modules to recover precise boundaries from early features. The Contextual Localization Pathway utilizes our novel Image-based Context Guidance mechanism to achieve intrinsic saliency suppression without external models. An Attentive Fusion Module synergistically combines the two pathways via spatial gating. C3Net achieves state-of-the-art performance with S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K, while maintaining efficient processing. C3Net demonstrates that complex, multifaceted detection challenges require architectural innovation, with specialized components working synergistically to achieve comprehensive coverage beyond isolated improvements. Code, model weights, and results are available at https://github.com/Baber-Jan/C3Net.
FERMI-ML: A Flexible and Resource-Efficient Memory-In-Situ SRAM Macro for TinyML acceleration
The growing demand for low-power and area-efficient TinyML inference on AIoT devices necessitates memory architectures that minimise data movement while sustaining high computational efficiency. This paper presents FERMI-ML, a Flexible and Resource-Efficient Memory-In-Situ (MIS) SRAM macro designed for TinyML acceleration. The proposed 9T XNOR-based RX9T bit-cell integrates a 5T storage cell with a 4T XNOR compute unit, enabling variable-precision MAC and CAM operations within the same array. A 22-transistor (C22T) compressor-tree-based accumulator facilitates logarithmic 1-64-bit MAC computation with reduced delay and power compared to conventional adder trees. The 4 KB macro achieves dual functionality for in-situ computation and CAM-based lookup operations, supporting Posit-4 or FP-4 precision. Post-layout results at 65 nm show operation at 350 MHz with 0.9 V, delivering a throughput of 1.93 TOPS and an energy efficiency of 364 TOPS/W, while maintaining a Quality-of-Result (QoR) above 97.5% with InceptionV4 and ResNet-18. FERMI-ML thus demonstrates a compact, reconfigurable, and energy-aware digital Memory-In-Situ macro capable of supporting mixed-precision TinyML workloads.
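As a rough software illustration of the XNOR-based multiply-accumulate idea mentioned in the abstract above, the sketch below uses the standard XNOR-popcount identity for vectors with entries in {-1, +1}. It is not the FERMI-ML circuit or its variable-precision scheme; the function names and the bit encoding (1 for +1, 0 for -1) are assumptions made for illustration.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer (element i -> bit i).

    The encoding 1 -> +1, 0 -> -1 is an assumption for this sketch.
    """
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed length-n vectors with entries in {-1, +1}."""
    mask = (1 << n) - 1
    # XNOR is 1 wherever the two bits agree, i.e. where the product is +1.
    agree = ~(a_bits ^ b_bits) & mask
    matches = bin(agree).count("1")
    # `matches` products equal +1 and the remaining (n - matches) equal -1.
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = xnor_popcount_dot(pack(a), pack(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

Here `result` and `reference` agree, which is the point of the identity: a binary dot product reduces to one bitwise XNOR and one population count, operations that are cheap to place next to the memory array.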
A Multicollinearity-Aware Signal-Processing Framework for Cross-$β$ Identification via X-ray Scattering of Alzheimer's Tissue
X-ray scattering measurements of in situ human brain tissue encode structural signatures of pathological cross-$β$ inclusions, yet systematic exploitation of these data for automated detection remains challenging due to substrate contamination, strong inter-feature correlations, and limited sample sizes. This work develops a three-stage classification framework for identifying cross-$β$ structural inclusions, a hallmark of Alzheimer's disease, in X-ray scattering profiles of post-mortem human brain. Stage 1 employs a Bayes-optimal classifier to separate mica substrate from tissue regions on the basis of their distinct scattering signatures. Stage 2 introduces a multicollinearity-aware, class-conditional correlation pruning scheme with formal guarantees on the induced Bayes risk and approximation error, thereby reducing redundancy while retaining class-discriminative information. Stage 3 trains a compact neural network on the pruned feature set to detect the presence or absence of cross-$β$ fibrillar ordering. The top-performing model, optimized with a composite loss combining Focal and Dice objectives, attains a test F1-score of 84.30% using 11 of 211 candidate features and 174 trainable parameters. The overall framework yields an interpretable, theory-grounded strategy for data-limited classification problems involving correlated, high-dimensional experimental measurements, exemplified here by X-ray scattering profiles of neurodegenerative tissue.
comment: 19 pages, 4 figures, journal paper under review
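The abstract above mentions a composite loss combining Focal and Dice objectives. A minimal sketch of one common way to form such a loss for binary labels follows. The equal weighting, the `gamma` and `alpha` defaults, and all function names are assumptions for illustration, since the abstract does not specify them.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss; gamma and alpha are assumed defaults."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)        # avoid log(0)
        p_t = p if y == 1 else 1.0 - p         # probability assigned to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # Well-classified examples (p_t near 1) are down-weighted by (1 - p_t)^gamma.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, labels, eps=1e-7):
    """Soft Dice loss: 1 - 2*sum(p*y) / (sum(p) + sum(y))."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(labels) + eps)


def composite_loss(probs, labels, weight=0.5):
    """Weighted sum of the two terms; the 0.5 weight is an assumption."""
    return weight * focal_loss(probs, labels) + (1.0 - weight) * dice_loss(probs, labels)


labels = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]   # mostly correct predictions
bad = [0.2, 0.8, 0.3, 0.9]    # mostly wrong predictions
loss_good = composite_loss(good, labels)
loss_bad = composite_loss(bad, labels)
```

The focal term concentrates the gradient on hard examples, while the Dice term scores overlap with the positive class as a whole, which is why the two are often paired on imbalanced data.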
DEMIST: DEcoupled Multi-stream latent dIffusion for Quantitative Myelin Map SynThesis
Quantitative magnetization transfer (qMT) imaging provides myelin-sensitive biomarkers, such as the pool size ratio (PSR), which is valuable for multiple sclerosis (MS) assessment. However, qMT requires specialized 20-30 minute scans. We propose DEMIST to synthesize PSR maps from standard T1w and FLAIR images using a 3D latent diffusion model with three complementary conditioning mechanisms. Our approach has two stages: first, we train separate autoencoders for PSR and anatomical images to learn aligned latent representations. Second, we train a conditional diffusion model in this latent space on top of a frozen diffusion foundation backbone. Conditioning is decoupled into: (i) semantic tokens via cross-attention, (ii) spatial per-scale residual hints via a 3D ControlNet branch, and (iii) adaptive LoRA-modulated attention. We include edge-aware loss terms to preserve lesion boundaries and alignment losses to maintain quantitative consistency, while keeping the number of trainable parameters low and retaining the inductive bias of the pretrained model. We evaluate on 163 scans from 99 subjects using 5-fold cross-validation. Our method outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth. Our code is publicly available at https://github.com/MedICL-VU/MS-Synthesis-3DcLDM.
Semantic Multiplexing
Mobile devices increasingly require the parallel execution of several computing tasks offloaded at the wireless edge. Existing communication systems only support parallel transmissions at the bit level, which fundamentally limits the number of tasks that can be concurrently processed. To address this bottleneck, this paper introduces the new concept of Semantic Multiplexing. Our approach shifts stream multiplexing from bits to tasks by merging multiple task-related compressed representations into a single semantic representation. As such, Semantic Multiplexing can multiplex more tasks than the number of physical channels without adding antennas or widening bandwidth by extending the effective degrees of freedom at the semantic layer, without contradicting Shannon capacity rules. We have prototyped Semantic Multiplexing on an experimental testbed with Jetson Orin Nano and millimeter-wave software-defined radios and tested its performance on image classification and sentiment analysis while comparing to several existing baselines in semantic communications. Our experiments demonstrate that Semantic Multiplexing allows jointly processing multiple tasks at the semantic level while maintaining sufficient task accuracy. For example, image classification accuracy drops by less than 4% when increasing from 2 to 8 the number of tasks multiplexed over a 4$\times$4 channel. Semantic Multiplexing reduces latency, energy consumption, and communication load respectively by up to 8$\times$, 25$\times$, and 54$\times$ compared to the baselines while keeping comparable performance. We pledge to publicly share the complete software codebase and the collected datasets for reproducibility.
Federated Continual 3D Segmentation With Single-round Communication
Federated learning seeks to foster collaboration among distributed clients while preserving the privacy of their local data. Traditionally, federated learning methods assume a fixed setting in which client data and learning objectives remain constant. However, in real-world scenarios, new clients may join, and existing clients may expand the segmentation label set as task requirements evolve. In such a dynamic federated analysis setup, the conventional federated communication strategy of model aggregation per communication round is suboptimal. As new clients join, this strategy requires retraining, linearly increasing communication and computation overhead. It also imposes requirements for synchronized communication, which is difficult to achieve among distributed clients. In this paper, we propose a federated continual learning strategy that employs a one-time model aggregation at the server through multi-model distillation. This approach builds and updates the global model while eliminating the need for frequent server communication. When integrating new data streams or onboarding new clients, this approach efficiently reuses previous client models, avoiding the need to retrain the global model across the entire federation. By minimizing communication load and bypassing the need to put unchanged clients online, our approach relaxes synchronization requirements among clients, providing an efficient and scalable federated analysis framework suited for real-world applications. Using multi-class 3D abdominal CT segmentation as an application task, we demonstrate the effectiveness of the proposed approach.
Multi-Scale Target-Aware Representation Learning for Fundus Image Enhancement
High-quality fundus images provide essential anatomical information for clinical screening and ophthalmic disease diagnosis. Yet, due to hardware limitations, operational variability, and patient compliance, fundus images often suffer from low resolution and signal-to-noise ratio. Recent years have witnessed promising progress in fundus image enhancement. However, existing works usually focus on restoring structural details or global characteristics of fundus images, lacking a unified image enhancement framework to recover comprehensive multi-scale information. Moreover, few methods pinpoint the target of image enhancement, e.g., lesions, which is crucial for medical image-based diagnosis. To address these challenges, we propose a multi-scale target-aware representation learning framework (MTRL-FIE) for efficient fundus image enhancement. Specifically, we propose a multi-scale feature encoder (MFE) that employs wavelet decomposition to embed both low-frequency structural information and high-frequency details. Next, we design a structure-preserving hierarchical decoder (SHD) to fuse multi-scale feature embeddings for real fundus image restoration. SHD integrates hierarchical fusion and group attention mechanisms to achieve adaptive feature fusion while retaining local structural smoothness. Meanwhile, a target-aware feature aggregation (TFA) module is used to enhance pathological regions and reduce artifacts. Experimental results on multiple fundus image datasets demonstrate the effectiveness and generalizability of MTRL-FIE for fundus image enhancement. Compared to state-of-the-art methods, MTRL-FIE achieves superior enhancement performance with a more lightweight architecture. Furthermore, our approach generalizes to other ophthalmic image processing tasks without supervised fine-tuning, highlighting its potential for clinical applications.
comment: Accepted for publication in Neural Networks (Elsevier), 2025
An Explainable Deep Learning Framework for Brain Stroke and Tumor Progression via MRI Interpretation
Early and accurate detection of brain abnormalities, such as tumors and strokes, is essential for timely intervention and improved patient outcomes. In this study, we present a deep learning-based system capable of identifying both brain tumors and strokes from MRI images, along with their respective stages. We implement two strategies based on convolutional neural networks, MobileNet V2 and ResNet-50, both optimized through transfer learning, to classify MRI scans into five diagnostic categories. Our dataset, aggregated and augmented from various publicly available MRI sources, was carefully curated to ensure class balance and image diversity. To enhance model generalization and prevent overfitting, we applied dropout layers and extensive data augmentation. The models achieved strong performance, with training accuracy reaching 93% and validation accuracy up to 88%. While ResNet-50 demonstrated slightly better results, MobileNet V2 remains a promising option for real-time diagnosis in low-resource settings due to its lightweight architecture. This research offers a practical AI-driven solution for early brain abnormality detection, with potential for clinical deployment and future enhancement through larger datasets and multimodal inputs.
comment: This is the preprint version of the manuscript. It is currently being prepared for submission to an academic conference
Sequential Attention-based Sampling for Histopathological Analysis
Deep neural networks are increasingly applied in automated histopathology. Yet, whole-slide images (WSIs) are often acquired at gigapixel sizes, rendering them computationally infeasible to analyze entirely at high resolution. Diagnostic labels are largely available only at the slide-level, because expert annotation of images at a finer (patch) level is both laborious and expensive. Moreover, regions with diagnostic information typically occupy only a small fraction of the WSI, making it inefficient to examine the entire slide at full resolution. Here, we propose SASHA -- Sequential Attention-based Sampling for Histopathological Analysis -- a deep reinforcement learning approach for efficient analysis of histopathological images. First, SASHA learns informative features with a lightweight hierarchical, attention-based multiple instance learning (MIL) model. Second, SASHA samples intelligently and zooms selectively into a small fraction (10-20\%) of high-resolution patches to achieve reliable diagnoses. We show that SASHA matches state-of-the-art methods that analyze the WSI fully at high resolution, albeit at a fraction of their computational and memory costs. In addition, it significantly outperforms competing, sparse sampling methods. We propose SASHA as an intelligent sampling model for medical imaging challenges that involve automated diagnosis with exceptionally large images containing sparsely informative features. Model implementation is available at: https://github.com/coglabiisc/SASHA.
Multimedia
SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs
Artificial Intelligence (AI) has made it possible for anyone to create images, audio, and video with unprecedented ease, enriching education, communication, and creative expression. At the same time, the rapid rise of AI-generated media has introduced serious risks, including misinformation, identity misuse, and the erosion of public trust as synthetic content becomes increasingly indistinguishable from real media. Although deepfake detection has advanced, many existing tools remain closed-source, limited in modality, or lacking transparency and educational value, making it difficult for users to understand how detection decisions are made. To address these gaps, we introduce SynthGuard, an open, user-friendly platform for detecting and analyzing AI-generated multimedia using both traditional detectors and multimodal large language models (MLLMs). SynthGuard provides explainable inference, unified image and audio support, and an interactive interface designed to make forensic analysis accessible to researchers, educators, and the public. The SynthGuard platform is available at: https://in-engr-nova.it.purdue.edu/
Failures to Surface Harmful Contents in Video Large Language Models AAAI 2026
Video Large Language Models (VideoLLMs) are increasingly deployed on numerous critical applications, where users rely on auto-generated summaries while casually skimming the video stream. We show that this interaction hides a critical safety gap: if harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention the harmful content in the output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs, (2) spatial information loss introduced by aggressive token downsampling within sampled frames, and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, aligning with these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases. Even when harmful content is clearly present in all frames, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLMs' designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.
comment: 12 pages, 8 figures. Accepted to AAAI 2026
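The first flaw named in the abstract above, sparse and uniformly spaced frame sampling, can be illustrated with a minimal sketch. The function names, the frame counts, and the choice of taking the middle frame of each segment are all assumptions made for illustration; they are not taken from the paper or from any specific VideoLLM.

```python
def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spaced evenly across the video."""
    if num_samples <= 0 or num_frames <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the middle frame of each equal-length segment (an assumed convention).
    return [int(step * i + step / 2) for i in range(num_samples)]


def clip_is_seen(sampled: list[int], clip_start: int, clip_len: int) -> bool:
    """True if any sampled frame falls inside [clip_start, clip_start + clip_len)."""
    return any(clip_start <= idx < clip_start + clip_len for idx in sampled)


# Hypothetical numbers: a 1800-frame video (60 s at 30 fps), 8 sampled frames,
# and a 60-frame (2 s) inserted clip starting at frame 400.
sampled = uniform_sample_indices(num_frames=1800, num_samples=8)
missed = not clip_is_seen(sampled, clip_start=400, clip_len=60)
```

With these toy numbers the sampled frames are 225 frames apart, so a 60-frame insert can sit entirely between two samples and never reach the model.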
Computation and Language
From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation
Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components and locate a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.
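The idea of deriving a vector from the activation difference of contrastive pairs and adding it to new activations can be sketched as follows. This is a toy illustration on plain Python lists, not the paper's implementation: averaging over examples before subtracting, the `scale` parameter, and all function names are assumptions.

```python
def mean_vector(rows: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts: list[list[float]], neg_acts: list[list[float]]) -> list[float]:
    """Difference of mean activations: positive examples minus negative examples."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden: list[float], vector: list[float], scale: float = 1.0) -> list[float]:
    """Add the scaled steering vector to one hidden-state vector."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations standing in for one layer's hidden states.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]
vec = steering_vector(pos, neg)
steered = apply_steering([0.5, 0.5, 0.5], vec, scale=0.5)
```

In a real model the vectors would be hidden states captured at the chosen intervention layer; which layer to use and how strongly to scale the vector are exactly the choices the abstract attributes to attribution patching and experimentation.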
BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals
Assessing journal impact is central to scholarly communication, yet existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape venue prestige in biomedicine. We present BioMedJImpact, a large-scale, biomedical-oriented dataset designed to advance journal-level analysis of scientific impact and AI engagement. Built from 1.74 million PubMed Central articles across 2,744 journals, BioMedJImpact integrates bibliometric indicators, collaboration features, and LLM-derived semantic indicators for AI engagement. Specifically, the AI engagement feature is extracted through a reproducible three-stage LLM pipeline that we propose. Using this dataset, we analyze how collaboration intensity and AI engagement jointly influence scientific impact across pre- and post-pandemic periods (2016-2019, 2020-2023). Two consistent trends emerge: journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings. To further validate the three-stage LLM pipeline we proposed for deriving the AI engagement feature, we conduct human evaluation, confirming substantial agreement in AI relevance detection and consistent subfield classification. Together, these contributions demonstrate that BioMedJImpact serves as both a comprehensive dataset capturing the intersection of biomedicine and AI, and a validated methodological framework enabling scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Code is available at https://github.com/JonathanWry/BioMedJImpact.
Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing
Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and the Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
LLM Reinforcement in Context
Current Large Language Model alignment research mostly focuses on improving model robustness against adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There is a lack of appropriate research into means of strengthening alignment which also scale with user input length. We propose interruptions as a possible solution to this problem. Interruptions are control sentences added to the user input approximately every x tokens for some arbitrary x. We suggest that this can be generalized to the Chain-of-Thought process to prevent scheming.
comment: 4 pages
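The interruption scheme described in the abstract above (a control sentence added roughly every x tokens) can be sketched in a few lines. This is only an illustration: whitespace-separated words stand in for model tokens, and the function name, the `every` parameter, and the placeholder control sentence are assumptions, not details from the paper.

```python
def add_interruptions(user_input: str, control_sentence: str, every: int) -> str:
    """Insert control_sentence after every `every` whitespace-separated tokens."""
    if every <= 0:
        raise ValueError("every must be a positive integer")
    tokens = user_input.split()
    out: list[str] = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        # Interrupt after each full chunk, but not after the final token.
        if i % every == 0 and i < len(tokens):
            out.append(control_sentence)
    return " ".join(out)


# Hypothetical control sentence; a real one would restate the alignment instruction.
result = add_interruptions("a b c d e", "[REMINDER]", every=2)
```

A production version would count tokens with the model's own tokenizer so that the spacing matches what the model actually sees.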
Evidence of Phase Transitions in Small Transformer-Based Language Models
Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors.
On the Brittleness of LLMs: A Journey around Set Membership
Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries -- among the most fundamental forms of reasoning -- using tasks like ``Is apple an element of the set \{pear, plum, apple, raspberry\}?''. We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models' ``understanding'' of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
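One of the evaluation dimensions listed above, element ordering, lends itself to a short sketch: generate the same membership query for every ordering of the set and attach the ground-truth answer. The template wording follows the example in the abstract; the function name and the exhaustive use of permutations are assumptions for illustration and not the paper's benchmark generator.

```python
from itertools import permutations

# Doubled braces produce literal "{" and "}" around the member list.
TEMPLATE = "Is {element} an element of the set {{{members}}}?"


def membership_prompts(element: str, items: list[str]) -> list[tuple[str, bool]]:
    """Build one (prompt, expected_answer) pair per ordering of `items`."""
    expected = element in items
    prompts = []
    for order in permutations(items):
        prompt = TEMPLATE.format(element=element, members=", ".join(order))
        prompts.append((prompt, expected))
    return prompts


prompts = membership_prompts("apple", ["pear", "plum", "apple"])
```

Because the expected answer is the same for every ordering, any variation in a model's answers across these prompts reflects sensitivity to order rather than to content. The number of orderings grows factorially, so larger sets would need sampling instead of full enumeration.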
Adaptive Focus Memory for Language Models
Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, but their behavior is still bottlenecked by fixed context windows and naive memory strategies. Replaying the full conversation at every turn is simple but expensive, while static summarization or recency-only heuristics often erase safety-critical user details. We present Adaptive Focus Memory (AFM), a dynamic context manager that assigns each past message one of three fidelity levels -- FULL, COMPRESSED, or PLACEHOLDER -- based on semantic similarity to the current query, half-life recency weighting, and importance classification. AFM packs messages chronologically under a strict token budget, preferring high fidelity for the most relevant turns while aiming to preserve a cheap trace of the dialogue. In a safety-oriented benchmark involving a user with a severe peanut allergy planning a trip to Thailand, AFM retains the allergy across both short and medium-length conversations, matches the safety performance of naive replay, and cuts average token usage by 66% relative to a replay baseline. We release a modular Python implementation of AFM designed for OpenAI-compatible APIs and offline operation, enabling practitioners to reduce inference cost without sacrificing safety or factual continuity in the evaluated scenario.
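The three ingredients named above (similarity to the current query, half-life recency weighting, and importance classification) and the chronological packing under a token budget can be sketched as follows. This is a simplified illustration and not the released AFM implementation: the thresholds, the multiplicative score, the truncation-based "compression", the word-count token estimate, and every name in the code are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    similarity: float  # assumed precomputed similarity to the current query, in [0, 1]
    age_turns: int     # how many turns ago the message was sent
    important: bool    # assumed output of an importance classifier


def recency_weight(age_turns: int, half_life: float = 4.0) -> float:
    """Exponential decay: the weight halves every `half_life` turns."""
    return 0.5 ** (age_turns / half_life)


def choose_fidelity(msg: Message, full_at: float = 0.5, compressed_at: float = 0.2) -> str:
    """Map a message to FULL, COMPRESSED, or PLACEHOLDER (illustrative thresholds)."""
    if msg.important:
        return "FULL"
    score = msg.similarity * recency_weight(msg.age_turns)
    if score >= full_at:
        return "FULL"
    if score >= compressed_at:
        return "COMPRESSED"
    return "PLACEHOLDER"


def render(msg: Message, level: str) -> str:
    """Produce the text kept in context for a given fidelity level."""
    if level == "FULL":
        return msg.text
    if level == "COMPRESSED":
        # Stand-in for a real summarizer: keep the first five words.
        return " ".join(msg.text.split()[:5])
    return "[omitted]"


def pack(messages: list[Message], budget: int) -> list[str]:
    """Walk messages chronologically, downgrading fidelity when the budget is tight."""
    levels = ["FULL", "COMPRESSED", "PLACEHOLDER"]
    packed: list[str] = []
    used = 0
    for msg in messages:
        start = levels.index(choose_fidelity(msg))
        for level in levels[start:]:
            text = render(msg, level)
            cost = len(text.split())  # whitespace word count as a crude token count
            if used + cost <= budget:
                packed.append(text)
                used += cost
                break
    return packed


history = [
    Message("I have a severe peanut allergy so please keep that in mind", 0.1, 8, True),
    Message("What is the weather like in Bangkok during the month of March", 0.5, 4, False),
    Message("Thanks that sounds great", 0.05, 2, False),
    Message("Which street food dishes should I try in Thailand", 0.9, 1, False),
]
context = pack(history, budget=30)
```

In this toy run the allergy message is kept in full because it is flagged as important, even though it is old and not similar to the current query. The greedy, oldest-first packing is the simplest possible policy; a budget-aware allocator that reserves room for the most relevant turns would be closer to the behavior the abstract describes.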
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.
Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves an improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.
Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing
Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a "faithfulness gap": they optimize for format mimicry rather than sound reasoning. This gap enables the LLM's powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning "Houston" from "NASA" despite an explicit edit). To solve this core LLM alignment problem, we propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.
Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing
Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.
comment: Accepted to Advanced Engineering Informatics
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
comment: 47 pages, 10 figures, Project Website: https://idealistxy.github.io/Uni-MoE-v2.github.io/; Codes: https://github.com/HITsz-TMG/Uni-MoE
Group-Aware Reinforcement Learning for Output Diversity in Large Language Models EMNLP
Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
comment: EMNLP Main 2025
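A reward computed over the group as a whole, as described in the GAPO abstract above, might look like the following sketch. The specific reward (the inverse of how often a valid completion appears in the group) is a hypothetical choice made for illustration and is not the paper's reward function; the normalization step uses the population standard deviation, which is also an assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def frequency_aware_rewards(group: list[str], valid: set[str]) -> list[float]:
    """Score each completion using counts taken over the whole sampled group."""
    counts = Counter(c for c in group if c in valid)
    rewards = []
    for completion in group:
        if completion not in valid:
            rewards.append(0.0)  # invalid completions earn nothing
        else:
            # Hypothetical rule: completions repeated within the group earn less.
            rewards.append(1.0 / counts[completion])
    return rewards


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within the group (zero mean, unit variance)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Toy group of four sampled completions for a prompt with three valid answers.
group = ["red", "red", "blue", "cat"]
rewards = frequency_aware_rewards(group, valid={"red", "blue", "green"})
advantages = group_advantages(rewards)
```

The point of the sketch is that no completion can be scored in isolation: the reward for "red" depends on how many other samples in the group also said "red", which is what pushes the policy toward more uniform coverage of the valid answers.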
MMWOZ: Building Multimodal Agent for Task-oriented Dialogue
Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.
Mitigating Length Bias in RLHF through a Causal Lens
Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias -- a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.
A Content-Preserving Secure Linguistic Steganography AAAI 2026
Existing linguistic steganography methods primarily rely on content transformations to conceal secret messages. However, they often cause subtle yet innocent-looking deviations between normal and stego texts, posing potential security risks in real-world applications. To address this challenge, we propose a content-preserving linguistic steganography paradigm for perfectly secure covert communication without modifying the cover text. Based on this paradigm, we introduce CLstega (Content-preserving Linguistic steganography), a novel method that embeds secret messages through controllable distribution transformation. CLstega first applies an augmented masking strategy to locate and mask embedding positions, where the probability distributions predicted by a masked language model (MLM) are easily adjustable for transformation. Subsequently, a dynamic distribution steganographic coding strategy is designed to encode secret messages by deriving target distributions from the original probability distributions. To achieve this transformation, CLstega carefully selects target words for embedding positions as labels to construct a masked sentence dataset, which is used to fine-tune the original MLM, producing a target MLM capable of directly extracting secret messages from the cover text. This approach ensures perfect security of secret messages while fully preserving the integrity of the original cover text. Experimental results show that CLstega can achieve a 100% extraction success rate, and outperforms existing methods in security, effectively balancing embedding capacity and security.
comment: This is the extended version of the paper accepted to AAAI 2026
Accepted with Minor Revisions: Value of AI-Assisted Scientific Writing
Large Language Models have seen expanding application across domains, yet their effectiveness as assistive tools for scientific writing -- an endeavor requiring precision, multimodal synthesis, and domain expertise -- remains insufficiently understood. We examine the potential of LLMs to support domain experts in scientific writing, with a focus on abstract composition. We design an incentivized randomized controlled trial with a hypothetical conference setup where participants with relevant expertise are split into an author and reviewer pool. Inspired by methods in behavioral science, our novel incentive structure encourages authors to edit the provided abstracts to an acceptable quality for a peer-reviewed submission. Our 2x2 between-subject design expands into two dimensions: the implicit source of the provided abstract and the disclosure of it. We find that authors make more edits when editing human-written abstracts than AI-generated abstracts in the absence of source attribution, often guided by the higher perceived readability of AI-generated text. Upon disclosure of source information, the volume of edits converges in both source treatments. Reviewer decisions remain unaffected by the source of the abstract, but bear a significant correlation with the number of edits made. Careful stylistic edits, especially in the case of AI-generated abstracts, in the presence of source information, improve the chance of acceptance. We find that AI-generated abstracts hold potential to reach comparable levels of acceptability to human-written ones with minimal revision, and that perceptions of AI authorship, rather than objective quality, drive much of the observed editing behavior. Our findings underscore the significance of source disclosure in collaborative scientific writing.
TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction AAAI 2026
Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.
comment: Accepted by AAAI 2026
QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs
Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR's noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.
SGuard-v1: Safety Guardrail for Large Language Models
We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two components according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.
comment: Technical Report
Evolving Prompts for Toxicity Search in Large Language Models
Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
comment: pre-print
Co-Layout: LLM-driven Co-optimization for Interior Layout
We present a novel framework for automated interior design that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement. Given a textual prompt, the LLM-driven agent workflow extracts structured design constraints related to room configurations and furniture arrangements. These constraints are encoded into a unified grid-based representation inspired by "Modulor". Our formulation accounts for key design requirements, including corridor connectivity, room accessibility, spatial exclusivity, and user-specified preferences. To improve computational efficiency, we adopt a coarse-to-fine optimization strategy that begins with a low-resolution grid to solve a simplified problem and guides the solution at the full resolution. Experimental results across diverse scenarios demonstrate that our joint optimization approach significantly outperforms existing two-stage design pipelines in solution quality, and achieves notable computational efficiency through the coarse-to-fine strategy.
Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing AAAI
Large Language Models (LLMs) have greatly advanced knowledge graph question answering (KGQA), yet existing systems are typically optimized for returning highly relevant but predictable answers. A missing yet desired capacity is to exploit LLMs to suggest surprising and novel ("serendipitous") answers. In this paper, we formally define the serendipity-aware KGQA task and propose the SerenQA framework to evaluate LLMs' ability to uncover unexpected insights in scientific KGQA tasks. SerenQA includes a rigorous serendipity metric based on relevance, novelty, and surprise, along with an expert-annotated benchmark derived from the Clinical Knowledge Graph, focused on drug repurposing. Additionally, it features a structured evaluation pipeline encompassing three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Our experiments reveal that while state-of-the-art LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable discoveries, underscoring significant room for future improvements. Our curated resources and extended version are released at: https://cwru-db-group.github.io/serenQA.
comment: The 40th AAAI Conference on Artificial Intelligence (AAAI-26)
Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models AAAI 2026
Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not provide performance information on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We design it to favor and encourage reward models that better capture preferences across different dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench evaluation results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, our findings show that the proposed inference-time probing method offers a reliable metric for assessing the confidence of reward predictions, which ultimately improves the alignment of LLMs.
comment: Accepted by AAAI 2026
DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions
With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing that capture only a fraction of an image's visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curate a human-annotated multi-modal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual, 47% in cultural alignment, and 54% in 3D spatial capabilities. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.
DiagnoLLM: A Hybrid Bayesian Neural Language Framework for Interpretable Disease Diagnosis
Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations. We present DiagnoLLM, a hybrid framework that integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. DiagnoLLM begins with GP-unmix, a Gaussian Process-based hierarchical model that infers cell-type-specific gene expression profiles from bulk and single-cell RNA-seq data while modeling biological uncertainty. These features, combined with regulatory priors from eQTL analysis, power a neural classifier that achieves high predictive performance in Alzheimer's Disease (AD) detection (88.0% accuracy). To support human understanding and trust, we introduce an LLM-based reasoning module that translates model outputs into audience-specific diagnostic reports, grounded in clinical features, attribution signals, and domain knowledge. Human evaluations confirm that these reports are accurate, actionable, and appropriately tailored for both physicians and patients. Our findings show that LLMs, when deployed as post-hoc reasoners rather than end-to-end predictors, can serve as effective communicators within hybrid diagnostic pipelines.
Interpreting the Effects of Quantization on LLMs AACL 2025
Quantization offers a practical solution for deploying LLMs in resource-constrained environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full-precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that the effect of quantization may vary by model and task; however, we did not observe any drastic change that would discourage the use of quantization as a reliable model compression technique.
comment: Accepted to AACL 2025 Main
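The dead-neuron criterion described in the abstract above (activation values close to 0 across the dataset) can be illustrated with a minimal sketch. The function name, the `eps` threshold, and the array layout are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def dead_neuron_fraction(activations, eps=1e-3):
    """Fraction of neurons whose activation stays near zero on every sample.

    activations: array of shape (num_samples, num_neurons).
    eps: hypothetical threshold for "close to 0"; the paper's value is not given here.
    """
    activations = np.asarray(activations, dtype=float)
    # A neuron counts as dead only if |activation| < eps for all samples.
    dead = (np.abs(activations) < eps).all(axis=0)
    return float(dead.mean())
```

Computing this value once for a full-precision model and once for its quantized counterpart, on the same inputs, gives the kind of comparison the abstract refers to.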
InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Reinforcement learning has powered many of the recent breakthroughs in large language models, especially for tasks where rewards can be computed automatically, such as code generation. However, these methods deteriorate in open-ended domains like medical consultation, where feedback is inherently ambiguous, highly context-dependent, and cannot be reduced to a reliable scalar signal. In such settings, RL must either rely on supervision-intensive reward models that often fail to generalize, or it falls into pathological behaviors such as reward hacking, an especially troubling risk for high-stakes medical dialogue. To address these limitations, we introduce ORBIT, an open-ended rubric-based incremental training framework for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with dynamically constructed rubrics that serve as adaptive guides for incremental RL. Instead of relying on external medical knowledge bases or handcrafted rule sets, ORBIT uses rubric-driven feedback to steer the learning process. Its judge component can be instantiated with general-purpose instruction-following LLMs, removing the need for any task-specific fine-tuning. Applied to the Qwen3-4B-Instruct model, ORBIT raises the HealthBench-Hard score from 7.0 to 27.5 using only 2k training samples, achieving SOTA performance for models at this scale. With larger rubric datasets, ORBIT-trained models further compete with the strongest open-source baselines on HealthBench-Hard. Our analysis shows that rubric-guided RL consistently improves consultation quality across diverse medical scenarios. We also apply this rubric generation and training pipeline to InfoBench, where ORBIT enhances instruction-following performance, highlighting the generality of rubric-based feedback.
Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation
Cross-lingual, cross-task transfer is challenged by task-specific data scarcity, which becomes more severe as language support grows and is further amplified in vision-language models (VLMs). We investigate multilingual generalization in encoder-decoder transformer VLMs to enable zero-shot image captioning in languages encountered only in the translation task. In this setting, the encoder must learn to generate generalizable, task-aware latent vision representations to instruct the decoder via inserted cross-attention layers. To analyze scaling behavior, we train Florence-2 based and Gemma-2 based models (0.4B to 11.2B parameters) on a synthetic dataset using varying compute budgets. While all languages in the dataset have image-aligned translations, only a subset of them include image captions. Notably, we show that captioning can emerge using a language prefix, even when this language only appears in the translation task. We find that indirect learning of unseen task-language pairs adheres to scaling laws that are governed by the multilinguality of the model, model size, and seen training samples. Finally, we demonstrate that the scaling laws extend to downstream tasks, achieving competitive performance through fine-tuning in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).
OptiHive: Ensemble Selection for LLM-Based Optimization via Statistical Modeling
LLM-based solvers have emerged as a promising means of automating problem modeling and solving. However, they remain unreliable and often depend on iterative repair loops that result in significant latency. We introduce OptiHive, a framework that enhances any solver-generation pipeline to produce higher-quality solvers from natural-language descriptions of optimization problems. OptiHive uses a single batched generation to produce diverse components (solvers, problem instances, and validation tests) and filters out erroneous components to ensure fully interpretable outputs. Accounting for the imperfection of the generated components, we employ a statistical model to infer their true performance, enabling principled uncertainty quantification and solver selection. On tasks ranging from traditional optimization problems to challenging variants of the Multi-Depot Vehicle Routing Problem, OptiHive significantly outperforms baselines, increasing the optimality rate from 5% to 92% on the most complex problems.
Silenced Biases: The Dark Side LLMs Learned to Refuse AAAI
Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models' latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models' direct responses and their underlying fairness issues.
comment: Accepted to The 40th Annual AAAI Conference on Artificial Intelligence - AI Alignment Track (Oral)
Trainable Dynamic Mask Sparse Attention
The increasing demand for long-context modeling in large language models (LLMs) is bottlenecked by the quadratic complexity of the standard self-attention mechanism. The community has proposed sparse attention to mitigate this issue. However, position-aware sparse attention methods rely on static sparse structures that lack adaptability to diverse query contexts, while content-aware sparse attention methods depend on heuristic key-value selection, hindering full differentiability. We introduce a trainable dynamic mask sparse attention mechanism, a method that merges the advantages of both position-aware and content-aware approaches. Dynamic Mask Attention (DMA) achieves this through three key innovations: First, it leverages value vector representations to generate content-aware dynamic masks, enabling the model to adaptively identify and attend to critical information. Second, it computes position-aware sparse weights in a hardware-friendly manner, efficiently skipping unnecessary computational regions. Finally, we demonstrate that the introduced dynamic mask and sparse weights do not obstruct gradients, supporting end-to-end training. We have validated the performance of DMA through comprehensive experiments. A large body of experimental evidence shows that DMA consistently holds a Pareto advantage over state-of-the-art sparse attention baselines in tasks including scaling laws, multi-query associative recall, standard benchmarks, and needle in a haystack tests, while also delivering up to a 10x overall speedup. These results highlight its ability to effectively balance model efficiency with long-context modeling capabilities. Our computational kernel code is now open-source at https://github.com/SmallDoges/flash-dmattn to encourage further research and application by the community.
comment: 26 pages
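As a rough illustration of the idea in the abstract above (scoring key positions from their value vectors and attending only to the highest-scoring ones), here is a toy single-head sketch in NumPy. It is not the authors' kernel: the gating vector `w_gate`, the hard top-k selection, and the `keep` parameter are assumptions chosen to keep the example small, and the hard selection shown here says nothing about how the paper keeps the mask trainable.

```python
import numpy as np

def dynamic_mask_attention(q, k, v, w_gate, keep):
    """Toy causal attention with a content-aware top-k key mask.

    q, k, v: arrays of shape (seq_len, d).
    w_gate: shape (d,), a hypothetical projection that scores each key
            position from its value vector.
    keep: maximum number of key positions each query attends to.
    """
    seq_len, d = q.shape
    gate = v @ w_gate                              # one score per key position
    scores = (q @ k.T) / np.sqrt(d)                # (seq_len, seq_len)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Use the gate as an additive bias and hide future positions.
    biased = np.where(causal, scores + gate[None, :], -np.inf)
    # Per query row, find the keep-th largest entry and mask everything below it.
    kth = np.sort(biased, axis=-1)[:, -keep][:, None]
    mask = causal & (biased >= kth)
    masked = np.where(mask, biased, -np.inf)
    # Numerically stable softmax over the surviving positions.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, mask
```

The returned mask makes the sparsity pattern inspectable: each query row keeps at most `keep` positions (barring exact ties), and early rows keep fewer because of the causal constraint.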
Historical/temporal necessities/possibilities, and a logical theory of them in branching time
In this paper, we do three kinds of work. First, we recognize four notions of necessity and two notions of possibility related to time flow, namely strong/weak historical/temporal necessities, as well as historical/temporal possibilities, which are motivated more from a linguistic perspective than from a philosophical one. Strong/weak historical necessities and historical possibility typically concern the possible futures of the present world, and strong/weak temporal necessities and temporal possibility concern possible timelines of alternatives of the present world. Second, we provide our approach to the six notions and present a logical theory of them in branching time. Our approach to the six notions is as follows. The agent has a system of ontic rules that determine expected timelines. She treats some ontic rules as undefeatable, determining accepted timelines. The domains of strong/weak historical necessities, respectively, consist of accepted and expected timelines passing through the present moment, and historical possibility is the dual of strong historical necessity. The domains of strong/weak temporal necessities, respectively, consist of accepted and expected timelines, and temporal possibility is the dual of strong temporal necessity. The logical theory has six operators: a last-moment operator, a next-moment operator, and four operators for the four notions of necessity. Formulas' evaluation contexts consist of a tree-like model representing a time flow, a context representing the agent's system of ontic rules, a timeline, and an instant. Third, we offer an axiomatic system for the logical theory and show its soundness and completeness.
C$^3$TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation AAAI-2026
Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C$^3$TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C$^3$TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C$^3$TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C$^3$TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.
comment: This paper has been accepted as a poster presentation at AAAI-2026
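One way a weighted KL objective can translate into an adjustment of token probabilities is sketched below. It relies on a standard result: the distribution q minimizing KL(q || base) + sum_i w_i KL(q || attr_i) is the normalized weighted geometric mean of the distributions. Whether C$^3$TG uses this exact objective is not stated in the abstract, so the function name and the formulation are assumptions for illustration only.

```python
import numpy as np

def kl_weighted_token_probs(base_probs, attr_probs, weights):
    """Combine next-token distributions by minimizing a weighted sum of KLs.

    base_probs: shape (vocab,), the language model's next-token distribution.
    attr_probs: list of shape-(vocab,) distributions, one per attribute
                (hypothetical stand-ins for classifier-derived preferences).
    weights: non-negative weight per attribute distribution.
    All probabilities must be strictly positive because logs are taken.
    """
    log_q = np.log(np.asarray(base_probs, dtype=float))
    total = 1.0                                   # weight of the base term
    for p, w in zip(attr_probs, weights):
        log_q = log_q + w * np.log(np.asarray(p, dtype=float))
        total += w
    log_q = log_q / total                         # weighted geometric mean in log space
    log_q = log_q - log_q.max()                   # stabilize before exponentiating
    q = np.exp(log_q)
    return q / q.sum()
```

With all weights set to zero the base distribution is returned unchanged, and larger weights pull the result toward the corresponding attribute distribution.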
Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models
Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have raised this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. First, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons, which affect the model's political stance on individual topics. We confirm the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method that effectively mitigates cross-topic stance generalization. Experimental results demonstrate the robustness of the identified neuron types across various models and datasets, and show that InhibitFT reduces cross-topic stance generalization by 20% on average while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate cross-topic stance generalization.
Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation
With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs' ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.
comment: 14 pages, 9 figures
FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights
The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME's effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrate that our work can efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.
comment: 12 pages, 4 figures, 5 tables
BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
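Of the pipeline stages listed in the abstract above, n-gram repetition analysis is the simplest to illustrate. The sketch below is a generic version, not the authors' implementation: whitespace tokenization, the choice of n = 3, and the ratio definition are all assumptions made for this example.

```python
from collections import Counter

def ngram_repetition_ratio(text, n=3):
    """Share of n-gram occurrences that belong to an n-gram seen more than once.

    Tokens are split on whitespace, which is a simplifying assumption;
    a real pipeline would likely use a script-aware tokenizer.
    Returns 0.0 when the text is too short to contain any n-gram.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of n-grams that appear at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A filter built on this function would discard documents whose ratio exceeds a chosen cutoff. The cutoff itself is a tuning decision that the abstract does not specify.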
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and underscoring the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.
comment: 28 pages, 17 figures, accepted by NeurIPS 2025
Mitigating Overthinking in Large Reasoning Models via Manifold Steering
Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overhead. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first show that the tendency to overthink can be effectively captured by a single direction in the model's activation space and that the issue can be eased by intervening on the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noise introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.
comment: 19 pages, 7 figures
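The projection idea in the abstract above can be illustrated with a plain orthogonal projection onto a low-dimensional subspace. This is only a sketch under stated assumptions: the orthonormal basis, the function names, and the toy vectors are hypothetical, and the paper's manifold estimation and noise approximation are not described in enough detail here to reproduce.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_subspace(direction, orthonormal_basis):
    """Orthogonally project `direction` onto the span of `orthonormal_basis`.

    Assumption: the basis vectors are orthonormal, so the projection is
    the sum of (direction . b) * b over the basis vectors b.
    """
    projected = [0.0] * len(direction)
    for basis_vector in orthonormal_basis:
        coefficient = dot(direction, basis_vector)
        projected = [p + coefficient * x for p, x in zip(projected, basis_vector)]
    return projected


# Toy example: a 3-D steering direction restricted to a 2-D subspace.
steering_direction = [1.0, 2.0, 3.0]
subspace_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
steered = project_onto_subspace(steering_direction, subspace_basis)
```

The component of the direction outside the subspace (the third coordinate in the toy example) is discarded, which is the sense in which a projection can remove off-manifold noise.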
From Euler to AI: Unifying Formulas for Mathematical Constants NeurIPS 2025
The constant $π$ has fascinated scholars throughout the centuries, inspiring numerous formulas for its evaluation, such as infinite sums and continued fractions. Despite their individual significance, many of the underlying connections among formulas remain unknown, missing unifying theories that could unveil deeper understanding. The absence of a unifying theory reflects a broader challenge across math and science: knowledge is typically accumulated through isolated discoveries, while deeper connections often remain hidden. In this work, we present an automated framework for the unification of mathematical formulas. Our system combines Large Language Models (LLMs) for systematic formula harvesting, an LLM-code feedback loop for validation, and a novel symbolic algorithm for clustering and eventual unification. We demonstrate this methodology on the hallmark case of $π$, an ideal testing ground for symbolic unification. Applying this approach to 455,050 arXiv papers, we validate 385 distinct formulas for $π$ and prove relations between 360 (94%) of them, of which 166 (43%) can be derived from a single mathematical object - linking canonical formulas by Euler, Gauss, Brouncker, and newer ones from algorithmic discoveries by the Ramanujan Machine. Our method generalizes to other constants, including $e$, $ζ(3)$, and Catalan's constant, demonstrating the potential of AI-assisted mathematics to uncover hidden structures and unify knowledge across domains.
comment: Final version for NeurIPS 2025
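To give a concrete sense of what a relation between two formulas for $π$ looks like, the sketch below compares two classical formulas using exact rational arithmetic: the Leibniz series for $π/4$ and Brouncker's continued fraction for $4/π$. It is an illustration only, not the paper's symbolic algorithm; the function names are hypothetical, and the code checks the relation for a finite number of truncations rather than proving it.

```python
from fractions import Fraction


def leibniz_partial_sum(terms: int) -> Fraction:
    """First `terms` terms of 1 - 1/3 + 1/5 - ..., which converges to pi/4."""
    return sum(Fraction((-1) ** k, 2 * k + 1) for k in range(terms))


def brouncker_truncated(depth: int) -> Fraction:
    """Brouncker's continued fraction 1 + 1^2/(2 + 3^2/(2 + 5^2/(2 + ...))),
    which converges to 4/pi, truncated after `depth` partial numerators."""
    tail = Fraction(0)
    # Evaluate from the innermost level outwards.
    for k in range(depth, 0, -1):
        tail = Fraction((2 * k - 1) ** 2) / (2 + tail)
    return 1 + tail


# Each truncation of the continued fraction is the exact reciprocal of a
# Leibniz partial sum, e.g. 15/13 versus 13/15 at depth 2.
for depth in range(8):
    assert brouncker_truncated(depth) * leibniz_partial_sum(depth + 1) == 1
```

Using `Fraction` rather than floating point makes the equality check exact, so agreement at each depth cannot be an artifact of rounding.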
Is deeper always better? Replacing linear mappings with deep learning networks in the Discriminative Lexicon Model
Recently, deep learning models have increasingly been used in cognitive modelling of language. This study asks whether deep learning can help us to better understand the learning problem that needs to be solved by speakers, above and beyond linear methods. We utilise the Discriminative Lexicon Model introduced by Baayen and colleagues, which models comprehension and production with mappings between numeric form and meaning vectors. While so far, these mappings have been linear (Linear Discriminative Learning, LDL), in the present study we replace them with deep dense neural networks (Deep Discriminative Learning, DDL). We find that DDL affords more accurate mappings for large and diverse datasets from English and Dutch, but not necessarily for Estonian and Taiwan Mandarin. DDL outperforms LDL in particular for words with pseudo-morphological structure such as chol+er. Applied to average reaction times, we find that DDL is outperformed by frequency-informed linear mappings (FIL). However, DDL trained in a frequency-informed way ('frequency-informed' deep learning, FIDDL) substantially outperforms FIL. Finally, while linear mappings can very effectively be updated from trial-to-trial to model incremental lexical learning, deep mappings cannot do so as effectively. At present, both linear and deep mappings are informative for understanding language.
comment: 19 pages, 6 figures; includes a few numeric changes to results due to a fixed bug, published version
Neurocognitive Modeling for Text Generation: Deep Learning Architecture for EEG Data
Text generating capabilities have undergone a substantial transformation with the introduction of large language models (LLMs). Electroencephalography (EEG)-based text production is still difficult, though, because it requires a lot of data and processing power. This paper introduces a new method that combines the use of the Gemma 2B LLM with a classifier-LLM architecture to incorporate a Recurrent Neural Network (RNN) encoder. Our approach drastically lowers the amount of data and compute power needed while achieving performance close to that of cutting-edge methods. Notably, compared to current methodologies, our methodology delivers an overall performance improvement of 10%. The suggested architecture demonstrates the possibility of effective transfer learning for EEG-based text production, remaining strong and functional even in the face of data limits. This work highlights the potential of integrating LLMs with EEG decoding to improve assistive technologies and improve independence and communication for those with severe motor limitations. Our method pushes the limits of present capabilities and opens new paths for research and application in brain-computer interfaces by efficiently using the strengths of pre-trained language models. This makes EEG-based text production more accessible and efficient.
comment: 15 pages, 10 figures, 5 tables
PIP: Perturbation-based Iterative Pruning for Large Language Models EMNLP 2025
The rapid increase in the parameter counts of Large Language Models (LLMs), which often reach into the billions or even trillions, presents significant challenges for their practical deployment, particularly in resource-constrained environments. To address this issue, we propose PIP (Perturbation-based Iterative Pruning), a novel double-view structured pruning method to optimize LLMs, which combines information from two different views: the unperturbed view and the perturbed view. With the calculation of gradient differences, PIP iteratively prunes those that struggle to distinguish between these two views. Our experiments show that PIP reduces the parameter count by approximately 20% while retaining over 85% of the original model's accuracy across varied benchmarks. In some cases, the performance of the pruned model is within 5% of the unpruned version, demonstrating PIP's ability to preserve key aspects of model effectiveness. Moreover, PIP consistently outperforms existing state-of-the-art (SOTA) structured pruning methods, establishing it as a leading technique for optimizing LLMs in constrained environments.
comment: EMNLP 2025 Findings, 17 pages, 5 figures, 15 tables
Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework AAAI 2026
Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite this progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics serving as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.
comment: AAAI 2026 Oral
Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model
The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and successfully passed the Iranian Basic Medical Science Entrance Exam (IBSEE) in September 2023, which the baseline model did not. Additionally, the fine-tuned model improved Persian-translated MMLU accuracy by an average of 2.67%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments. Future research could explore multimodal input to further enhance performance.
comment: 8 pages, 7 figures
GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning AAAI 2026
Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
comment: Accepted by AAAI 2026
SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from "how to measure divergence" to "where to apply learning". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models without architectural changes or extra reference models.
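One possible reading of the greedy Top-k variant described above can be sketched as follows. The acceptance rule used here (the student's greedy token must appear among the teacher's k most probable tokens) is an assumption made for illustration, as are all function names and the toy distributions; the abstract does not spell out the exact verification procedure, and Spec-k is not shown.

```python
def top_k_indices(probs, k):
    """Indices of the k largest probabilities."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def accept_tokens(student_probs, teacher_probs, k=2):
    """Per position: is the student's greedy proposal in the teacher's top-k?

    Assumption: this membership test stands in for the 'verify' step.
    """
    accepted = []
    for student, teacher in zip(student_probs, teacher_probs):
        proposal = max(range(len(student)), key=lambda i: student[i])
        accepted.append(proposal in top_k_indices(teacher, k))
    return accepted


def loss_weights(accepted, reject_weight=0.0):
    """Full weight for accepted tokens; rejected tokens are masked (0.0)
    or down-weighted (any value between 0 and 1)."""
    return [1.0 if ok else reject_weight for ok in accepted]


def token_acceptance_rate(accepted):
    """Share of positions whose proposal was accepted (TAR)."""
    return sum(accepted) / len(accepted)


# Toy example: two positions over a three-token vocabulary.
student = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
teacher = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
flags = accept_tokens(student, teacher, k=2)
weights = loss_weights(flags)
```

The resulting weights would multiply whatever per-token divergence the distillation objective already uses, which is consistent with the objective-agnostic framing in the abstract.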
Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism
Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.
comment: 33 pages, 27 tables, 10 figures
Contextual Integrity in LLMs via Reasoning and Reinforcement Learning NeurIPS 2025
As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what is the appropriate information to share while carrying out a certain task -- becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only $\sim700$ examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Robust Response Generation in the Wild
The proliferation of large language models (LLMs) has significantly advanced intelligent systems. Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences and alleviate the uncertainty during their response generation. When this difference is ambiguous, LLMs experience considerable uncertainty about their generation. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models to adapt the retrieved information difference, facilitating robust response generation of LLMs even in conflicting contexts. Extensive experiments confirm our theoretical analysis and demonstrate the performance of Swin-VIB. Notably, Swin-VIB outperforms all competitive baselines in terms of the accuracy of the multiple-choice task, while improving the EM values in the open-ended QA task by at least 11.14%.
MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning ALT
Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Adaptive Planning with Long-term mEmory), a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop. MAPLE integrates four key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long-term memory for experience reuse and evolution. Experiments on WikiTQ and TabFact demonstrate significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.
comment: 27 pages, 11 figures, ALTA 2025
Human-Computer Interaction
Maximizing the efficiency of human feedback in AI alignment: a comparative analysis
Reinforcement Learning from Human Feedback (RLHF) relies on preference modeling to align machine learning systems with human values, yet the popular approach of random pair sampling with Bradley-Terry modeling is statistically limited and inefficient under constrained annotation budgets. In this work, we explore alternative sampling and evaluation strategies for preference inference in RLHF, drawing inspiration from areas such as game theory, statistics, and social choice theory. Our best-performing method, Swiss InfoGain, employs a Swiss tournament system with a proxy mutual-information-gain pairing rule, which significantly outperforms all other methods in constrained annotation budgets while also being more sample-efficient. Even in high-resource settings, we can identify superior alternatives to the Bradley-Terry baseline. Our experiments demonstrate that adaptive, resource-aware strategies reduce redundancy, enhance robustness, and yield statistically significant improvements in preference learning, highlighting the importance of balancing alignment quality with human workload in RLHF pipelines.
comment: 16 pages, 6 figures, 6 algorithms. AICS2025
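For reference, the Bradley-Terry baseline mentioned above models the probability that item i is preferred over item j as a logistic function of the difference between their latent scores. The sketch below shows that standard formula and the log-likelihood of a set of observed comparisons. The function names and toy scores are illustrative, and the Swiss InfoGain pairing rule is not shown because the abstract does not define it.

```python
import math


def bradley_terry_win_probability(score_i: float, score_j: float) -> float:
    """P(i preferred over j) = exp(s_i) / (exp(s_i) + exp(s_j)),
    written in the equivalent logistic form."""
    return 1.0 / (1.0 + math.exp(score_j - score_i))


def pairwise_log_likelihood(scores, comparisons):
    """Log-likelihood of observed outcomes under the Bradley-Terry model.

    `comparisons` is a list of (winner_index, loser_index) pairs.
    """
    return sum(
        math.log(bradley_terry_win_probability(scores[winner], scores[loser]))
        for winner, loser in comparisons
    )
```

Fitting the model means choosing scores that maximize this log-likelihood; how the pairs to annotate are chosen is the part that sampling strategies such as those compared in the paper would change.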
BeautyGuard: Designing a Multi-Agent Roundtable System for Proactive Beauty Tech Compliance through Stakeholder Collaboration
As generative AI enters enterprise workflows, ensuring compliance with legal, ethical, and reputational standards becomes a pressing challenge. In beauty tech, where biometric and personal data are central, traditional reviews are often manual, fragmented, and reactive. To examine these challenges, we conducted a formative study with six experts (four IT managers, two legal managers) at a multinational beauty company. The study revealed pain points in rule checking, precedent use, and the lack of proactive guidance. Motivated by these findings, we designed a multi-agent "roundtable" system powered by a large language model. The system assigns role-specialized agents for legal interpretation, checklist review, precedent search, and risk mitigation, synthesizing their perspectives into structured compliance advice. We evaluated the prototype with the same experts using the System Usability Scale (SUS), the NASA Task Load Index (NASA-TLX), and interviews. Results show exceptional usability (SUS: 77.5/100) and minimal cognitive workload, with three key findings: (1) multi-agent systems can preserve tacit knowledge into standardized workflows, (2) information augmentation achieves higher acceptance than decision automation, and (3) successful enterprise AI should mirror organizational structures. This work contributes design principles for human-AI collaboration in compliance review, with broader implications for regulated industries beyond beauty tech.
comment: International Conference on Human-Engaged Computing (ICHEC 2025), Singapore
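For readers unfamiliar with how a SUS figure such as the one reported above is obtained, the sketch below applies the standard scoring rule for the ten-item questionnaire: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The function name is illustrative, and no participant responses from the study are reproduced here.

```python
def sus_score(responses):
    """Score one participant's System Usability Scale questionnaire.

    `responses` holds ten answers on a 1-5 scale, in questionnaire order.
    Returns a value between 0 and 100.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("each response must be between 1 and 5")
        if item_number % 2 == 1:
            # Odd items are positively worded.
            total += response - 1
        else:
            # Even items are negatively worded, so the scale is reversed.
            total += 5 - response
    return total * 2.5
```

A study-level score is usually the mean of the per-participant scores.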
Designing-with More-than-Human Through Human Augmentation
The recent more-than-human turn in design calls for "designing-with" other species and ecologies beyond humans. Yet, as Thomas Nagel's famous "What is it like to be a bat?" thought experiment highlights, human experience is constrained by our own sensorium and an irreducible gap in phenomenal access to nonhuman lifeworlds. This paper proposes More-than-Human through Human Augmentation (MtHtHA, denoted ">HtH+") as a design approach that repurposes human augmentation technologies, typically aimed at enhancing human capabilities, away from human optimization and exceptionalism and toward eco-phenomenological awareness. Grounded in somaesthetic design and eco-somatics, MtHtHA entails creating temporary, embodied experiences that modulate the human Umwelt to re-sensitize us to pluriversal more-than-human perceptions. We articulate seven design principles and report five design cases: EchoVision (bat-like echolocation), FeltSight (star-nosed-mole tactile navigation), FungiSync (fungal network attunement), TentacUs (octopus-like distributed agency), and City of Sparkles (urban data from AI's perspective). We demonstrate that such experiential "designing-with" can cultivate ecological awareness, empathy and obligations of care across species boundaries.
comment: Submitted to DRS 2026
Accepted with Minor Revisions: Value of AI-Assisted Scientific Writing
Large Language Models have seen expanding application across domains, yet their effectiveness as assistive tools for scientific writing -- an endeavor requiring precision, multimodal synthesis, and domain expertise -- remains insufficiently understood. We examine the potential of LLMs to support domain experts in scientific writing, with a focus on abstract composition. We design an incentivized randomized controlled trial with a hypothetical conference setup where participants with relevant expertise are split into an author and reviewer pool. Inspired by methods in behavioral science, our novel incentive structure encourages authors to edit the provided abstracts to an acceptable quality for a peer-reviewed submission. Our 2x2 between-subject design spans two dimensions: the implicit source of the provided abstract and the disclosure of it. We find that, without source attribution, authors make more edits to human-written abstracts than to AI-generated ones, often guided by the higher perceived readability of AI-generated text. Upon disclosure of source information, the volume of edits converges in both source treatments. Reviewer decisions remain unaffected by the source of the abstract, but bear a significant correlation with the number of edits made. Careful stylistic edits, especially in the case of AI-generated abstracts, in the presence of source information, improve the chance of acceptance. We find that AI-generated abstracts hold potential to reach comparable levels of acceptability to human-written ones with minimal revision, and that perceptions of AI authorship, rather than objective quality, drive much of the observed editing behavior. Our findings underscore the significance of source disclosure in collaborative scientific writing.
A Proxy-Based Method for Mapping Discrete Emotions onto VAD model
Mapping discrete and dimensional models of emotion remains a persistent challenge in affective science and computing. This incompatibility hinders the combination of valuable data sets, creating a significant bottleneck for training robust machine learning models. To bridge this gap, this paper presents a novel, human-centric, proxy-based approach that transcends purely computational or direct mapping techniques. Implemented through a web-based survey, the method utilizes simple, user-generated geometric animations as intermediary artifacts to establish a correspondence between discrete emotion labels and the continuous valence-arousal-dominance (VAD) space. The approach involves a two-phase process: first, each participant creates an animation to represent a given emotion label (encoding); then, they immediately assess their own creation on the three VAD dimensions. The method was empirically validated and refined through two iterative user studies. The results confirmed the method's robustness. Combining the data from both studies generated a final, comprehensive mapping between discrete and dimensional models.
Detecting LLM-Assisted Academic Dishonesty using Keystroke Dynamics
The rapid adoption of generative AI tools has intensified the challenge of maintaining academic integrity. Conventional plagiarism detectors, which rely on text-matching or text-intrinsic features, often fail to identify submissions that have been AI-assisted or paraphrased. To address this limitation, we introduce keystroke-dynamics-based detectors that analyze how, rather than what, a person writes to distinguish genuine from assisted writing. Building on our earlier study, which collected keystroke data from 40 participants and trained a modified TypeNet model to detect assisted text, we expanded the dataset by adding 90 new participants and introducing a paraphrasing-based plagiarism-detection mode. We then benchmarked two additional gradient-boosting classifiers, LightGBM and CatBoost, alongside TypeNet, and compared their performance with DetectGPT, LLaMA 3.3 70B Instruct, and the results of 44 human evaluators. To further assess and improve robustness, we proposed a deception-based threat model simulating forged keystrokes and applied adversarial training as a countermeasure. Results show that the machine learning models achieve F1 scores above 97% in structured settings, while TypeNet performs best in detecting paraphrasing, with an F1 score of 86.9%. In contrast, text-only detectors and human evaluators perform near-chance, demonstrating that keystroke dynamics provide a strong behavioral signal for identifying AI-assisted plagiarism and support the use of multimodal behavioral features for reliable academic integrity assessment.
comment: 17 pages, 4 figures, 5 tables, extension of IJCB 2024 paper, and under review at IEEE TBIOM
Real-Time Drivers' Drowsiness Detection and Analysis through Deep Learning
A long road trip can be fun for drivers. However, driving for days to meet stringent deadlines at distant destinations can be tedious. Such a scenario forces drivers to cover extra miles and drive extra hours each day without sufficient rest and breaks, which can trigger drowsiness at the wheel. Drowsy driving can be life-threatening to the driver and can endanger other road users; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger an alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs) and OpenCV. Our proposed and implemented model captures real-time facial images of a driver using a live camera and uses the OpenCV library from Python to examine the facial images for facial landmarks such as eye openness and yawn-like mouth movements. The DCNN framework then gathers the data and uses a pre-trained model to detect driver drowsiness from these facial landmarks. If the driver is identified as drowsy, the system issues a continuous alert in real time, embedded in the Smart Car technology. By potentially saving lives on the roadways, the proposed technique offers a non-invasive and inexpensive way to identify drowsiness. Our DCNN-based drowsiness detection model achieves classification accuracies of 99.6% and 97% on the NTHU-DDD dataset and the Yawn-Eye-Dataset, respectively.
Multi-Domain EEG Representation Learning with Orthogonal Mapping and Attention-based Fusion for Cognitive Load Classification
We propose a new representation learning solution for the classification of cognitive load based on Electroencephalogram (EEG). Our method integrates both time and frequency domains by first passing the raw EEG signals through the convolutional encoder to obtain the time domain representations. Next, we measure the Power Spectral Density (PSD) for all five EEG frequency bands and generate the channel power values as 2D images referred to as multi-spectral topography maps. These multi-spectral topography maps are then fed to a separate encoder to obtain the representations in frequency domain. Our solution employs a multi-domain attention module that maps these domain-specific embeddings onto a shared embedding space to emphasize more on important inter-domain relationships to enhance the representations for cognitive load classification. Additionally, we incorporate an orthogonal projection constraint during the training of our method to effectively increase the inter-class distances while improving intra-class clustering. This enhancement allows efficient discrimination between different cognitive states and aids in better grouping of similar states within the feature space. We validate the effectiveness of our model through extensive experiments on two public EEG datasets, CL-Drive and CLARE for cognitive load classification. Our results demonstrate the superiority of our multi-domain approach over the traditional single-domain techniques. Moreover, we conduct ablation and sensitivity analyses to assess the impact of various components of our method. Finally, robustness experiments on different amounts of added noise demonstrate the stability of our method compared to other state-of-the-art solutions.
comment: This work has been submitted to the Transactions on Human Machine Systems for possible publication
Grating haptic perception through touchscreen: Sighted vs. Visually Impaired
Providing haptic feedback via a smartphone touchscreen may offer blind people a way to understand graphs. This study investigated the discrimination performance of haptic gratings at different frequencies in both visually impaired (VI) and sighted (S) individuals. Six VI participants and ten S participants took part in two experiments designed to compare their ability to interpret grating images by swiping a finger across a smartphone touchscreen without vision. The swipe gesture activates phone vibration temporally synchronized with the black stripes. Their tasks were: (1) determining whether a grating pattern is presented on the touchscreen, and (2) comparing two different grating frequencies and determining the wider one. Results demonstrated that the VI group exhibited superior tactile sensitivity compared to the S group, as evidenced by their significantly better performance in Experiment 1 (accuracy of 99.15% vs. 84.5%). Experiment 2 revealed that the performance of VI participants peaked at around 0.270 cycles per mm (83.3% accuracy), a frequency very similar to Braille dot spacing, while that of the S group peaked at around 0.963 cycles per mm (70% accuracy). The findings suggest that tactile stimulation coded with grating patterns could be used to present interpretable graphs to the visually impaired. Such an approach could be of value to research in human-computer interaction and sensory adaptation.
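The synchronization between the swipe and the black stripes can be illustrated with a small sketch. It assumes a square-wave grating with a 50% duty cycle whose first black stripe starts at x = 0; the function names and the duty-cycle parameter are hypothetical and are not taken from the study.

```python
def grating_period_mm(cycles_per_mm: float) -> float:
    """Width of one full black-plus-white cycle in millimetres."""
    return 1.0 / cycles_per_mm


def vibration_on(x_mm: float, cycles_per_mm: float, duty: float = 0.5) -> bool:
    """True when a finger at x_mm is over a black stripe.

    Assumption: square-wave grating, first black stripe starting at x = 0.
    """
    phase = (x_mm * cycles_per_mm) % 1.0  # position within the current cycle, in [0, 1)
    return phase < duty


if __name__ == "__main__":
    for freq in (0.270, 0.963):  # the two peak frequencies reported in the abstract
        print(freq, "cycles/mm ->", round(grating_period_mm(freq), 2), "mm per cycle")
```

Under these assumptions, 0.270 cycles per mm corresponds to a cycle of roughly 3.7 mm, and 0.963 cycles per mm to roughly 1.04 mm.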
Neurocognitive Modeling for Text Generation: Deep Learning Architecture for EEG Data
Text generating capabilities have undergone a substantial transformation with the introduction of large language models (LLMs). Electroencephalography (EEG)-based text production is still difficult, though, because it requires a lot of data and processing power. This paper introduces a new method that combines the use of the Gemma 2B LLM with a classifier-LLM architecture to incorporate a Recurrent Neural Network (RNN) encoder. Our approach drastically lowers the amount of data and compute power needed while achieving performance close to that of cutting-edge methods. Notably, compared to current methodologies, our methodology delivers an overall performance improvement of 10%. The suggested architecture demonstrates the possibility of effective transfer learning for EEG-based text production, remaining strong and functional even in the face of data limits. This work highlights the potential of integrating LLMs with EEG decoding to improve assistive technologies and improve independence and communication for those with severe motor limitations. Our method pushes the limits of present capabilities and opens new paths for research and application in brain-computer interfaces by efficiently using the strengths of pre-trained language models. This makes EEG-based text production more accessible and efficient.
comment: 15 pages, 10 figures, 5 tables
Co-Designing Multimodal Systems for Accessible Remote Dance Instruction
Videos make exercise instruction widely available, but they rely on visual demonstrations that blind and low vision (BLV) learners cannot see. While audio descriptions (AD) can make videos accessible, describing movements remains challenging as the AD must convey what to do (mechanics, location, orientation) and how to do it (speed, fluidity, timing). Prior work thus used multimodal instruction to support BLV learners with individual simple movements. However, it is unclear how these approaches scale to dance instruction with unique, complex movements and precise timing constraints. To inform accessible remote dance instruction systems, we conducted three co-design workshops (N=28) with BLV dancers, instructors, and experts in sound, haptics, and AD. Participants designed 8 systems revealing common themes: staged learning to dissect routines, crafting vocabularies for movements, and selectively using modalities (narration for movement structure, sound for expression, and haptics for spatial cues). We conclude with design recommendations to make learning dance accessible.
Image and Video Processing
MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging
In the field of medical imaging, AI-assisted techniques such as object detection, segmentation, and classification are widely employed to alleviate the workload of physicians and doctors. However, single-task models are predominantly used, overlooking the shared information across tasks. This oversight leads to inefficiencies in real-life applications. In this work, we propose MTMed3D, a novel end-to-end Multi-task Transformer-based model to address the limitations of single-task models by jointly performing 3D detection, segmentation, and classification in medical imaging. Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders. The proposed framework was evaluated on the BraTS 2018 and 2019 datasets, achieving promising results across all three tasks, especially in detection, where our method achieves better results than prior works. Additionally, we compare our multi-task model with equivalent single-task variants trained separately. Our multi-task model significantly reduces computational costs and achieves faster inference speed while maintaining comparable performance to the single-task models, highlighting its efficiency advantage. To the best of our knowledge, this is the first work to leverage Transformers for multi-task learning that simultaneously covers detection, segmentation, and classification tasks in 3D medical imaging, presenting its potential to enhance diagnostic processes. The code is available at https://github.com/fanlimua/MTMed3D.git.
RAA-MIL: A Novel Framework for Classification of Oral Cytology
Cytology is a valuable tool for early detection of oral squamous cell carcinoma (OSCC). However, manual examination of cytology whole slide images (WSIs) is slow, subjective, and depends heavily on expert pathologists. To address this, we introduce the first weakly supervised deep learning framework for patient-level diagnosis of oral cytology whole slide images, leveraging the newly released Oral Cytology Dataset [1], which provides annotated cytology WSIs from ten medical centres across India. Each patient case is represented as a bag of cytology patches and assigned a diagnosis label (Healthy, Benign, Oral Potentially Malignant Disorders (OPMD), OSCC) by an in-house expert pathologist. These patient-level weak labels form a new extension to the dataset. We evaluate a baseline multiple-instance learning (MIL) model and a proposed Region-Affinity Attention MIL (RAA-MIL) that models spatial relationships between regions within each slide. The RAA-MIL achieves an average accuracy of 72.7%, weighted F1-score of 0.69 on an unseen test set, outperforming the baseline. This study establishes the first patient-level weakly supervised benchmark for oral cytology and moves toward reliable AI-assisted digital pathology.
comment: Under Review at IEEE ISBI 2026
Multimodal RGB-HSI Feature Fusion with Patient-Aware Incremental Heuristic Meta-Learning for Oral Lesion Classification
Early detection of oral cancer and potentially malignant disorders is challenging in low-resource settings due to limited annotated data. We present a unified four-class oral lesion classifier that integrates deep RGB embeddings, hyperspectral reconstruction, handcrafted spectral-textural descriptors, and demographic metadata. A pathologist-verified subset of oral cavity images was curated and processed using a fine-tuned ConvNeXt-v2 encoder, followed by RGB-to-HSI reconstruction into 31-band hyperspectral cubes. Haemoglobin-sensitive indices, texture features, and spectral-shape measures were extracted and fused with deep and clinical features. Multiple machine-learning models were assessed with patient-wise validation. We further introduce an incremental heuristic meta-learner (IHML) that combines calibrated base classifiers through probabilistic stacking and patient-level posterior smoothing. On an unseen patient split, the proposed framework achieved a macro F1 of 66.23% and an accuracy of 64.56%. Results demonstrate that hyperspectral reconstruction and uncertainty-aware meta-learning substantially improve robustness for real-world oral lesion screening.
comment: 4 pages, 1 figure, 2 tables
Bregman geometry-aware split Gibbs sampling for Bayesian Poisson inverse problems
This paper proposes a novel Bayesian framework for solving Poisson inverse problems by devising a Monte Carlo sampling algorithm which accounts for the underlying non-Euclidean geometry. To address the challenges posed by the Poisson likelihood -- such as non-Lipschitz gradients and positivity constraints -- we derive a Bayesian model which leverages exact and asymptotically exact data augmentations. In particular, the augmented model incorporates two sets of splitting variables both derived through a Bregman divergence based on the Burg entropy. Interestingly the resulting augmented posterior distribution is characterized by conditional distributions which benefit from natural conjugacy properties and preserve the intrinsic geometry of the latent and splitting variables. This allows for efficient sampling via Gibbs steps, which can be performed explicitly for all conditionals, except the one incorporating the regularization potential. For this latter, we resort to a Hessian Riemannian Langevin Monte Carlo (HRLMC) algorithm which is well suited to handle priors with explicit or easily computable score functions. By operating on a mirror manifold, this Langevin step ensures that the sampling satisfies the positivity constraints and more accurately reflects the underlying problem structure. Performance results obtained on denoising, deblurring, and positron emission tomography (PET) experiments demonstrate that the method achieves competitive performance in terms of reconstruction quality compared to optimization- and sampling-based approaches.
Prompt-Conditioned FiLM and Multi-Scale Fusion on MedSigLIP for Low-Dose CT Quality Assessment
We propose a prompt-conditioned framework built on MedSigLIP that injects textual priors via Feature-wise Linear Modulation (FiLM) and multi-scale pooling. Text prompts condition patch-token features on clinical intent, enabling data-efficient learning and rapid adaptation. The architecture combines global, local, and texture-aware pooling through separate regression heads fused by a lightweight MLP, trained with pairwise ranking loss. Evaluated on the LDCTIQA2023 (a public LDCT quality assessment challenge) with 1,000 training images, we achieve PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301, surpassing the top-ranked published challenge submissions and demonstrating the effectiveness of our prompt-guided approach.
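The three reported scores (PLCC, SROCC, KROCC) are the Pearson, Spearman, and Kendall correlations between predicted and reference quality scores. A minimal pure-Python sketch of how such scores can be computed follows; the Kendall variant shown is tau-a, which is an assumption, since the abstract does not say how the challenge handles ties.

```python
from itertools import combinations
from math import sqrt


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def _ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(_ranks(x), _ranks(y))


def krocc(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties contribute zero."""
    score, pairs = 0, 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
        pairs += 1
    return score / pairs
```

PLCC measures linear agreement, while SROCC and KROCC depend only on the ordering of the scores, so a model that ranks images correctly but on a distorted scale would still score well on the latter two.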
Deep Unfolded BM3D: Unrolling Non-local Collaborative Filtering into a Trainable Neural Network
Block-Matching and 3D Filtering (BM3D) exploits non-local self-similarity priors for denoising but relies on fixed parameters. Deep models such as U-Net are more flexible but often lack interpretability and fail to generalize across noise regimes. In this study, we propose Deep Unfolded BM3D (DU-BM3D), a hybrid framework that unrolls BM3D into a trainable architecture by replacing its fixed collaborative filtering with a learnable U-Net denoiser. This preserves BM3D's non-local structural prior while enabling end-to-end optimization. We evaluate DU-BM3D on low-dose CT (LDCT) denoising and show that it outperforms classic BM3D and standalone U-Net across simulated LDCT at different noise levels, yielding higher PSNR and SSIM, especially in high-noise conditions.
Recursive Threshold Median Filter and Autoencoder for Salt-and-Pepper Denoising: SSIM analysis of Images and Entropy Maps
This paper studies the removal of salt-and-pepper noise from images using a median filter (MF) and a simple three-layer autoencoder (AE) within a recursive threshold algorithm. Denoising performance is assessed with two metrics: the standard Structural Similarity Index SSIMImg of restored and clean images, and a newly applied metric, SSIMMap, the SSIM of entropy maps of these images computed via 2D Sample Entropy in sliding windows. We show that SSIMMap is more sensitive to blur and local intensity transitions and complements SSIMImg. Experiments on low- and high-resolution grayscale images demonstrate that the recursive threshold MF robustly restores images even under strong noise (50-60%), whereas the simple AE is only capable of restoring images with low levels of noise (<30%). We propose two scalable schemes: (i) 2MF, which uses two MFs with different window sizes and a final thresholding step, effective for highlighting sharp local details at low resolution; and (ii) MFs-AE, which aggregates features from multiple MFs via an AE and is beneficial for restoring the overall scene structure at higher resolution. Owing to its simplicity and computational efficiency, MF remains preferable for deployment on resource-constrained platforms (edge/IoT), whereas AE underperforms without prior denoising. The results also validate the practical value of SSIMMap for objective blur assessment and denoising parameter tuning.
comment: 14 pages, 13 figures, 4 tables
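The idea of a median filter applied recursively under a threshold test can be sketched as follows. This is a generic illustration, not the paper's algorithm: it assumes 8-bit grayscale images stored as lists of lists, treats pixels at the extremes (0 or 255) as noise candidates, and uses a fixed 3x3 window; all function names are hypothetical.

```python
from statistics import median


def is_noise(v, low=0, high=255):
    """Threshold test: salt-and-pepper pixels sit at the extremes of the range."""
    return v <= low or v >= high


def median_pass(img, low=0, high=255):
    """Replace each suspected noisy pixel by the median of its clean 3x3 neighbours.

    Returns the new image and the number of pixels changed in this pass.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    changed = 0
    for r in range(h):
        for c in range(w):
            if not is_noise(img[r][c], low, high):
                continue
            clean = [
                img[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if not is_noise(img[rr][cc], low, high)
            ]
            if clean:  # otherwise leave the pixel for a later pass
                out[r][c] = median(clean)
                changed += 1
    return out, changed


def recursive_median_filter(img, max_passes=10):
    """Repeat passes until no pixel changes or max_passes is reached."""
    for _ in range(max_passes):
        img, changed = median_pass(img)
        if changed == 0:
            break
    return img
```

Repeating the pass matters under dense noise: a noisy pixel surrounded only by other noisy pixels has no clean neighbours in the first pass, but gains some once its neighbours have been restored. A limitation of this simple threshold is that genuinely black or white pixels are also treated as noise.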
Volumetric Ultrasound via 3D Null Subtraction Imaging with Circular and Spiral Apertures
Volumetric ultrasound imaging faces a fundamental trade-off among image quality, frame rate, and hardware complexity. This study introduces three-dimensional Null Subtraction Imaging (3D NSI), a nonlinear beamforming framework that addresses this trade-off by combining computationally efficient null-subtraction process with multiplexing-aware sparse aperture designs on matrix arrays. We evaluate three apodization configurations: a fully addressed circular aperture and two Fermat's spiral sparse apertures. To overcome channel-sharing constraints common in matrix arrays multiplexed with low-channel-count ultrasound systems, we propose a spiral "no-reuse" apodization that enforces non-overlapping element sets across transmit-receive events. This design resolves multiplexing conflicts and enables up to a 16-fold increase in acquisition volume rate using only 240 active elements on a 1024-element probe. In computer simulations and tissue-mimicking phantom experiments, 3D NSI achieved an average improvement of 36% in azimuthal and elevational resolutions, along with an approximately 20% higher contrast ratio, compared to the conventional Delay-and-Sum (DAS) beamformer under matched transmit/receive configurations. When implemented with the spiral no-reuse aperture, the 3D NSI framework achieved over 1000 volumes per second with a computational load less than three times that of DAS, making it a practical solution for real-time 4D imaging.
comment: 10 pages, 12 figures
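A Fermat's spiral sparse aperture can be sketched by placing points at radius proportional to the square root of the index and rotating by the golden angle between points. The sketch below assumes the 1024-element probe is a 32 x 32 matrix with unit pitch and snaps each spiral point to the nearest element; both the layout and the snapping rule are illustrative assumptions, and the paper's "no-reuse" constraint across transmit-receive events is not modelled.

```python
from math import cos, pi, sin, sqrt

GOLDEN_ANGLE = pi * (3.0 - sqrt(5.0))  # about 2.4 rad (137.5 degrees)


def fermat_spiral(n_points, radius):
    """Return n_points (x, y) positions on a Fermat spiral inside a disc."""
    pts = []
    for k in range(n_points):
        r = radius * sqrt((k + 0.5) / n_points)  # sqrt spacing keeps areal density roughly uniform
        theta = k * GOLDEN_ANGLE
        pts.append((r * cos(theta), r * sin(theta)))
    return pts


def snap_to_grid(points, n_side=32, pitch=1.0):
    """Map each position to the nearest (row, col) of an n_side x n_side array centred at the origin."""
    half = (n_side - 1) / 2.0
    chosen = set()
    for x, y in points:
        col = min(n_side - 1, max(0, round(x / pitch + half)))
        row = min(n_side - 1, max(0, round(y / pitch + half)))
        chosen.add((row, col))
    return chosen


if __name__ == "__main__":
    active = snap_to_grid(fermat_spiral(240, 15.5))
    print(len(active), "distinct active elements")
```

Because several spiral points can snap to the same element, the number of distinct active elements may be smaller than the number of spiral points; a practical design would resolve such collisions explicitly.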
DCA-LUT: Deep Chromatic Alignment with 5D LUT for Purple Fringing Removal
Purple fringing, a persistent artifact caused by Longitudinal Chromatic Aberration (LCA) in camera lenses, has long degraded the clarity and realism of digital imaging. Traditional solutions rely on complex and expensive apochromatic (APO) lens hardware and the extraction of handcrafted features, ignoring the data-driven approach. To fill this gap, we introduce DCA-LUT, the first deep learning framework for purple fringing removal. Inspired by the physical root of the problem, the spatial misalignment of RGB color channels due to lens dispersion, we introduce a novel Chromatic-Aware Coordinate Transformation (CA-CT) module, learning an image-adaptive color space to decouple and isolate fringing into a dedicated dimension. This targeted separation allows the network to learn a precise "purple fringe channel", which then guides the accurate restoration of the luminance channel. The final color correction is performed by a learned 5D Look-Up Table (5D LUT), enabling efficient and powerful non-linear color mapping. To enable robust training and fair evaluation, we constructed a large-scale synthetic purple fringing dataset (PF-Synth). Extensive experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in purple fringing removal.
comment: 11 pages, 9 figures
Noisy MRI Reconstruction via MAP Estimation with an Implicit Deep-Denoiser Prior
Accelerating magnetic resonance imaging (MRI) remains challenging, particularly under realistic acquisition noise. While diffusion models have recently shown promise for reconstructing undersampled MRI data, many approaches lack an explicit link to the underlying MRI physics, and their parameters are sensitive to measurement noise, limiting their reliability in practice. We introduce Implicit-MAP (ImMAP), a diffusion-based reconstruction framework that integrates the acquisition noise model directly into a maximum a posteriori (MAP) formulation. Specifically, we build on the stochastic ascent method of Kadkhodaie et al. and generalize it to handle MRI encoding operators and realistic measurement noise. Across both simulated and real noisy datasets, ImMAP consistently outperforms state-of-the-art deep learning (LPDSNet) and diffusion-based (DDS) methods. By clarifying the practical behavior and limitations of diffusion models under realistic noise conditions, ImMAP establishes a more reliable and interpretable
comment: 4 figures, conference paper
Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors
Fiber Specklegram Sensors (FSS) are highly effective for environmental monitoring, particularly for detecting temperature variations. However, the nonlinear nature of specklegram data presents significant challenges for accurate temperature prediction. This study investigates the use of transformer-based architectures, including Vision Transformers (ViTs), Swin Transformers, and emerging models such as Learnable Importance Non-Symmetric Attention Vision Transformers (LINA-ViT) and Multi-Adaptive Proximity Vision Graph Attention Transformers (MAP-ViGAT), to predict temperature from specklegram data over a range of 0 to 120 Celsius. The results show that ViTs achieved a Mean Absolute Error (MAE) of 1.15, outperforming traditional models such as CNNs. GAT-ViT and MAP-ViGAT variants also demonstrated competitive accuracy, highlighting the importance of adaptive attention mechanisms and graph-based structures in capturing complex modal interactions and phase shifts in specklegram data. Additionally, this study incorporates Explainable AI (XAI) techniques, including attention maps and saliency maps, to provide insights into the decision-making processes of the transformer models, improving interpretability and transparency. These findings establish transformer architectures as strong benchmarks for optical fiber-based temperature sensing and offer promising directions for industrial monitoring and structural health assessment applications.
A Fractional Variational Approach to Spectral Filtering Using the Fourier Transform
The interference of fluorescence signals and noise remains a significant challenge in Raman spectrum analysis, often obscuring subtle spectral features that are critical for accurate analysis. Inspired by variational methods similar to those used in image denoising, our approach minimizes a functional involving fractional derivatives to balance noise suppression with the preservation of essential chemical features of the signal, such as peak position, intensity, and area. The original problem is reformulated in the frequency domain through the Fourier transform, making the implementation simple and fast. In this work, we discuss the theoretical framework, practical implementation, and the advantages and limitations of this method in the context of simulated Raman data, as well as in image processing. The main contribution of this article is the combination of a variational approach in the frequency domain, the use of fractional derivatives, and the optimization of the regularization parameter and derivative order through the concept of Shannon entropy. This work explores how the fractional order, combined with the regularization parameter, affects noise removal and preserves the essential features of the spectrum and image. Finally, the study shows that the combination of the proposed strategies produces an efficient, robust, and easily implementable filter.
comment: 31 pages, 3 figures, 2 tables
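A frequency-domain filter of this kind can be sketched under a stated assumption: if the functional is the quadratic form ||u - f||^2 + lam * ||D^alpha u||^2 (the abstract does not give its exact form), the minimizer has Fourier coefficients f_hat(w) / (1 + lam * |w|^(2 * alpha)). The code below applies that gain with a naive DFT so that it needs only the standard library; the entropy-based choice of lam and alpha described in the abstract is not modelled.

```python
import cmath
from math import pi


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * pi * k * t / n) for t in range(n)) for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * pi * k * t / n) for k in range(n)) / n for t in range(n)]


def fractional_filter(signal, lam, alpha):
    """Attenuate each frequency by 1 / (1 + lam * |w|^(2 * alpha)).

    Assumption: quadratic fidelity plus a squared fractional-derivative penalty.
    """
    n = len(signal)
    filtered = []
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k)  # symmetric frequency index keeps the output real
        omega = 2 * pi * freq / n
        gain = 1.0 / (1.0 + lam * omega ** (2 * alpha))
        filtered.append(coeff * gain)
    return [v.real for v in idft(filtered)]
```

The gain equals 1 at zero frequency and decays as the frequency grows, so a constant baseline passes through unchanged while rapid oscillations are damped; a larger alpha makes the decay steeper and a larger lam moves the cutoff to lower frequencies.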
Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems
Conventional score-based diffusion models (DMs) may struggle with anisotropic Gaussian diffusion processes due to the required inversion of covariance matrices in the denoising score matching training objective (Vincent, 2011). We propose Whitened Score (WS) diffusion models, a novel framework based on stochastic differential equations that learns the Whitened Score function instead of the standard score. This approach circumvents covariance inversion, extending score-based DMs by enabling stable training of DMs on arbitrary Gaussian forward noising processes. WS DMs establish equivalence with flow matching for arbitrary Gaussian noise, allow for tailored spectral inductive biases, and provide strong Bayesian priors for imaging inverse problems with structured noise. We experiment with a variety of computational imaging tasks using the CIFAR, CelebA ($64\times64$), and CelebA-HQ ($256\times256$) datasets and demonstrate that WS diffusion priors trained on anisotropic Gaussian noising processes consistently outperform conventional diffusion priors based on isotropic Gaussian noise. Our code is open-sourced at https://github.com/jeffreyalido/wsdiffusion.
DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation
Numerous deep learning-based solutions have been developed for the automatic recognition of breast cancer using mammography images. However, their performance often declines when applied to data from different domains, primarily due to domain shift - the variation in data distributions between source and target domains. This performance drop limits the safe and equitable deployment of AI in real-world clinical settings. In this study, we present DoSReMC (Domain Shift Resilient Mammography Classification), a batch normalization (BN) adaptation framework designed to enhance cross-domain generalization without retraining the entire model. Using three large-scale full-field digital mammography (FFDM) datasets - including HCTP, a newly introduced, pathologically confirmed in-house dataset - we conduct a systematic cross-domain evaluation with convolutional neural networks (CNNs). Our results demonstrate that BN layers are a primary source of domain dependence: they perform effectively when training and testing occur within the same domain, and they significantly impair model generalization under domain shift. DoSReMC addresses this limitation by fine-tuning only the BN and fully connected (FC) layers, while preserving pretrained convolutional filters. We further integrate this targeted adaptation with an adversarial training scheme, yielding additional improvements in cross-domain generalizability while reducing the computational cost of model training. DoSReMC can be readily incorporated into existing AI pipelines and applied across diverse clinical environments, providing a practical pathway toward more robust and generalizable mammography classification systems.
Beyond H&E: Unlocking Pathological Insights with Polarization Imaging
Histopathology image analysis is fundamental to digital pathology, with hematoxylin and eosin (H&E) staining as the gold standard for diagnostic and prognostic assessments. While H&E imaging effectively highlights cellular and tissue structures, it lacks sensitivity to birefringence and tissue anisotropy, which are crucial for assessing collagen organization, fiber alignment, and microstructural alterations--key indicators of tumor progression, fibrosis, and other pathological conditions. To bridge this gap, we construct a polarization imaging system and curate a new dataset of over 13,000 paired Polar-H&E images. Visualizations of polarization properties reveal distinctive optical signatures in pathological tissues, underscoring their diagnostic value. Building on this dataset, we propose PolarHE, a dual-modality fusion framework that integrates H&E with polarization imaging, leveraging the latter's ability to enhance tissue characterization. Our approach employs a feature decomposition strategy to disentangle common and modality-specific features, ensuring effective multimodal representation learning. Through comprehensive validation, our approach significantly outperforms previous methods, achieving an accuracy of 86.70% on the Chaoyang dataset and 89.06% on the MHIST dataset. These results demonstrate that polarization imaging is a powerful and underutilized modality in computational pathology, enriching feature representation and improving diagnostic accuracy. PolarHE establishes a promising direction for multimodal learning, paving the way for more interpretable and generalizable pathology models.
comment: Accepted as a regular paper at IEEE BIBM 2025
Constructing Per-Shot Bitrate Ladders using Visual Information Fidelity
Video service providers need their delivery systems to be able to adapt to network conditions, user preferences, display settings, and other factors. HTTP Adaptive Streaming (HAS) offers dynamic switching between different video representations to simultaneously enhance bandwidth consumption and users' streaming experiences. Per-shot encoding, pioneered by Netflix, optimizes the encoding parameters on each scene or shot. The Dynamic Optimizer (DO) uses the Video Multi-Method Assessment Fusion (VMAF) perceptual video quality prediction engine to deliver high-quality videos at reduced bitrates. Here we develop a perceptually optimized method of constructing optimal per-shot bitrate and quality ladders, using an ensemble of low-level features and Visual Information Fidelity (VIF) features. During inference, our method predicts the bitrate or quality ladder of a source video without any compression or quality estimation. We compare the performance of our model against other content-adaptive bitrate ladder prediction methods, a fixed bitrate ladder, and reference bitrate ladders constructed via exhaustive encoding using Bjontegaard-delta (BD) metrics. Our proposed method shows excellent gains in bitrate and quality against the fixed bitrate ladder and only small losses against the reference bitrate ladder, while providing significant computational advantages.
comment: Accepted to IEEE Transactions on Image Processing
Multimedia
To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance
Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.
ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation
Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
Calibrated Multimodal Representation Learning with Missing Modalities
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.
Can LLMs Create Legally Relevant Summaries and Analyses of Videos?
Understanding the legally relevant factual basis of an event and conveying it through text is a key skill of legal professionals. This skill is important for preparing forms (e.g., insurance claims) or other legal documents (e.g., court claims), but often presents a challenge for laypeople. Current AI approaches aim to bridge this gap, but mostly rely on the user to articulate what has happened in text, which may be challenging for many. Here, we investigate the capability of large language models (LLMs) to understand and summarize events occurring in videos. We ask an LLM to summarize and draft legal letters, based on 120 YouTube videos showing legal issues in various domains. Overall, 71.7% of the summaries were rated as of high or medium quality, which is a promising result, opening the door to a number of applications in, e.g., access to justice.
comment: Accepted for publication at JURIX 2025 Torino, Italy. This is the preprint version. Code and data available at: https://github.com/maastrichtlawtech/jurix2025_LLM_video_analysis
Detection of Cyberbullying in GIF using AI
Cyberbullying is a well-known social issue, and it is escalating day by day. With the rapid development of the internet, social media provide many different ways for users to express their opinions and exchange information. Cyberbullying occurs on social media through text messages, comments, shared images, GIFs or stickers, and audio and video. Much research has been done on detecting cyberbullying in textual data, and some studies address images. Very few studies address detecting cyberbullying in GIFs/stickers. We collect a GIF dataset from Twitter and apply a deep learning model to detect cyberbullying in the dataset. First, we extracted hashtags related to cyberbullying using Twitter. We used these hashtags to download GIF files using the publicly available GIPHY API. We collected over 4,100 GIFs, including both cyberbullying and non-cyberbullying examples. We applied the pre-trained deep learning model VGG16 to detect cyberbullying. The model achieved an accuracy of 97%. Our work provides the GIF dataset for researchers working in this area.
MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers
Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, a zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.
comment: This work has been submitted to the IEEE for possible publication
Hierarchical Knowledge Graphs for Story Understanding in Visual Narratives
We present a hierarchical knowledge graph framework for the structured semantic understanding of visual narratives, using comics as a representative domain for multimodal storytelling. The framework organizes narrative content across three levels (panel, event, and macro-event) by integrating symbolic graphs that encode semantic, spatial, and temporal relationships. At the panel level, it models visual elements such as characters, objects, and actions alongside textual components including dialogue and narration. These are systematically connected to higher-level graphs that capture narrative sequences and abstract story structures. Applied to a manually annotated subset of the Manga109 dataset, the framework supports interpretable symbolic reasoning across four representative tasks: action retrieval, dialogue tracing, character appearance mapping, and timeline reconstruction. Rather than prioritizing predictive performance, the system emphasizes transparency in narrative modeling and enables structured inference aligned with cognitive theories of event segmentation and visual storytelling. This work contributes to explainable narrative analysis and offers a foundation for authoring tools, narrative comprehension systems, and interactive media applications.
comment: Updated with the ICIDS 2025 camera-ready version. This revision includes the final title, updated abstract, improved explanations of the narrative coherence framework, and minor editorial changes. Figures and examples have been refined for clarity. No new experiments were added
MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which is further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting cancer diagnosis.
comment: Accepted by IEEE Transactions on Medical Imaging (TMI). Code available at https://github.com/TianyiFranklinWang/MIRROR. Project page: https://tianyifranklinwang.github.io/MIRROR
Computation and Language
From Phonemes to Meaning: Evaluating Large Language Models on Tamil
Large Language Models (LLMs) have shown strong generalization across tasks in high-resource languages; however, their linguistic competence in low-resource and morphologically rich languages such as Tamil remains largely unexplored. Existing multilingual benchmarks often rely on translated English datasets, failing to capture the linguistic and cultural nuances of the target language. To address this gap, we introduce ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark manually curated using 820 questions from Sri Lankan school-level Tamil subject examination papers. Each question is annotated by trained linguists under five linguistic categories and a factual knowledge category, spanning Grades 1-13 to ensure broad linguistic coverage. We evaluate both closed-source and open-source LLMs using a standardized evaluation framework. Our results show that Gemini 2.5 achieves the highest overall performance, while open-source models lag behind, highlighting the gap in linguistic grounding. Category- and grade-wise analyses reveal that all models perform well on lower-grade questions but show a clear decline as linguistic complexity increases. Further, no strong correlation is observed between a model's overall performance and its ability to identify linguistic categories, suggesting that performance may be driven by exposure rather than genuine understanding.
comment: 11 pages
Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load
Negation instructions such as 'do not mention $X$' can paradoxically increase the accessibility of $X$ in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigated this tension with two experiments. (1) Load & content: after a negation instruction, we vary distractor text (semantic, syntactic, repetition) and measure rebound strength. (2) Polarity separation: we test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.
Evaluating LLMs' Reasoning Over Ordered Procedural Steps AACL 2025
Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall's Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
comment: Accepted to IJCNLP-AACL 2025 Findings
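The three ordering metrics named in the abstract above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, steps are assumed to be distinct hashable items, Kendall's Tau is computed in its tau-a form, and both NLCS and NED are normalized by the longer sequence length, which is one common convention that the paper may or may not use.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall's Tau (tau-a) between two orderings of the same distinct steps."""
    position = {step: i for i, step in enumerate(predicted)}
    # Predicted position of each step, listed in reference order.
    ranks = [position[step] for step in reference]
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks)), 2):
        if ranks[i] < ranks[j]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(predicted, reference):
    """NLCS: LCS length divided by the longer sequence length (assumed convention)."""
    longest = max(len(predicted), len(reference))
    return lcs_length(predicted, reference) / longest if longest else 1.0


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted, reference):
    """NED: edit distance divided by the longer sequence length (assumed convention)."""
    longest = max(len(predicted), len(reference))
    return edit_distance(predicted, reference) / longest if longest else 0.0
```

For a four-step reference with two adjacent steps swapped in the prediction, Kendall's Tau penalizes one discordant pair out of six, NLCS keeps three of four steps in order, and NED counts two edits, which illustrates how the metrics capture complementary aspects of ordering quality.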
ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features
In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them on the HotelRec dataset. Compared with the state-of-the-art literature, our proposed model performs similarly to the best-performing model but at lower computational cost (without ensembling). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen's Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will open-source the ReviewGraph output and platform on our GitHub page https://github.com/aaronlifenghan/ReviewGraph
comment: Peer-reviewed and published version is in ICKG-2025 (The 16th IEEE International Conference on Knowledge Graphs, November 13-14, 2025, Limassol, Cyprus)
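The agreement-based metric mentioned in the abstract above, Cohen's Kappa, compares observed agreement between two label sequences with the agreement expected by chance. The following is a minimal sketch of the standard unweighted formula; the function name and the example ratings are hypothetical and are not taken from the paper or the HotelRec dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's Kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both sequences use a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Unlike plain accuracy, the score drops toward zero when a predictor matches the true ratings only as often as the label frequencies alone would predict.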
SCRum-9: Multilingual Stance Classification over Rumours on Social Media
We introduce SCRum-9, the largest multilingual Stance Classification dataset for Rumour analysis in 9 languages, containing 7,516 tweets from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages, linking examples to more fact-checked claims (2.1k), and including confidence-related annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least two native speakers per language, totalling more than 405 hours of annotation and 8,150 dollars in compensation. Further, SCRum-9 is used to benchmark five large language models (LLMs) and two multilingual masked language models (MLMs) in In-Context Learning (ICL) and fine-tuning setups. This paper also innovates by exploring the use of multilingual synthetic data for rumour stance classification, showing that even LLMs with weak ICL performance can produce valuable synthetic data for fine-tuning small MLMs, enabling them to achieve higher performance than zero-shot ICL in LLMs. Finally, we examine the relationship between model predictions and human uncertainty on ambiguous cases, finding that model predictions often match the second-choice labels assigned by annotators, rather than diverging entirely from human judgments. SCRum-9 is publicly released to the research community with the potential to foster further research on multilingual analysis of misleading narratives on social media.
comment: Accepted by ICWSM 2026
Human-Computer Interaction
More Than Irrational: Modeling Belief-Biased Agents AAAI
Despite the explosive growth of AI and the technologies built upon it, predicting and inferring the sub-optimal behavior of users or human collaborators remains a critical challenge. In many cases, such behaviors are not a result of irrationality, but rather a rational decision made given inherent cognitive bounds and biased beliefs about the world. In this paper, we formally introduce a class of computational-rational (CR) user models for cognitively-bounded agents acting optimally under biased beliefs. The key novelty lies in explicitly modeling how a bounded memory process leads to a dynamically inconsistent and biased belief state and, consequently, sub-optimal sequential decision-making. We address the challenge of identifying the latent user-specific bound and inferring biased belief states from passive observations on the fly. We argue that for our formalized CR model family with an explicit and parameterized cognitive process, this challenge is tractable. To support our claim, we propose an efficient online inference method based on nested particle filtering that simultaneously tracks the user's latent belief state and estimates the unknown cognitive bound from a stream of observed actions. We validate our approach in a representative navigation task using memory decay as an example of a cognitive bound. With simulations, we show that (1) our CR model generates intuitively plausible behaviors corresponding to different levels of memory capacity, and (2) our inference method accurately and efficiently recovers the ground-truth cognitive bounds from limited observations ($\le 100$ steps). We further demonstrate how this approach provides a principled foundation for developing adaptive AI assistants, enabling adaptive assistance that accounts for the user's memory limitations.
comment: 13 pages, 8 figures. Accepted at the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
Intermittent Rendezvous Plans with Mixed Integer Linear Program for Large-Scale Multi-Robot Exploration
Multi-Robot Exploration (MRE) systems with communication constraints have proven efficient in accomplishing a variety of tasks, including search-and-rescue, stealth, and military operations. While some works focus on opportunistic approaches for efficiency, others concentrate on pre-planned trajectories or scheduling for increased interpretability. However, scheduling usually requires knowledge of the environment beforehand, which prevents its deployment in several domains due to related uncertainties (e.g., underwater exploration). In our previous work, we proposed an intermittent communications framework for MRE under communication constraints that uses scheduled rendezvous events to mitigate such limitations. However, the system was unable to generate optimal plans and had no mechanisms to follow the plan considering realistic trajectories, which is not suited for real-world deployments. In this work, we further investigate the problem by formulating the Multi-Robot Exploration with Communication Constraints and Intermittent Connectivity (MRE-CCIC) problem. We propose a Mixed-Integer Linear Program (MILP) formulation to generate rendezvous plans and a policy to follow them based on the Rendezvous Tracking for Unknown Scenarios (RTUS) mechanism. The RTUS is a simple rule to allow robots to follow the assigned plan, considering unknown conditions. Finally, we evaluated our method in a large-scale environment configured in Gazebo simulations. The results suggest that our method can follow the plan promptly and accomplish the task efficiently. We provide an open-source implementation of both the MILP plan generator and the large-scale MRE-CCIC.
comment: 9 pages, 9 figures, International Conference on Advanced Robotics
Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System
Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.
From Play to Detection: Mini-SPACE as a Serious Game for Unsupervised Cognitive Impairment Screening
Early detection of Cognitive Impairment (CI) is critical for timely intervention, preservation of independence, and reducing the burden of dementia. Yet, most screening tools remain lengthy, clinic-based, and poorly suited for large-scale unsupervised deployment. This paper evaluates the test-retest reliability, validity, and usability of mini-SPACE, a short iPad-based serious game for detecting early signs of CI. Participants played mini-SPACE at home without supervision once a week for three weeks, with a longer version of the game in the final week. Mini-SPACE showed good test-retest reliability in unsupervised settings. While younger age was the primary predictor of performance, usability, and cognitive load, participants of all ages were able to complete the tasks and reported good usability and low cognitive load. Importantly, the prediction of scores in the Montreal Cognitive Assessment (MoCA) improved with repeated measures. These findings highlight mini-SPACE as a promising digital marker for scalable, age-sensitive screening and potential longitudinal tracking of CI.
comment: 22 pages
CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs
Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Existing methods focus on de-contextualized correctness or forced-choice judgments, overlooking the need for cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of a model's response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.
comment: 7 pages, 5 figures
Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations
Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.
comment: Under review; 16 pages, 15 figures
A Study of Performance and Interaction Patterns in Hand and Tangible Interaction in Tabletop Mixed Reality
This paper presents a comprehensive study of virtual 3D object manipulation along 4DoF on real surfaces in mixed reality (MR), using hand-based and tangible interactions. A custom cylindrical tangible proxy leverages affordances of physical knobs and tabletop support for stable input. We evaluate both modalities across isolated tasks (2DoF translation, 1DoF rotation or scaling), semicombined tasks (3DoF translation and rotation), and full 4DoF compound manipulation. We offer analyses of hand interactions, tangible interactions, and their comparison in MR tasks. For hand interactions, compound tasks required repetitive corrections, increasing completion times; yet, surprisingly, rotation errors were smaller in compound tasks than in rotation-only tasks. Tangible interactions exhibited significantly larger errors in translation, rotation, and scaling during compound tasks compared to isolated tasks. Crucially, tangible interactions outperformed hand interactions in precision, likely due to tabletop support and constrained 4DoF design. These findings inform designers opting for hand-only interaction (highlighting tradeoffs in compound tasks) and those leveraging tangibles (emphasizing precision gains despite compound-task challenges).
"Power of Words": Stealthy and Adaptive Private Information Elicitation via LLM Communication Strategies
While communication strategies of Large Language Models (LLMs) are crucial for human-LLM interactions, they can also be weaponized to elicit private information, yet such stealthy attacks remain under-explored. This paper introduces the first adaptive attack framework for stealthy and targeted private information elicitation via communication strategies. Our framework operates in a dynamic closed loop: it first performs real-time psychological profiling of the user's state, then adaptively selects an optimized communication strategy, and finally maintains stealthiness through prompt-based rewriting. We validated this framework through a user study (N=84), demonstrating its generalizability across 3 distinct LLMs and 3 scenarios. The targeted attacks achieved a 205.4% increase in eliciting specific targeted information compared to stealthy interactions without strategies. Even stealthy interactions without specific strategies successfully elicited private information in 54.8% of cases. Notably, users not only failed to detect the manipulation but paradoxically rated the attacking chatbot as more empathetic and trustworthy. Finally, we advocate for mitigations, encouraging developers to integrate adaptive, just-in-time alerts, users to build literacy against specific manipulative tactics, and regulators to define clear ethical boundaries distinguishing benign persuasion from coercion.
Human-aligned Quantification of Numerical Data
Quantifying numerical data involves addressing two key challenges: first, determining whether the data can be naturally quantified, and second, identifying the numerical intervals or ranges of values that correspond to specific value classes, referred to as "quantums," which represent statistically meaningful states. If such quantification is feasible, continuous streams of numerical data can be transformed into sequences of "symbols" that reflect the states of the system described by the measured parameter. People often perform this task intuitively, relying on common sense or practical experience, while information theory and computer science offer computable metrics for this purpose. In this study, we assess the applicability of metrics based on information compression and the Silhouette coefficient for quantifying numerical data. We also investigate the extent to which these metrics correlate with one another and with what is commonly referred to as "human intuition." Our findings suggest that the ability to classify numeric data values into distinct categories is associated with a Silhouette coefficient above 0.65 and a Dip Test below 0.5; otherwise, the data can be treated as following a unimodal normal distribution. Furthermore, when quantification is possible, the Silhouette coefficient appears to align more closely with human intuition than the "normalized centroid distance" method derived from an information compression perspective.
comment: 9 pages, 5 figures, 1 table
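To make the Silhouette criterion in the abstract above concrete, the following sketch computes the mean Silhouette coefficient for one-dimensional values that have already been assigned to candidate classes. It is an illustration under stated assumptions, not the authors' procedure: the function name is hypothetical, distance is the absolute difference between values, singleton clusters score 0 by the usual convention, and the Dip Test part of the criterion is not covered here.

```python
def silhouette_1d(values, labels):
    """Mean Silhouette coefficient for 1D values under a given cluster labeling."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    if len(clusters) < 2:
        raise ValueError("the Silhouette coefficient needs at least two clusters")
    scores = []
    for value, label in zip(values, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(abs(value - other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(value - other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Two tight, well-separated groups such as readings near 1 and readings near 10 score close to 1, above the 0.65 level reported in the abstract, while an interleaved labeling of evenly spaced values scores below zero.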
Teaching LLMs to See and Guide: Context-Aware Real-Time Assistance in Augmented Reality
The growing adoption of augmented and virtual reality (AR and VR) technologies in industrial training and on-the-job assistance has created new opportunities for intelligent, context-aware support systems. As workers perform complex tasks guided by AR and VR, these devices capture rich streams of multimodal data, including gaze, hand actions, and task progression, that can reveal user intent and task state in real time. Leveraging this information effectively remains a major challenge. In this work, we present a context-aware large language model (LLM) assistant that integrates diverse data modalities, such as hand actions, task steps, and dialogue history, into a unified framework for real-time question answering. To systematically study how context influences performance, we introduce an incremental prompting framework, where each model version receives progressively richer contextual inputs. Using the HoloAssist dataset, which records AR-guided task executions, we evaluate how each modality contributes to the assistant's effectiveness. Our experiments show that incorporating multimodal context significantly improves the accuracy and relevance of responses. These findings highlight the potential of LLM-driven multimodal integration to enable adaptive, intuitive assistance for AR and VR-based industrial training and assistance.
comment: This work has been submitted to the IEEE Transactions on Systems, Man, and Cybernetics: Systems for possible publication
Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script AACL 2025
DNN-based language models excel across various NLP tasks but remain highly vulnerable to textual adversarial attacks. While adversarial text generation is crucial for NLP security, explainability, evaluation, and data augmentation, related work remains overwhelmingly English-centric, leaving the problem of constructing high-quality and sustainable adversarial robustness benchmarks for lower-resourced languages both difficult and understudied. First, method customization for lower-resourced languages is complicated due to linguistic differences and limited resources. Second, automated attacks are prone to generating invalid or ambiguous adversarial texts. Last but not least, language models continuously evolve and may be immune to parts of previously generated adversarial texts. To address these challenges, we introduce HITL-GAT, an interactive system based on a general approach to human-in-the-loop generation of adversarial texts. Additionally, we demonstrate the utility of HITL-GAT through a case study on Tibetan script, employing three customized adversarial text generation methods and establishing its first adversarial robustness benchmark, providing a valuable reference for other lower-resourced languages.
comment: Camera-Ready Version; Accepted at IJCNLP-AACL 2025 Demo
Computer Vision and Pattern Recognition
LARM: A Large Articulated-Object Reconstruction Model
Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: https://sylviayuan-sy.github.io/larm-site/
DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
Bridging Hidden States in Vision-Language Models
Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation
Chess has experienced a large increase in viewership since the pandemic, driven largely by the accessibility of online learning platforms. However, no equivalent assistance exists for physical chess games, creating a divide between analog and digital chess experiences. This paper presents CVChess, a deep learning framework for converting chessboard images to Forsyth-Edwards Notation (FEN), which can then be passed to an online chess engine to suggest the best next move. Our approach employs a convolutional neural network (CNN) with residual layers to perform piece recognition from smartphone camera images. The system processes RGB images of a physical chess board through a multistep process: image preprocessing using the Hough Line Transform for edge detection, projective transform to achieve a top-down board alignment, segmentation into 64 individual squares, and piece classification into 13 classes (6 unique white pieces, 6 unique black pieces, and an empty square) using the residual CNN. Residual connections help retain low-level visual features while enabling deeper feature extraction, improving accuracy and stability during training. We train and evaluate our model using the Chess Recognition Dataset (ChessReD), containing 10,800 annotated smartphone images captured under diverse lighting conditions and angles. The resulting classifications are encoded as an FEN string, which can be fed into a chess engine to generate the best move.
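The final step of the CVChess pipeline, turning 64 per-square class predictions into FEN, can be illustrated with a minimal sketch. The label order in `PIECE_SYMBOLS` and the function name `board_to_fen_placement` are assumptions made for illustration and are not taken from the paper. The sketch produces only the piece-placement field of FEN; the remaining fields (side to move, castling rights, and so on) cannot be read off a single image and would have to be supplied separately.

```python
# Hypothetical label order (an assumption): 0 = empty, 1-6 = white, 7-12 = black.
PIECE_SYMBOLS = ["", "P", "N", "B", "R", "Q", "K", "p", "n", "b", "r", "q", "k"]


def board_to_fen_placement(board):
    """Encode an 8x8 grid of class indices as the FEN piece-placement field.

    `board` lists ranks from rank 8 down to rank 1, each rank from file a to h.
    """
    ranks = []
    for row in board:
        parts = []
        empty_run = 0
        for cls in row:
            symbol = PIECE_SYMBOLS[cls]
            if symbol == "":
                # FEN compresses consecutive empty squares into a digit.
                empty_run += 1
            else:
                if empty_run:
                    parts.append(str(empty_run))
                    empty_run = 0
                parts.append(symbol)
        if empty_run:
            parts.append(str(empty_run))
        ranks.append("".join(parts))
    return "/".join(ranks)


# Standard starting position as class indices.
white_back = [4, 2, 3, 5, 6, 3, 2, 4]          # R N B Q K B N R
black_back = [c + 6 for c in white_back]       # r n b q k b n r
start = [black_back, [7] * 8] + [[0] * 8 for _ in range(4)] + [[1] * 8, white_back]

print(board_to_fen_placement(start))
# prints rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
```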
Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities
Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation.
OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning
Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher's attention map with student reconstruction loss, adaptively refining clinically-relevant masking to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest to-date public ultrasound dataset comprising over 308K images from 42 publicly available datasets, covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at https://github.com/XZheng0427/OpenUS.
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images
Accurate assessment of PD-L1 expression is critical for guiding immunotherapy, yet current immunohistochemistry (IHC) based methods are resource-intensive. We present nnUNet-B: a Bayesian segmentation framework that infers PD-L1 expression directly from H&E-stained histology images using Multimodal Posterior Sampling (MPS). Built upon nnUNet-v2, our method samples diverse model checkpoints during cyclic training to approximate the posterior, enabling both accurate segmentation and epistemic uncertainty estimation via entropy and standard deviation. Evaluated on a dataset of lung squamous cell carcinoma, our approach achieves competitive performance against established baselines with mean Dice Score and mean IoU of 0.805 and 0.709, respectively, while providing pixel-wise uncertainty maps. Uncertainty estimates show strong correlation with segmentation error, though calibration remains imperfect. These results suggest that uncertainty-aware H&E-based PD-L1 prediction is a promising step toward scalable, interpretable biomarker assessment in clinical workflows.
comment: Preprint (pre-review). Accepted for publication in Lecture Notes in Bioinformatics (Springer, 2025). The final authenticated version will be available on SpringerLink once published
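The quantities named in the nnUNet-B abstract above (entropy and standard deviation over sampled checkpoints, Dice score, and IoU) can be sketched for the binary case as follows. This is a generic illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, entropy is taken on the mean foreground probability, and the standard deviation is the population form.

```python
import math
from statistics import pstdev


def pixel_uncertainty(samples):
    """Summarise one pixel's foreground probabilities across sampled checkpoints.

    Returns (mean probability, binary entropy of the mean in nats, std deviation).
    """
    mean_p = sum(samples) / len(samples)
    eps = 1e-12  # guards log(0) when the mean is exactly 0 or 1
    entropy = -(mean_p * math.log(mean_p + eps)
                + (1.0 - mean_p) * math.log(1.0 - mean_p + eps))
    return mean_p, entropy, pstdev(samples)


def dice_and_iou(pred, target):
    """Dice score and IoU for two flat, equal-length lists of 0/1 labels."""
    inter = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    # Two empty masks are treated as a perfect match (a convention, assumed here).
    dice = 2.0 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Checkpoints that disagree give high entropy and a large standard deviation.
print(pixel_uncertainty([0.1, 0.9, 0.5, 0.5]))
print(dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0]))
```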
ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation
Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.
comment: 12 pages, 5 tables, 6 figures
Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective AAAI 2026
As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed), visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
comment: Accepted at AAAI 2026
Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery
Recent advances in generative modeling have substantially enhanced 3D urban generation, enabling applications in digital twins, virtual cities, and large-scale simulations. However, existing methods face two key challenges: (1) the need for large-scale 3D city assets for supervised training, which are difficult and costly to obtain, and (2) reliance on semantic or height maps, which are used exclusively for generating buildings in virtual worlds and lack connection to real-world appearance, limiting the realism and generalizability of generated cities. To address these limitations, we propose Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from real-world satellite imagery. Unlike previous city-level generation methods, Sat2RealCity builds generation upon individual building entities, enabling the use of rich priors and pretrained knowledge from 3D object generation while substantially reducing dependence on large-scale 3D city assets. Specifically, (1) we introduce the OSM-based spatial priors strategy to achieve interpretable geometric generation from spatial topology to building instances; (2) we design an appearance-guided controllable modeling mechanism for fine-grained appearance realism and style control; and (3) we construct an MLLM-powered semantic-guided generation pipeline, bridging semantic interpretation and geometric reconstruction. Extensive quantitative and qualitative experiments demonstrate that Sat2RealCity significantly surpasses existing baselines in structural consistency and appearance realism, establishing a strong foundation for real-world aligned 3D urban content creation. The code will be released soon.
Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents
The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of the VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. Experiments, run on 12 models, analyze: (1) The VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) The effect of different types of corruption (NLP entity, document element, layout); (3) The effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.
Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification
Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.
comment: 11 pages, 4 figures
Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer
Multimodal deep learning (MDL) has emerged as a transformative approach in computational pathology. By integrating complementary information from multiple data sources, MDL models have demonstrated superior predictive performance across diverse clinical tasks compared to unimodal models. However, the assumption that combining modalities inherently improves performance remains largely unexamined. We hypothesise that multimodal gains depend critically on the predictive quality of individual modalities, and that integrating weak modalities may introduce noise rather than complementary information. We test this hypothesis on a prostate cancer dataset with histopathology, radiology, and clinical data to predict time-to-biochemical recurrence. Our results confirm that combining high-performing modalities yields superior performance compared to unimodal approaches. However, integrating a poor-performing modality with other higher-performing modalities degrades predictive accuracy. These findings demonstrate that multimodal benefit requires selective, performance-guided integration rather than indiscriminate modality combination, with implications for MDL design across computational pathology and medical imaging.
comment: 5 pages, 1 figure, 4 tables
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models AAAI 2026
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
comment: This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details
Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping
Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain's hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.
Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation AAAI-26
Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled k-t space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. In this work, we propose MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Furthermore, we introduce a new INR network architecture tailored to the CMR problem, which significantly stabilizes model optimization. Experiments on retrospective (simulated) datasets demonstrate the superiority of MoCo-INR over state-of-the-art methods, achieving fast convergence and fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x in VISTA sampling). Additionally, evaluations on prospective (real-acquired) free-breathing CMR scans highlight the clinical practicality of MoCo-INR for real-time imaging. Several ablation studies further confirm the effectiveness of the critical components of MoCo-INR.
comment: Accepted by AAAI-26
The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models
Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition, whether a model identifies a reference, from realization, how it depicts it through replication or reinterpretation, quantified through measures capturing both dimensions. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework, based on both the reference image and the combination of the original image with editing instructions, that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multimodal community.
Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs
Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g., achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at https://multilingual.franreno.com.
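The "accuracy at IoU@50" metric reported above is commonly computed by counting a prediction as correct when its box overlaps the ground-truth box with an IoU of at least 0.5. A minimal sketch follows, assuming boxes in `(x1, y1, x2, y2)` corner format and an inclusive threshold; both conventions, and the function names, are assumptions rather than details from the paper.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the matching target meets the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 0, 3, 2)]  # second pair overlaps with IoU 1/3
print(accuracy_at_iou(preds, targets))   # prints 0.5
```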
Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment AAAI 2026
Decoding visual features from EEG signals is a central challenge in neuroscience, with cross-modal alignment as the dominant approach. We argue that the relationship between visual and brain modalities is fundamentally asymmetric, characterized by two critical gaps: a Fidelity Gap (stemming from EEG's inherent noise and signal degradation, vs. vision's high-fidelity features) and a Semantic Gap (arising from EEG's shallow conceptual representation, vs. vision's rich semantic depth). Previous methods often overlook this asymmetry, forcing alignment between the two modalities as if they were equal partners and thereby leading to poor generalization. To address this, we propose the adaptive teaching paradigm. This paradigm empowers the "teacher" modality (vision) to dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to match the "student" modality (EEG)'s capacity. We implement this paradigm with the ShrinkAdapter, a simple yet effective module featuring a residual-free design and a bottleneck structure. Through extensive experiments, we validate the underlying rationale and effectiveness of our paradigm. Our method achieves a top-1 accuracy of 60.2% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by a margin of 9.8%. Our work introduces a new perspective for asymmetric alignment: the teacher must shrink and adapt to bridge the vision-brain gap.
comment: 21 pages, 12 figures, published at AAAI 2026
BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning AAAI 2026
Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
comment: Accepted by AAAI 2026
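The idea of restricting updates to a subspace orthogonal to past task features can be illustrated with a plain projection. The sketch below is a generic construction rather than BOFA's actual procedure: it omits the low-rank parameterization entirely, and the helper names (`orthonormal_basis`, `project_update`) and the row-wise weight layout are hypothetical choices made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of past-task feature vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_update(delta_w, past_features):
    """Remove from each row of delta_w its component in the span of past features.

    delta_w is an (out_dim x in_dim) list of rows; past features live in the
    input space, so the projected update maps every past feature to zero.
    """
    basis = orthonormal_basis(past_features)
    projected = []
    for row in delta_w:
        r = list(row)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected
```

Because the projected update sends every stored feature to zero, applying it leaves the layer's response to those past inputs unchanged, which is the intuition behind the "safe subspace".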
Low-Bit, High-Fidelity: Optimal Transport Quantization for Flow Matching
Flow Matching (FM) generative models offer efficient simulation-free training and deterministic sampling, but their practical deployment is challenged by high-precision parameter requirements. We adapt optimal transport (OT)-based post-training quantization to FM models, minimizing the 2-Wasserstein distance between quantized and original weights, and systematically compare its effectiveness against uniform, piecewise, and logarithmic quantization schemes. Our theoretical analysis provides upper bounds on generative degradation under quantization, and empirical results across five benchmark datasets of varying complexity show that OT-based quantization preserves both visual generation quality and latent space stability down to 2-3 bits per parameter, where alternative methods fail. This establishes OT-based quantization as a principled, effective approach to compress FM generative models for edge and embedded AI applications.
comment: 12 pages, 8 figures
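For scalar weights, the nearest-level map is monotone, so the squared 2-Wasserstein distance between the original and quantized weight distributions equals the mean squared quantization error, and minimizing it over the levels becomes a one-dimensional k-means problem. The sketch below uses Lloyd iterations for this and compares against a uniform grid. It is an illustration under that assumption only; the abstract does not specify the paper's procedure, and all function names here are hypothetical.

```python
def quantize_lloyd_1d(weights, bits, iters=20):
    """Map each weight to one of 2**bits levels found by 1-D Lloyd iterations."""
    k = 2 ** bits
    s = sorted(weights)
    n = len(s)
    # Initialise levels at evenly spaced quantiles of the weight distribution.
    levels = [s[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for w in weights:
            j = min(range(k), key=lambda i: (w - levels[i]) ** 2)
            sums[j] += w
            counts[j] += 1
        # Move each level to the mean of its assigned weights; keep empty levels.
        levels = [sums[i] / counts[i] if counts[i] else levels[i] for i in range(k)]
    return [min(levels, key=lambda c: (w - c) ** 2) for w in weights]


def quantize_uniform(weights, bits):
    """Baseline: round to 2**bits evenly spaced levels between min and max."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (k - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def mse(a, b):
    """Mean squared error between two equally long weight lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

On skewed weight distributions the data-dependent levels tend to give a lower error than the uniform grid at the same bit width, which is the regime the abstract highlights (2-3 bits).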
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom's level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach microscopy reasoning performance competitive with models such as GPT-5 and to achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.
comment: 11 pages, 4 figures
Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis
Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone), and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules, i.e., the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.
Unsupervised Segmentation of Micro-CT Scans of Polyurethane Structures By Combining Hidden-Markov-Random Fields and a U-Net
Extracting digital material representations from images is a necessary prerequisite for a quantitative analysis of material properties. Different segmentation approaches have been extensively studied in the past to achieve this task, but were often lacking accuracy or speed. With the advent of machine learning, supervised convolutional neural networks (CNNs) have achieved state-of-the-art performance for different segmentation tasks. However, these models are often trained in a supervised manner, which requires large labeled datasets. Unsupervised approaches do not require ground-truth data for learning, but suffer from long segmentation times and often worse segmentation accuracy. Hidden Markov Random Fields (HMRF) are an unsupervised segmentation approach that incorporates concepts of neighborhood and class distributions. We present a method that integrates HMRF theory and CNN segmentation, leveraging the advantages of both areas: unsupervised learning and fast segmentation times. We investigate the contribution of different neighborhood terms and components for the unsupervised HMRF loss. We demonstrate that the HMRF-UNet enables high segmentation accuracy without ground truth on a Micro-Computed Tomography ($μ$CT) image dataset of Polyurethane (PU) foam structures. Finally, we propose and demonstrate a pre-training strategy that considerably reduces the required amount of ground-truth data when training a segmentation model.
Free3D: 3D Human Motion Emerges from Single-View 2D Supervision
Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.
YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation
We introduce YCB-Ev SD, a synthetic dataset of event-camera data at standard definition (SD) resolution for 6DoF object pose estimation. While synthetic data has become fundamental in frame-based computer vision, event-based vision lacks comparable comprehensive resources. Addressing this gap, we present 50,000 event sequences of 34 ms duration each, synthesized from Physically Based Rendering (PBR) scenes of YCB-Video objects following the Benchmark for 6D Object Pose (BOP) methodology. Our generation framework employs simulated linear camera motion to ensure complete scene coverage, including background activity. Through systematic evaluation of event representations for CNN-based inference, we demonstrate that time-surfaces with linear decay and dual-channel polarity encoding achieve superior pose estimation performance, outperforming exponential decay and single-channel alternatives by significant margins. Our analysis reveals that polarity information contributes most substantially to performance gains, while linear temporal encoding preserves critical motion information more effectively than exponential decay. The dataset is provided in a structured format with both raw event streams and precomputed optimal representations to facilitate immediate research use and reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/paroj/ycbev_sd.
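A time-surface with linear decay and dual-channel polarity encoding, as named in the abstract above, can be sketched as follows. This is a minimal illustration, not the dataset's reference code: the event tuple layout `(x, y, t, polarity)`, the polarity values `{0, 1}`, and the parameters `t_ref` and `tau` are assumptions.

```python
def time_surface(events, width, height, t_ref, tau):
    """Build a 2-channel time surface with linear decay.

    events: iterable of (x, y, t, polarity), polarity in {0, 1}.
    Returns surface[polarity][y][x] in [0, 1]; 0 means no recent event.
    """
    # Track the most recent timestamp per pixel, separately for each polarity.
    last_t = [[[None] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, p in events:
        if t <= t_ref and (last_t[p][y][x] is None or t > last_t[p][y][x]):
            last_t[p][y][x] = t
    surface = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for p in range(2):
        for y in range(height):
            for x in range(width):
                t = last_t[p][y][x]
                if t is not None:
                    # Linear decay: 1 at t_ref, falling to 0 once the event is tau old.
                    surface[p][y][x] = max(0.0, 1.0 - (t_ref - t) / tau)
    return surface
```

An exponential-decay variant would replace the linear term with `math.exp(-(t_ref - t) / tau)`, and a single-channel variant would merge both polarities into one map; these are the alternatives the abstract reports as performing worse.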
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
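Entropy-based filtering of low-confidence responses over sequential segments can be sketched generically as below. This is an assumption-laden illustration, not DocSLM's calibrator: the per-segment answer logits, the normalized-entropy score, the `max_entropy` threshold, and all function names are hypothetical, and at least two answer candidates per segment are assumed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def stream_answers(segment_logits, max_entropy=0.5):
    """Process segments in order; yield (segment_index, answer_index) only when confident."""
    for i, logits in enumerate(segment_logits):
        probs = softmax(logits)
        if normalized_entropy(probs) <= max_entropy:
            yield i, max(range(len(probs)), key=probs.__getitem__)
```

Because segments are consumed one at a time by a generator, memory use does not grow with document length, which mirrors the streaming motivation stated in the abstract.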
Large-scale modality-invariant foundation models for brain MRI analysis: Application to lesion segmentation
The field of computer vision is undergoing a paradigm shift toward large-scale foundation model pre-training via self-supervised learning (SSL). Leveraging large volumes of unlabeled brain MRI data, such models can learn anatomical priors that improve few-shot performance in diverse neuroimaging tasks. However, most SSL frameworks are tailored to natural images, and their adaptation to capture multi-modal MRI information remains underexplored. This work proposes a modality-invariant representation learning setup and evaluates its effectiveness in stroke and epilepsy lesion segmentation, following large-scale pre-training. Experimental results suggest that despite successful cross-modality alignment, lesion segmentation primarily benefits from preserving fine-grained modality-specific features. Model checkpoints and code are made publicly available.
comment: Submitted to IEEE ISBI 2026
6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data
Automated and selective harvesting of fruits has become an important area of research, particularly due to challenges such as high costs and a shortage of seasonal labor in advanced economies. This paper focuses on 6D pose estimation of strawberries using purely synthetic data generated through a procedural pipeline for photorealistic rendering. We employ the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone, known for its balance between speed and accuracy, and its support for edge inference. To address the limited availability of training data, we introduce a robust and flexible pipeline for generating synthetic strawberry data from various 3D models via a procedural Blender pipeline, where we focus on enhancing the realism of the synthesized data in comparison to previous work to make it a valuable resource for training pose estimation algorithms. Quantitative evaluations indicate that our models achieve comparable accuracy on both the NVIDIA RTX 3090 and Jetson Orin Nano across several ADD-S metrics, with the RTX 3090 demonstrating superior processing speed. However, the Jetson Orin Nano is particularly suited for resource-constrained environments, making it an excellent choice for deployment in agricultural robotics. Qualitative assessments further confirm the model's performance, demonstrating its capability to accurately infer the poses of ripe and partially ripe strawberries, while facing challenges in detecting unripe specimens. This suggests opportunities for future improvements, especially in enhancing detection capabilities for unripe strawberries (if desired) by exploring variations in color. Furthermore, the methodology presented could be adapted easily for other fruits such as apples, peaches, and plums, thereby expanding its applicability and impact in the field of agricultural automation.
MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of the Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on the click-through rate (CTR) prediction task, which achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on the CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of "Pretraining, Post-training, and Application", allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify the image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.
comment: 31 pages, 12 figures
AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models AAAI 2026
Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
comment: AAAI 2026. Code: https://github.com/HaokunChen245/AUVIC
SimuFreeMark: A Noise-Simulation-Free Robust Watermarking Against Image Editing
The advancement of artificial intelligence generated content (AIGC) has created a pressing need for robust image watermarking that can withstand both conventional signal processing and novel semantic editing attacks. Current deep learning-based methods rely on training with hand-crafted noise simulation layers, which inherently limit their generalization to unforeseen distortions. In this work, we propose $\textbf{SimuFreeMark}$, a noise-$\underline{\text{simu}}$lation-$\underline{\text{free}}$ water$\underline{\text{mark}}$ing framework that circumvents this limitation by exploiting the inherent stability of image low-frequency components. We first systematically establish that low-frequency components exhibit significant robustness against a wide range of attacks. Building on this foundation, SimuFreeMark embeds watermarks directly into the deep feature space of the low-frequency components, leveraging a pre-trained variational autoencoder (VAE) to bind the watermark with structurally stable image representations. This design completely eliminates the need for noise simulation during training. Extensive experiments demonstrate that SimuFreeMark outperforms state-of-the-art methods across a wide range of conventional and semantic attacks, while maintaining superior visual quality.
RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image AAAI 2026
Gaze redirection methods aim to generate realistic human face images with controllable eye movement. However, recent methods often struggle with 3D consistency, efficiency, or quality, limiting their practical applications. In this work, we propose RTGaze, a real-time and high-quality gaze redirection method. Our approach learns a gaze-controllable facial representation from face images and gaze prompts, then decodes this representation via neural rendering for gaze redirection. Additionally, we distill face geometric priors from a pretrained 3D portrait generator to enhance generation quality. We evaluate RTGaze both qualitatively and quantitatively, demonstrating state-of-the-art performance in efficiency, redirection accuracy, and image quality across multiple datasets. Our system achieves real-time, 3D-aware gaze redirection with a feedforward network (~0.06 sec/image), making it 800x faster than the previous state-of-the-art 3D-aware methods.
comment: AAAI 2026
D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Amplitude and Pixel Spaces
Out-of-domain (OOD) robustness is challenging to achieve in real-world computer vision applications, where shifts in image background, style, and acquisition instruments always degrade model performance. Generic augmentations show inconsistent gains under such shifts, whereas dataset-specific augmentations require expert knowledge and prior analysis. Moreover, prior studies show that neural networks adapt poorly to domain shifts because they exhibit a learning bias to domain-specific frequency components. Perturbing frequency values can mitigate such bias but overlooks pixel-level details, leading to suboptimal performance. To address these problems, we propose D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces), improving OOD robustness by introducing targeted augmentation in both the amplitude space (frequency space) and pixel space. Unlike conventional handcrafted augmentations, D-GAP computes sensitivity maps in the frequency space from task gradients, which reflect how strongly the model responds to different frequency components, and uses the maps to adaptively interpolate amplitudes between source and target samples. This way, D-GAP reduces the learning bias in frequency space, while a complementary pixel-space blending procedure restores fine spatial details. Extensive experiments on four real-world datasets and three domain-adaptation benchmarks show that D-GAP consistently outperforms both generic and dataset-specific augmentations, improving average OOD performance by +5.3% on real-world datasets and +1.8% on benchmark datasets.
Coordinative Learning with Ordinal and Relational Priors for Volumetric Medical Image Segmentation
Volumetric medical image segmentation presents unique challenges due to the inherent anatomical structure and limited availability of annotations. While recent methods have shown promise by contrasting spatial relationships between slices, they rely on hard binary thresholds to define positive and negative samples, thereby discarding valuable continuous information about anatomical similarity. Moreover, these methods overlook the global directional consistency of anatomical progression, resulting in distorted feature spaces that fail to capture the canonical anatomical manifold shared across patients. To address these limitations, we propose Coordinative Ordinal-Relational Anatomical Learning (CORAL) to capture both local and global structure in volumetric images. First, CORAL employs a contrastive ranking objective to leverage continuous anatomical similarity, ensuring relational feature distances between slices are proportional to their anatomical position differences. In addition, CORAL incorporates an ordinal objective to enforce global directional consistency, aligning the learned feature distribution with the canonical anatomical progression across patients. Learning these inter-slice relationships produces anatomically informed representations that benefit the downstream segmentation task. Through this coordinative learning framework, CORAL achieves state-of-the-art performance on benchmark datasets under limited-annotation settings while learning representations with meaningful anatomical structure. Code is available at https://github.com/haoyiwang25/CORAL.
Φeat: Physically-Grounded Feature Representation
Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce $Φ$eat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that $Φ$eat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.
GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving
Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvement in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test-time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.
Discovering Meaningful Units with Visually Grounded Semantics from Image Captions
Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.
CountSteer: Steering Attention for Object Counting in Diffusion Models AAAI 2026
Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model's cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
comment: Accepted to AAAI 2026 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD)
Toward Gaze Target Detection of Young Autistic Children AAAI 2026
The automatic detection of gaze targets in autistic children through artificial intelligence can be impactful, especially for those who lack access to a sufficient number of professionals to improve their quality of life. This paper introduces a new, real-world AI application for gaze target detection in autistic children, which predicts a child's point of gaze from an activity image. This task is foundational for building automated systems that can measure joint attention, a core challenge in Autism Spectrum Disorder (ASD). To facilitate the study of this challenging application, we collected the first-ever Autism Gaze Target (AGT) dataset. We further propose a novel Socially Aware Coarse-to-Fine (SACF) gaze detection framework that explicitly leverages the social context of a scene to overcome the class imbalance common in autism datasets, a consequence of autistic children's tendency to show reduced gaze to faces. It utilizes a two-pathway architecture with expert models specialized in social and non-social gaze, guided by a context-awareness gate module. The results of our comprehensive experiments demonstrate that our framework achieves new state-of-the-art performance for gaze target detection in this population, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.
comment: AAAI 2026 Artificial Intelligence for Social Impact Track
Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-mamba" variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256$\times$256) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
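As a rough illustration of the cross-block state handoff described above, the following toy sketch contrasts reinitializing each block's state from zero with passing the terminal state on as the next block's initial condition. It uses a scalar linear recurrence rather than Mamba's input-dependent selective scan, and the identity boundary map, the function names `scan` and `run_blocks`, and the parameter values are all assumptions made for illustration, not the authors' implementation.

```python
def scan(xs, a, b, h0=0.0):
    """Toy scalar recurrence h_t = a * h_{t-1} + b * x_t.

    Returns the per-step states and the terminal state.
    """
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys, h


def run_blocks(xs, params, chain):
    """Apply blocks in sequence; each block scans the previous block's outputs."""
    h_init = 0.0
    for a, b in params:
        xs, h_last = scan(xs, a, b, h_init)
        # chain=True hands the terminal state to the next block
        # (identity boundary map, an assumption); chain=False resets to zero.
        h_init = h_last if chain else 0.0
    return xs


params = [(0.5, 1.0), (0.5, 1.0)]   # hypothetical (a, b) per block
inputs = [1.0, 0.0, 0.0]
baseline = run_blocks(inputs, params, chain=False)
chained = run_blocks(inputs, params, chain=True)
print(baseline, chained)
```

In this toy setting the chained outputs differ from the baseline only by the decaying contribution of the inherited state, which is one way to read the "mild directional prior" framing.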
Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometry-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing
In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
DoReMi: A Domain-Representation Mixture Framework for Generalizable 3D Understanding
The generalization of 3D deep learning across multiple domains remains constrained by the limited scale of existing datasets and the high heterogeneity of multi-source point clouds. Point clouds collected from different sensors (e.g., LiDAR scans and mesh-derived point clouds) exhibit substantial discrepancies in density and noise distribution, resulting in negative transfer during multi-domain fusion. Most existing approaches focus exclusively on either domain-aware or domain-general features, overlooking the potential synergy between them. To address this, we propose DoReMi (Domain-Representation Mixture), a Mixture-of-Experts (MoE) framework that jointly models a Domain-aware Experts branch and a unified Representation branch to enable cooperative learning between specialized and generalizable knowledge. DoReMi dynamically activates the domain-aware expert branch via Domain-Guided Spatial Routing (DSR) for context-aware expert selection and employs Entropy-Controlled Dynamic Allocation (EDA) for stable and efficient expert utilization, thereby adaptively modeling diverse domain distributions. Complemented by a frozen unified representation branch pretrained through robust multi-attribute self-supervised learning, DoReMi preserves cross-domain geometric and structural priors while maintaining global consistency. We evaluate DoReMi across multiple 3D understanding benchmarks. Notably, DoReMi achieves 80.1% mIoU on ScanNet Val and 77.2% mIoU on S3DIS, demonstrating competitive or superior performance compared to existing approaches, and showing strong potential as a foundation framework for future 3D understanding research. The code will be released soon.
3D Gaussian and Diffusion-Based Gaze Redirection
High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state-of-the-art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy using synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state-of-the-art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees, providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End? AAAI 2026
Positional bias, where models overemphasize certain positions regardless of content, has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.
comment: accepted to AAAI 2026 main track
RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting
3D Gaussian Splatting (3DGS) has recently gained great attention in 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making direct supervision in the optimization process challenging. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.
MAFM^3: Modular Adaptation of Foundation Models for Multi-Modal Medical AI
Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM^3 (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM^3 provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification into prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM^3 achieved a 5% improvement in Dice score compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work can be found at https://github.com/Areeb2735/CTscan_prognosis_VLM
comment: 2 figures, 3 tables
One-to-N Backdoor Attack in 3D Point Cloud via Spherical Trigger
Backdoor attacks represent a critical threat to deep learning systems, particularly in safety-sensitive 3D domains such as autonomous driving and robotics. However, existing backdoor attacks for 3D point clouds have been limited to a rigid one-to-one paradigm. To address this, we present the first one-to-N backdoor framework for 3D vision, based on a novel, configurable spherical trigger. Our key insight is to leverage the spatial properties of spheres as a parameter space, allowing a single trigger design to encode multiple target classes. We establish a theoretical foundation for one-to-N backdoor attacks in 3D, demonstrating that poisoned models can map distinct trigger configurations to different target labels. Experimental results systematically validate this conclusion across multiple datasets and model architectures, achieving high attack success rates (up to 100%) while maintaining accuracy on clean data. This work establishes a crucial benchmark for multi-target threats in 3D vision and provides the foundational understanding needed to secure future 3D-driven intelligent systems.
comment: 15 pages, 4 figures
Questioning the Stability of Visual Question Answering
Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery
Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate-related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high-stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.
Computationally-efficient deep learning models for nowcasting of precipitation: A solution for the Weather4cast 2025 challenge
This study presents a transfer-learning framework based on Convolutional Gated Recurrent Units (ConvGRU) for short-term rainfall prediction in the Weather4Cast 2025 competition. A single SEVIRI infrared channel (10.8 μm wavelength) is used as input, which consists of four observations over a one-hour period. A two-stage training strategy is applied to generate rainfall estimates up to four hours ahead. In the first stage, ConvGRU is trained to forecast the brightness temperatures from SEVIRI, enabling the model to capture relevant spatiotemporal patterns. In the second stage, an empirically derived nonlinear transformation maps the predicted fields to OPERA-compatible rainfall rates. For the event-prediction task, the transformed rainfall forecasts are processed using 3D event detection followed by spatiotemporal feature extraction to identify and characterize precipitation events. Our submission achieved 2nd place in the cumulative rainfall task. Further, the same model was used out-of-the-box for the event prediction task and achieved scores similar to those of the competition's baseline model.
A Comparison of Lightweight Deep Learning Models for Particulate-Matter Nowcasting in the Indian Subcontinent & Surrounding Regions
This paper is a submission for the Weather4Cast 2025 complementary Pollution Task and presents an efficient framework for 6-hour lead-time nowcasting of PM$_1$, PM$_{2.5}$, and PM$_{10}$ across the Indian subcontinent and surrounding regions. The proposed approach leverages analysis fields from the Copernicus Atmosphere Monitoring Service (CAMS) Global Atmospheric Composition Forecasts at 0.4-degree resolution. A 256x256 spatial region, covering 28.4S-73.6N and 32E-134.0E, is used as the model input, while predictions are generated for the central 128x128 area spanning 2.8S-48N and 57.6E-108.4E, ensuring an India-centric forecast domain with sufficient synoptic-scale context. Models are trained on CAMS analyses from 2021-2023 using a shuffled 90/10 split and independently evaluated on 2024 data. Three lightweight parameter-specific architectures are developed to improve accuracy, minimize systematic bias, and enable rapid inference. Evaluation using RMSE, MAE, Bias, and SSIM demonstrates substantial performance gains over the Aurora foundation model, underscoring the effectiveness of compact & specialized deep learning models for short-range forecasts on limited spatial domains.
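The grid arithmetic in the abstract can be checked with a short sketch. It assumes that the stated latitude and longitude bounds are inclusive grid-point coordinates on a regular 0.4-degree grid; coordinates are held as integer tenths of a degree (south and west negative) to avoid floating-point rounding, and the helper name `n_points` is hypothetical.

```python
STEP = 4  # 0.4 degrees, expressed in tenths of a degree


def n_points(lo, hi, step=STEP):
    """Number of grid points from lo to hi inclusive (tenths of a degree)."""
    return (hi - lo) // step + 1


# Input region: 28.4S-73.6N, 32E-134.0E
in_lat, in_lon = (-284, 736), (320, 1340)
# Output region: 2.8S-48N, 57.6E-108.4E
out_lat, out_lon = (-28, 480), (576, 1084)

in_shape = (n_points(*in_lat), n_points(*in_lon))
out_shape = (n_points(*out_lat), n_points(*out_lon))

# Offset of the output window inside the input window, in grid cells
offset = ((out_lat[0] - in_lat[0]) // STEP, (out_lon[0] - in_lon[0]) // STEP)

print(in_shape, out_shape, offset)
```

Under that assumption the bounds give a 256x256 input, a 128x128 output, and a 64-cell margin on each side, which is consistent with the output being the central crop of the input.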
Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks, which limits their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos AAAI 2026
Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive capabilities in dynamic scene reconstruction, they typically rely on the assumption that input video streams are temporally synchronized. However, in real-world scenarios, this assumption often fails due to factors like camera trigger delays or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, a novel temporal alignment strategy is proposed for high-quality 4DGS reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera's time shift. The method first determines a coarse, frame-level offset and then refines it to achieve sub-frame accuracy. This strategy can be readily integrated as a module into existing 4DGS frameworks, enhancing their robustness when handling asynchronous data. Experiments show that our approach effectively processes temporally misaligned videos and significantly enhances baseline methods.
comment: AAAI 2026
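The coarse-to-fine idea of a frame-level offset followed by sub-frame refinement can be sketched generically as follows. This is not the paper's alignment module: it swaps in plain cross-correlation of a hypothetical per-camera 1D activity signal for the coarse step and parabolic peak interpolation for the refinement, and the names `cross_corr` and `estimate_offset` are assumptions.

```python
import math


def cross_corr(ref, other, lag):
    """Sum of ref[t] * other[t + lag] over the indices where both are defined."""
    n = len(ref)
    return sum(ref[t] * other[t + lag]
               for t in range(max(0, -lag), min(n, n - lag)))


def estimate_offset(ref, other, max_lag):
    """Return (coarse integer lag, refined sub-frame lag) of `other` relative to `ref`."""
    scores = {lag: cross_corr(ref, other, lag)
              for lag in range(-max_lag, max_lag + 1)}
    coarse = max(scores, key=scores.get)          # frame-level offset
    if abs(coarse) == max_lag:                    # no neighbours to interpolate with
        return coarse, float(coarse)
    y0, y1, y2 = scores[coarse - 1], scores[coarse], scores[coarse + 1]
    denom = y0 - 2.0 * y1 + y2
    # Vertex of the parabola through the three scores around the peak
    fine = float(coarse) if denom == 0 else coarse + 0.5 * (y0 - y2) / denom
    return coarse, fine


# Hypothetical activity signals: the second camera sees the same event 3.3 frames later.
ref = [math.exp(-((t - 30.0) ** 2) / 18.0) for t in range(64)]
other = [math.exp(-((t - 33.3) ** 2) / 18.0) for t in range(64)]
coarse, fine = estimate_offset(ref, other, max_lag=10)
print(coarse, round(fine, 2))
```

Parabolic interpolation is only one possible refinement; it is slightly biased for non-parabolic peaks, so the refined value approximates rather than recovers the true shift.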
Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA AAAI 2026
In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system's confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. In particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs, each following a distinct prompting strategy, generate candidate answers and then engage in a two-stage interaction: generalist agents critique, refine, and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model's true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration-aware loss function called aligncal, designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent's confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
comment: 17 pages, 6 figures, 5 tables. Accepted to Special Track on AI Alignment, AAAI 2026. Project Page- https://refine-align.github.io/
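To make the notion of calibration above concrete, here is a minimal sketch of the expected calibration error (ECE), a widely used binned measure of the gap between confidence and accuracy. The abstract does not say which calibration metric the authors report, so the choice of ECE, the equal-width binning, and the function name are assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# An overconfident system: 90% confidence on every answer, but only half are right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                            [True, True, False, False])
# A system that is fully confident and always right has zero calibration error.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
print(overconfident, calibrated)
```

The hard binning makes this quantity non-differentiable, which is presumably part of the motivation for training on a differentiable surrogate instead.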
CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios
Vehicle-to-Vehicle (V2V) cooperative perception has great potential to enhance autonomous driving performance by overcoming perception limitations in complex adverse traffic scenarios (CATS). Meanwhile, data serves as the fundamental infrastructure for modern autonomous driving AI. However, due to stringent data collection requirements, existing datasets focus primarily on ordinary traffic scenarios, constraining the benefits of cooperative perception. To address this challenge, we introduce CATS-V2V, the first-of-its-kind real-world dataset for V2V cooperative perception under complex adverse traffic scenarios. The dataset was collected by two hardware time-synchronized vehicles, covering 10 weather and lighting conditions across 10 diverse locations. The 100-clip dataset includes 60K frames of 10 Hz LiDAR point clouds and 1.26M multi-view 30 Hz camera images, along with 750K anonymized yet high-precision RTK-fixed GNSS and IMU records. Correspondingly, we provide time-consistent 3D bounding box annotations for objects, as well as static scenes to construct a 4D BEV representation. On this basis, we propose a target-based temporal alignment method, ensuring that all objects are precisely aligned across all sensor modalities. We hope that CATS-V2V, the largest-scale, most supportive, and highest-quality dataset of its kind to date, will benefit the autonomous driving community in related tasks.
Explainable Deep Convolutional Multi-Type Anomaly Detection
Most explainable anomaly detection methods identify anomalies but lack the capability to differentiate the type of anomaly. Furthermore, they often require the costly training and maintenance of separate models for each object category. The lack of specificity is a significant research gap, as identifying the type of anomaly (e.g., "Crack" vs. "Scratch") is crucial for accurate diagnosis that facilitates cost-saving operational decisions across diverse application domains. While some recent large-scale Vision-Language Models (VLMs) have begun to address this, they are computationally intensive and memory-heavy, restricting their use in real-time or embedded systems. We propose MultiTypeFCDD, a simple and lightweight convolutional framework designed as a practical alternative for explainable multi-type anomaly detection. MultiTypeFCDD uses only image-level labels to learn and produce multi-channel heatmaps, where each channel is trained to correspond to a specific anomaly type. The model functions as a single, unified framework capable of differentiating anomaly types across multiple object categories, eliminating the need to train and manage separate models for each object category. We evaluated our proposed method on the Real-IAD dataset, where it delivers results competitive with state-of-the-art complex models at significantly reduced parametric load and inference times. This makes it a highly practical and viable solution for real-world applications where computational resources are tightly constrained.
Reverberation: Learning the Latencies Before Forecasting Trajectories
Bridging the past to the future, connecting agents both spatially and temporally, lies at the core of the trajectory prediction task. Despite great efforts, it remains challenging to explicitly learn and predict latencies, the temporal delays with which agents respond to different trajectory-changing events and adjust their future paths, whether on their own or interactively. Different agents may exhibit distinct latency preferences for noticing, processing, and reacting to any specific trajectory-changing event. The lack of consideration of such latencies may undermine the causal continuity of the forecasting system and also lead to implausible or unintended trajectories. Inspired by the reverberation curves in acoustics, we propose a new reverberation transform and the corresponding Reverberation (Rev for short) trajectory prediction model, which simulates and predicts different latency preferences of each agent as well as their stochasticity by using two explicit and learnable reverberation kernels, allowing for controllable trajectory prediction based on these forecasted latencies. Experiments on multiple datasets, whether pedestrians or vehicles, demonstrate that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses further verify the properties of the proposed reverberation transform, highlighting its potential as a general latency modeling approach.
OT-ALD: Aligning Latent Distributions with Optimal Transport for Accelerated Image-to-Image Translation
The Dual Diffusion Implicit Bridge (DDIB) is an emerging image-to-image (I2I) translation method that preserves cycle consistency while achieving strong flexibility. It links two independently trained diffusion models (DMs) in the source and target domains by first adding noise to a source image to obtain a latent code, then denoising it in the target domain to generate the translated image. However, this method faces two key challenges: (1) low translation efficiency, and (2) translation trajectory deviations caused by mismatched latent distributions. To address these issues, we propose a novel I2I translation framework, OT-ALD, grounded in optimal transport (OT) theory, which retains the strengths of the DDIB-based approach. Specifically, we compute an OT map from the latent distribution of the source domain to that of the target domain, and use the mapped distribution as the starting point for the reverse diffusion process in the target domain. Our error analysis confirms that OT-ALD eliminates latent distribution mismatches. Moreover, OT-ALD effectively balances faster image translation with improved image quality. Experiments on four translation tasks across three high-resolution datasets show that OT-ALD improves sampling efficiency by 20.29% and reduces the FID score by 2.6 on average compared to the top-performing baseline models.
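For intuition about mapping one latent distribution onto another with optimal transport, the one-dimensional case has a simple closed form: for equal-size samples and a squared-distance cost, the optimal matching pairs the k-th smallest source value with the k-th smallest target value. The sketch below shows only this 1D toy case; the paper's latents are high-dimensional and its OT solver is not described in the abstract, so the function name and setup are assumptions.

```python
def ot_map_1d(source, target):
    """Map each source sample to its rank-matched target sample.

    Assumes len(source) == len(target); rank matching is the optimal
    transport plan in 1D for a squared-distance cost.
    """
    order = sorted(range(len(source)), key=lambda i: source[i])
    sorted_target = sorted(target)
    mapped = [0.0] * len(source)
    for rank, i in enumerate(order):
        mapped[i] = sorted_target[rank]
    return mapped


source_latents = [3.0, 1.0, 2.0]      # hypothetical 1D source latents
target_latents = [10.0, 30.0, 20.0]   # hypothetical 1D target latents
mapped = ot_map_1d(source_latents, target_latents)
print(mapped)
```

After the mapping, the transported samples follow the target sample distribution exactly while preserving the ordering of the source samples.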
Deep Learning-Enhanced Analysis for Delineating Anticoagulant Essay Efficacy Using Phase Microscopy
The coagulation of blood after it is drawn from the body poses a significant challenge for hematological analysis, potentially leading to inaccurate test results and altered cellular characteristics, compromising diagnostic reliability. This paper presents a deep learning-enhanced framework for delineating anticoagulant efficacy ex vivo using Digital Holographic Microscopy (DHM). We demonstrate a label-free, non-invasive approach for analyzing human blood samples, capable of accurate cell counting and morphological estimation. A DHM with an automated image processing and deep learning pipeline is built for morphological analysis of the blood cells under two different anticoagulants: conventional EDTA and novel potassium ferric oxalate nanoparticles (KFeOx-NPs). This enables automated high-throughput screening of cells and estimation of blood coagulation rates when samples are treated with different anticoagulants. Results indicated that KFeOx-NPs prevented human blood coagulation without altering the cellular morphology of red blood cells (RBCs), whereas EDTA caused notable changes within 6 hours of incubation. The system allows for quantitative analysis of coagulation dynamics by assessing parameters like cell clustering and morphology over time in these prepared samples, offering insights into the comparative efficacy and effects of anticoagulants outside the body.
Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering
Knowledge-based Visual Question Answering (KBVQA) necessitates external knowledge incorporation beyond cross-modal understanding. Existing KBVQA methods either utilize implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness internal knowledge reasoning ability in MLLMs. First, to tackle the reasoning supervision problem, we propose to emphasize the hindsight wisdom of MLLMs by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground truth answer, constructing Hindsight-Zero training data. Then we self-distill Hindsight-Zero into a Chain-of-Thought (CoT) Generator and a Knowledge Generator, enabling the generation of sequential steps and discrete facts. Second, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over over-confident but unhelpful knowledge. The generated CoT and sampled knowledge are then exploited for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with elicited reasoning from a 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.
Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion
With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes' implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.
AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model
Vectorized glyphs are widely used in poster design, network animation, art display, and various other fields due to their scalability and flexibility. In typography, they are often seen as special sequences composed of ordered strokes. This concept extends to the token sequence prediction abilities of large language models (LLMs), enabling vectorized character generation through stroke modeling. In this paper, we propose a novel Large Vectorized Glyph Model (LVGM) designed to generate vectorized Chinese glyphs by predicting the next stroke. Initially, we encode strokes into discrete latent variables called stroke embeddings. Subsequently, we train our LVGM via fine-tuning DeepSeek LLM by predicting the next stroke embedding. With limited strokes given, it can generate complete characters, semantically elegant words, and even unseen verses in vectorized form. Moreover, we release a new large-scale Chinese SVG dataset containing 907,267 samples based on strokes for dynamically vectorized glyph generation. Experimental results show that our model has scaling behaviors on data scales. Our generated vectorized glyphs have been validated by experts and relevant individuals.
Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions
Artificial intelligence (AI) in media has advanced rapidly over the last decade. The introduction of Generative Adversarial Networks (GANs) improved the quality of photorealistic image generation. Diffusion models later brought a new era of generative media. These advances made it difficult to separate real and synthetic content. The rise of deepfakes demonstrated how these tools could be misused to spread misinformation, political conspiracies, privacy violations, and fraud. For this reason, many detection models have been developed. They often use deep learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models search for visual, spatial, or temporal anomalies. However, such approaches often fail to generalize across unseen data and struggle with content from different models. In addition, existing approaches are ineffective on multimodal data and highly modified content. This study reviews twenty-four recent works on AI-generated media detection. Each study was examined individually to identify its contributions and weaknesses. The review then summarizes the common limitations and key challenges faced by current approaches. Based on this analysis, a research direction is suggested with a focus on multimodal deep learning models. Such models have the potential to provide more robust and generalized detection. This review offers future researchers a clear starting point for building stronger defenses against harmful synthetic media.
comment: 10 Pages, 4 figures, 1 table, 7th International Conference on Trends in Computational and Cognitive Engineering (TCCE-2025)
VIDEOP2R: Video Understanding from Perception to Reasoning
Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embedding. A naive optimization strategy is to selectively focus on and retain KV caches of audio or video based on the task. However, in our experiments, we observed that the attention of AV-LLMs to various modalities in the higher layers is not strictly dependent on the task: in higher layers, the attention of AV-LLMs shifts more towards the video modality. We also found that directly integrating the temporal KV of audio and the spatial-temporal KV of video may lead to information confusion and significant performance degradation of AV-LLMs. If audio and video are processed indiscriminately, it may also lead to excessive compression or reservation of a certain modality, thereby disrupting the alignment between modalities. To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLM inference. Our method is based on layer-adaptive focusing, which selectively focuses on key modalities according to the characteristics of different layers and enhances the recognition of heavy-hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict the KV cache of low-priority modalities. The experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.
Detection of Bark Beetle Attacks using Hyperspectral PRISMA Data and Few-Shot Learning
Bark beetle infestations represent a serious challenge for maintaining the health of coniferous forests. This paper proposes a few-shot learning approach leveraging contrastive learning to detect bark beetle infestations using satellite PRISMA hyperspectral data. The methodology is based on a contrastive learning framework to pre-train a one-dimensional CNN encoder, enabling the extraction of robust feature representations from hyperspectral data. These extracted features are subsequently utilized as input to support vector regression estimators, one for each class, trained on few labeled samples to estimate the proportions of healthy, attacked by bark beetle, and dead trees for each pixel. Experiments on the area of study in the Dolomites show that our method outperforms the use of original PRISMA spectral bands and of Sentinel-2 data. The results indicate that PRISMA hyperspectral data combined with few-shot learning offers significant advantages for forest health monitoring.
comment: 5 pages, 3 figures, accepted at IGARSS conference 3-8 August 2025 Brisbane, Australia
Machine-Learning Based Detection of Coronary Artery Calcification Using Synthetic Chest X-Rays
Coronary artery calcification (CAC) is a strong predictor of cardiovascular events, with CT-based Agatston scoring widely regarded as the clinical gold standard. However, CT is costly and impractical for large-scale screening, while chest X-rays (CXRs) are inexpensive but lack reliable ground truth labels, constraining deep learning development. Digitally reconstructed radiographs (DRRs) offer a scalable alternative by projecting CT volumes into CXR-like images while inheriting precise labels. In this work, we provide the first systematic evaluation of DRRs as a surrogate training domain for CAC detection. Using 667 CT scans from the COCA dataset, we generate synthetic DRRs and assess model capacity, super-resolution fidelity enhancement, preprocessing, and training strategies. Lightweight CNNs trained from scratch outperform large pretrained networks; pairing super-resolution with contrast enhancement yields significant gains; and curriculum learning stabilises training under weak supervision. Our best configuration achieves a mean AUC of 0.754, comparable to or exceeding prior CXR-based studies. These results establish DRRs as a scalable, label-rich foundation for CAC detection, while laying the foundation for future transfer learning and domain adaptation to real CXRs.
comment: 10 pages, 5 figures. Under review for MIDL 2026
A Space-Time Transformer for Precipitation Forecasting
Meteorological agencies around the world rely on real-time flood guidance to issue life-saving advisories and warnings. For decades, traditional numerical weather prediction (NWP) models have been state-of-the-art for precipitation forecasting. However, physically-parameterized models suffer from a few core limitations: first, solving PDEs to resolve atmospheric dynamics is computationally demanding, and second, these methods degrade in performance at nowcasting timescales (i.e., 0-4 hour lead-times). Motivated by these shortcomings, recent work proposes AI-weather prediction (AI-WP) alternatives that learn to emulate analysis data with neural networks. While these data-driven approaches have enjoyed enormous success across diverse spatial and temporal resolutions, applications of video-understanding architectures for weather forecasting remain underexplored. To address these gaps, we propose SaTformer: a video transformer built on full space-time attention that skillfully forecasts extreme precipitation from satellite radiances. Along with our novel architecture, we introduce techniques to tame long-tailed precipitation datasets. Namely, we reformulate precipitation regression into a classification problem, and employ a class-weighted loss to address label imbalances. Our model scored first place on the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge. Code and model weights are available: https://github.com/leharris3/satformer
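The regression-to-classification reformulation with a class-weighted loss mentioned in the SaTformer abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the bin edges, the inverse-frequency weighting scheme, and the function names are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def to_classes(rain_mm, bin_edges):
    """Turn continuous rainfall amounts into integer class labels (hypothetical bins)."""
    return np.digitize(rain_mm, bin_edges)

def inverse_frequency_weights(labels, num_classes):
    """One common weighting choice (an assumption here): rarer classes get larger weights."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0  # avoid division by zero for empty classes
    return labels.size / (num_classes * counts)

def weighted_cross_entropy(logits, labels, weights):
    """Cross-entropy where each sample is scaled by the weight of its true class."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(labels.size), labels]
    sample_weights = weights[labels]
    return float((sample_weights * per_sample).sum() / sample_weights.sum())

# Toy long-tailed data: mostly dry samples, a few heavy-rain samples.
rain = np.array([0.0, 0.0, 0.0, 0.5, 3.0, 12.0])
edges = np.array([0.1, 1.0, 10.0])          # hypothetical thresholds in mm
labels = to_classes(rain, edges)
weights = inverse_frequency_weights(labels, num_classes=4)
```

With this weighting, the dominant dry class contributes less per sample than the rare heavy-rain class, which is the general effect a class-weighted loss aims for on long-tailed labels.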
SplineSplat: 3D Ray Tracing for Higher-Quality Tomography
We propose a method to efficiently compute tomographic projections of a 3D volume represented by a linear combination of shifted B-splines. To do so, we propose a ray-tracing algorithm that computes 3D line integrals with arbitrary projection geometries. One of the components of our algorithm is a neural network that computes the contribution of the basis functions efficiently. In our experiments, we consider well-posed cases where the data are sufficient for accurate reconstruction without the need for regularization. We achieve higher reconstruction quality than traditional voxel-based methods.
Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids AAAI-26
Estimating the geometric and volumetric properties of transparent deformable liquids is challenging due to optical complexities and dynamic surface deformations induced by container movements. Autonomous robots performing precise liquid manipulation tasks, such as dispensing, aspiration, and mixing, must handle containers in ways that inevitably induce these deformations, complicating accurate liquid state assessment. Current datasets lack comprehensive physics-informed simulation data representing realistic liquid behaviors under diverse dynamic scenarios. To bridge this gap, we introduce Phys-Liquid, a physics-informed dataset comprising 97,200 simulation images and corresponding 3D meshes, capturing liquid dynamics across multiple laboratory scenes, lighting conditions, liquid colors, and container rotations. To validate the realism and effectiveness of Phys-Liquid, we propose a four-stage reconstruction and estimation pipeline involving liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling. Experimental results demonstrate improved accuracy and consistency in reconstructing liquid geometry and volume, outperforming existing benchmarks. The dataset and associated validation methods facilitate future advancements in transparent liquid perception tasks. The dataset and code are available at https://dualtransparency.github.io/Phys-Liquid/.
comment: 14 pages, 19 figures. Accepted as an oral paper at AAAI-26 (Main Technical Track). Code and dataset: https://github.com/dualtransparency/Phys-Liquid-AAAI Project page: https://dualtransparency.github.io/Phys-Liquid/
Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image
While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditional information such as text and images to steer the generation process are frequently employed, whereas others, like partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models--Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers--which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison of both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.
comment: 17 pages, 4 figures, 19 tables
Boosting Neural Video Representation via Online Structural Reparameterization
Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have made efforts in architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
comment: 15 pages, 7 figures
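The equivalence that structural reparameterization relies on in the Online-RepNeRV abstract (parallel linear branches collapsing into one) can be shown with a small 1D sketch. This is a generic illustration, not the ERB block itself: the two-branch layout, the kernel sizes, and all function names are assumptions made for the example, and biases and normalization are omitted.

```python
import numpy as np

def fuse_branches(kernel_3, kernel_1):
    """Fold a parallel 1-tap branch into a 3-tap branch (both linear, no bias)."""
    padded = np.zeros(3)
    padded[1] = kernel_1[0]  # centre the 1-tap kernel inside a 3-tap window
    return kernel_3 + padded

def multi_branch(signal, kernel_3, kernel_1):
    """Multi-branch form: two parallel convolutions whose outputs are summed."""
    return (np.convolve(signal, kernel_3, mode="same")
            + np.convolve(signal, kernel_1, mode="same"))

def single_branch(signal, fused_kernel):
    """Single-branch form: one convolution with the fused kernel."""
    return np.convolve(signal, fused_kernel, mode="same")

rng = np.random.default_rng(0)
signal = rng.standard_normal(16)
kernel_3 = rng.standard_normal(3)
kernel_1 = rng.standard_normal(1)
fused = fuse_branches(kernel_3, kernel_1)
```

Because convolution is linear in the kernel, both forms give the same output, so the extra branches add capacity during training without adding cost once they are fused.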
S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation
Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.
From Retinal Pixels to Patients: Evolution of Deep Learning Research in Diabetic Retinopathy Screening
Diabetic Retinopathy (DR) remains a leading cause of preventable blindness, with early detection critical for reducing vision loss worldwide. Over the past decade, deep learning has transformed DR screening, progressing from early convolutional neural networks trained on private datasets to advanced pipelines addressing class imbalance, label scarcity, domain shift, and interpretability. This survey provides the first systematic synthesis of DR research spanning 2016-2025, consolidating results from 50+ studies and over 20 datasets. We critically examine methodological advances, including self- and semi-supervised learning, domain generalization, federated training, and hybrid neuro-symbolic models, alongside evaluation protocols, reporting standards, and reproducibility challenges. Benchmark tables contextualize performance across datasets, while discussion highlights open gaps in multi-center validation and clinical trust. By linking technical progress with translational barriers, this work outlines a practical agenda for reproducible, privacy-preserving, and clinically deployable DR AI. Beyond DR, many of the surveyed innovations extend broadly to medical imaging at scale.
comment: Accepted in IEEE BigData 2025
LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+δ$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
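The idea of marking non-essential tiles early and propagating skip decisions forward, as described in the LiteAttention abstract, can be sketched as a monotonically growing skip mask. This is only a toy illustration: the per-tile score, the fixed threshold, and the function name are hypothetical, and the actual kernel built on FlashAttention is not reproduced here.

```python
import numpy as np

def propagate_skips(tile_scores_per_step, threshold):
    """Track which attention tiles are computed at each denoising step.

    tile_scores_per_step: array of shape (steps, tiles) holding a hypothetical
    importance score per tile. A tile whose score falls below `threshold`
    while it is still active is skipped at all later steps.
    """
    num_tiles = tile_scores_per_step.shape[1]
    skip = np.zeros(num_tiles, dtype=bool)
    computed_per_step = []
    for scores in tile_scores_per_step:
        active = ~skip
        computed_per_step.append(int(active.sum()))
        # Only active tiles are evaluated; skipped tiles are never re-profiled.
        skip = skip | (active & (scores < threshold))
    return computed_per_step, skip

scores = np.array([
    [0.90, 0.01, 0.50, 0.02],
    [0.80, 0.90, 0.01, 0.90],
    [0.70, 0.90, 0.90, 0.90],
])
computed, final_skip = propagate_skips(scores, threshold=0.05)
print(computed)  # [4, 2, 1]
```

In this toy run the amount of work shrinks from step to step because skip decisions are never revisited, which mirrors the temporal-coherence assumption stated in the abstract.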
CareCom: Generative Image Composition with Calibrated Reference Features
Image composition aims to seamlessly insert a foreground object into a background. Despite the huge progress in generative image composition, existing methods still struggle with simultaneous detail preservation and foreground pose/view adjustment. To address this issue, we extend the existing generative composition model to a multi-reference version, which allows using an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of foreground reference images to make them compatible with the background information. The calibrated reference features can supplement the original reference features with useful global and local information of proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model can greatly benefit from the calibrated reference features.
NP-LoRA: Null Space Projection Unifies Subject and Style in LoRA Fusion
Low-Rank Adaptation (LoRA) fusion has emerged as a key technique for reusing and composing learned subject and style representations for controllable generation without costly retraining. However, existing methods rely on weight-based merging, where one LoRA often dominates the other, leading to interference and degraded fidelity. This interference is structural: separately trained LoRAs occupy low-rank high-dimensional subspaces, leading to non-orthogonal and overlapping representations. In this work, we analyze the internal structure of LoRAs and find their generative behavior is dominated by a few principal directions in the low-rank subspace, which should remain free from interference during fusion. To achieve this, we propose Null Space Projection LoRA (NP-LoRA), a projection-based framework for LoRA fusion that enforces subspace separation to prevent structural interference among principal directions. Specifically, we first extract principal style directions via singular value decomposition (SVD) and then project the subject LoRA into its orthogonal null space. Furthermore, we introduce a soft projection mechanism that enables smooth control over the trade-off between subject fidelity and style consistency. Experiments show NP-LoRA consistently improves fusion quality over strong baselines (e.g., DINO and CLIP-based metrics, with human and LLM preference scores), and applies broadly across backbones and LoRA pairs without retraining.
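The two steps named in the NP-LoRA abstract (extracting principal style directions via SVD, then projecting the subject LoRA into the orthogonal null space, with a soft variant) can be sketched as follows. This is a minimal sketch under stated assumptions: the choice of right singular vectors, the number of directions `k`, the linear `strength` parameter for the soft projection, and the function name are all hypothetical and may differ from the paper's formulation.

```python
import numpy as np

def null_space_project(delta_subject, delta_style, k=2, strength=1.0):
    """Remove from the subject update its component along the top-k style directions.

    delta_subject, delta_style: weight updates of shape (d_out, d_in).
    strength=1.0 is a hard projection; values in (0, 1) only attenuate the
    overlapping component (an assumed form of soft projection).
    """
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(delta_style, full_matrices=False)
    v_k = vt[:k].T                      # (d_in, k) principal style directions
    projector = v_k @ v_k.T             # projects onto the style subspace
    return delta_subject - strength * (delta_subject @ projector)

rng = np.random.default_rng(0)
delta_style = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))  # rank-2 update
delta_subject = rng.standard_normal((8, 6))
projected = null_space_project(delta_subject, delta_style, k=2, strength=1.0)
```

After a hard projection, the subject update has no component along the selected style directions, so adding the two updates leaves those directions governed by the style LoRA alone.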
PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI AAAI 2026
4D flow magnetic resonance imaging (MRI) is a reliable, non-invasive approach for estimating blood flow velocities, vital for cardiovascular diagnostics. Unlike conventional MRI focused on anatomical structures, 4D flow MRI requires high spatiotemporal resolution for early detection of critical conditions such as stenosis or aneurysms. However, achieving such resolution typically results in prolonged scan times, creating a trade-off between acquisition speed and prediction accuracy. Recent studies have leveraged physics-informed neural networks (PINNs) for super-resolution of MRI data, but their practical applicability is limited as the prohibitively slow training process must be performed for each patient. To overcome this limitation, we propose PINGS-X, a novel framework modeling high-resolution flow velocities using axes-aligned spatiotemporal Gaussian representations. Inspired by the effectiveness of 3D Gaussian splatting (3DGS) in novel view synthesis, PINGS-X extends this concept through several non-trivial innovations: (i) normalized Gaussian splatting with a formal convergence guarantee, (ii) axes-aligned Gaussians that simplify training for high-dimensional data while preserving accuracy and the convergence guarantee, and (iii) a Gaussian merging procedure to prevent degenerate solutions and boost computational efficiency. Experimental results on computational fluid dynamics (CFD) and real 4D flow MRI datasets demonstrate that PINGS-X substantially reduces training time while achieving superior super-resolution accuracy. Our code and datasets are available at https://github.com/SpatialAILab/PINGS-X.
comment: Accepted at AAAI 2026. Supplementary material included after references. 27 pages, 21 figures, 11 tables
Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval AAAI-2026
With the daily influx of 3D data on the internet, text-3D retrieval has gained increasing attention. However, current methods face two major challenges: Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD). HRC compresses abstract-to-specific and whole-to-part hierarchies in Euclidean embeddings, while RISD averages noisy fragments, obscuring critical semantic cues and diminishing the model's ability to distinguish hard negatives. To address these challenges, we introduce the Hyperbolic Hierarchical Alignment Reasoning Network (H$^{2}$ARN) for text-3D retrieval. H$^{2}$ARN embeds both text and 3D data in a Lorentz-model hyperbolic space, where exponential volume growth inherently preserves hierarchical distances. A hierarchical ordering loss constructs a shrinking entailment cone around each text vector, ensuring that the matched 3D instance falls within the cone, while an instance-level contrastive loss jointly enforces separation from non-matching samples. To tackle RISD, we propose a contribution-aware hyperbolic aggregation module that leverages Lorentzian distance to assess the relevance of each local feature and applies contribution-weighted aggregation guided by hyperbolic geometry, enhancing discriminative regions while suppressing redundancy without additional supervision. We also release the expanded T3DR-HIT v2 benchmark, which contains 8,935 text-to-3D pairs, 2.6 times the original size, covering both fine-grained cultural artefacts and complex indoor scenes. Our codes are available at https://github.com/liwrui/H2ARN.
comment: Accepted by AAAI-2026
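For readers unfamiliar with the Lorentz model used in the H$^{2}$ARN abstract, the following sketch computes the standard geodesic distance on the unit hyperboloid. It assumes curvature -1 and the geodesic form of the distance; the paper may use a different curvature or a different Lorentzian distance variant, and the function names are illustrative.

```python
import numpy as np

def lift_to_hyperboloid(v):
    """Map a Euclidean vector onto the unit hyperboloid by solving for the time coordinate."""
    v = np.asarray(v, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + v @ v)], v))

def lorentz_inner(x, y):
    """Lorentzian inner product: negative on the time axis, positive on the space axes."""
    return -x[0] * y[0] + x[1:] @ y[1:]

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid with curvature -1."""
    # Clipping guards against rounding that pushes the argument slightly below 1.
    return float(np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None)))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([np.sinh(1.0), 0.0])
distance = lorentz_distance(origin, point)
```

The example point is chosen so that its distance from the origin is 1 up to rounding, since the lifted time coordinate equals cosh(1).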
SemanticNN: Compressive and Error-Resilient Semantic Offloading for Extremely Weak Devices
With the rapid growth of the Internet of Things (IoT), integrating artificial intelligence (AI) on extremely weak embedded devices has garnered significant attention, enabling improved real-time performance and enhanced data privacy. However, the resource limitations of such devices and unreliable network conditions necessitate error-resilient device-edge collaboration systems. Traditional approaches focus on bit-level transmission correctness, which can be inefficient under dynamic channel conditions. In contrast, we propose SemanticNN, a semantic codec that tolerates bit-level errors in pursuit of semantic-level correctness, enabling compressive and resilient collaborative inference offloading under strict computational and communication constraints. It incorporates a Bit Error Rate (BER)-aware decoder that adapts to dynamic channel conditions and a Soft Quantization (SQ)-based encoder to learn compact representations. Building on this architecture, we introduce Feature-augmentation Learning, a novel training strategy that enhances offloading efficiency. To address encoder-decoder capability mismatches from asymmetric resources, we propose XAI-based Asymmetry Compensation to enhance decoding semantic fidelity. We conduct extensive experiments on STM32 using three models and six datasets across image classification and object detection tasks. Experimental results demonstrate that, under varying transmission error rates, SemanticNN significantly reduces feature transmission volume by 56.82-344.83x while maintaining superior inference accuracy.
CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging
Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification) into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
MPCGNet: A Multiscale Feature Extraction and Progressive Feature Aggregation Network Using Coupling Gates for Polyp Segmentation IJCNN 2025
Automatic polyp segmentation is crucial for assisting doctors in colorectal polyp screening and cancer diagnosis. Despite the progress made by existing methods, polyp segmentation faces several challenges: (1) small-sized polyps are prone to being missed during identification, (2) the boundaries between polyps and the surrounding environment are often ambiguous, and (3) noise in colonoscopy images, caused by uneven lighting and other factors, affects segmentation results. To address these challenges, this paper introduces coupling gates as components in specific modules to filter noise and perform feature importance selection. Three modules are proposed: the coupling gates multiscale feature extraction (CGMFE) module, which effectively extracts local features and suppresses noise; the windows cross attention (WCAD) decoder module, which restores details after capturing the precise location of polyps; and the decoder feature aggregation (DFA) module, which progressively aggregates features, further extracts them, and performs feature importance selection to reduce the loss of small-sized polyps. Experimental results demonstrate that MPCGNet outperforms recent networks, with mDice scores 2.20% and 0.68% higher than the second-best network on the ETIS-LaribPolypDB and CVC-ColonDB datasets, respectively.
comment: 8 pages, 4 figures, 3 tables. This paper has been accepted by IJCNN 2025 but not published
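For readers unfamiliar with the mDice metric reported in the MPCGNet abstract, the following is a minimal sketch of the standard Dice coefficient averaged over images. It assumes binary masks flattened to 0/1 sequences; the empty-mask convention is an assumption, and the paper's exact evaluation protocol may differ.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Both masks empty: treated as perfect agreement (a convention assumed here).
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Average the per-image Dice score over (prediction, ground truth) pairs."""
    scores = [dice_score(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)
```

Because the score is normalized by the combined mask area, a few missed pixels cost far more on a small polyp than on a large one, which is why small-polyp recall weighs heavily on this metric.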
Accelerating Controllable Generation via Hybrid-grained Cache
Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must bear the computational cost of both processing control conditions and generating content, resulting in generally low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained cache (block-level) based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between each step of model reasoning. (2) We design a fine-grained cache (prompt-level) that acts within a module, where the fine-grained cache reuses cross-attention maps within consecutive reasoning steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into each computational link of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, our HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantized performance degradation) within 1.5%.
Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types
Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient's health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.
comment: Submitting to MIDL 2026
EmbryoDiff: A Conditional Diffusion Framework with Multi-Focal Feature Fusion for Fine-Grained Embryo Developmental Stage Recognition
Identification of fine-grained embryo developmental stages during In Vitro Fertilization (IVF) is crucial for assessing embryo viability. Although recent deep learning methods have achieved promising accuracy, existing discriminative models fail to utilize the distributional prior of embryonic development to improve accuracy. Moreover, their reliance on single-focal information leads to incomplete embryonic representations, making them susceptible to feature ambiguity under cell occlusions. To address these limitations, we propose EmbryoDiff, a two-stage diffusion-based framework that formulates the task as a conditional sequence denoising process. Specifically, we first train and freeze a frame-level encoder to extract robust multi-focal features. In the second stage, we introduce a Multi-Focal Feature Fusion Strategy that aggregates information across focal planes to construct a 3D-aware morphological representation, effectively alleviating ambiguities arising from cell occlusions. Building on this fused representation, we derive complementary semantic and boundary cues and design a Hybrid Semantic-Boundary Condition Block to inject them into the diffusion-based denoising process, enabling accurate embryonic stage classification. Extensive experiments on two benchmark datasets show that our method achieves state-of-the-art results. Notably, with only a single denoising step, our model obtains the best average test performance, reaching 82.8% and 81.3% accuracy on the two datasets, respectively.
AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning
Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.
SUPER Decoder Block for Reconstruction-Aware U-Net Variants
Skip-connected encoder-decoder architectures (U-Net variants) are widely adopted for inverse problems but still suffer from information loss, limiting recovery of fine high-frequency details. We present Selectively Suppressed Perfect Reconstruction (SUPER), which exploits the perfect reconstruction (PR) property of wavelets to prevent information degradation while selectively suppressing (SS) redundant features. Free from rigid framelet constraints, SUPER serves as a plug-and-play decoder block for diverse U-Net variants, eliminating their intrinsic reconstruction bottlenecks and enhancing representational richness. Experiments across diverse crack benchmarks, including state-of-the-art (SOTA) models, demonstrate the structural potential of the proposed SUPER Decoder Block. Maintaining comparable computational cost, SUPER enriches representational diversity through increased parameterization. In small-scale in-domain experiments on the CrackVision12K dataset, SUPER markedly improves thin-crack segmentation performance, particularly for cracks narrower than 4 px, underscoring its advantage in high-frequency dominant settings. In smartphone image denoising on SIDD, where low-frequency components prevail, SUPER still achieves a moderate gain in PSNR, confirming its robustness across low- and high-frequency regimes. These results validate its plug-and-play generality across U-Net variants, achieving high-frequency fidelity and global coherence within a unified, reconstruction-aware framework.
comment: 8 pages. Under review
SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation ECAI 2025
While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity--adjusting guidance strength based on the prompt--and selectivity--targeting only unsafe regions of the image. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.
comment: Accepted for presentation at TRUST-AI Workshop, ECAI 2025. Proceedings to appear in CEUR-WS
Unsupervised Robust Domain Adaptation: Paradigm, Theory and Algorithm
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a label-rich source domain to an unlabeled target domain by addressing domain shifts. Most UDA approaches emphasize transfer ability, but often overlook robustness against adversarial attacks. Although vanilla adversarial training (VAT) improves the robustness of deep neural networks, it has little effect on UDA. This paper focuses on answering three key questions: 1) Why does VAT, known for its defensive effectiveness, fail in the UDA paradigm? 2) What is the generalization bound theory under attacks and how does it evolve from classical UDA theory? 3) How can we implement a robustification training procedure without complex modifications? Specifically, we explore and reveal the inherent entanglement challenge in the general UDA+VAT paradigm, and propose an unsupervised robust domain adaptation (URDA) paradigm. We further derive the generalization bound theory of the URDA paradigm so that it can resist adversarial noise and domain shift. To the best of our knowledge, this is the first work to establish the URDA paradigm and theory. We further introduce a simple yet effective URDA algorithm called Disentangled Adversarial Robustness Training (DART), a two-step training procedure that ensures both transferability and robustness. DART first pre-trains an arbitrary UDA model, and then applies an instantaneous robustification post-training step via disentangled distillation. Experiments on four benchmark datasets with/without attacks show that DART effectively enhances robustness while maintaining domain adaptability, and validate the URDA paradigm and theory.
comment: To appear in IJCV
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a "visual processing bottleneck": a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available: https://github.com/YU-deep/VisMem.git.
Adaptive LiDAR Scanning: Harnessing Temporal Cues for Efficient 3D Object Detection via Multi-Modal Fusion AAAI
Multi-sensor fusion using LiDAR and RGB cameras significantly enhances 3D object detection. However, conventional LiDAR sensors perform dense, stateless scans, ignoring the strong temporal continuity in real-world scenes. This leads to substantial sensing redundancy and excessive power consumption, limiting their practicality on resource-constrained platforms. To address this inefficiency, we propose a predictive, history-aware adaptive scanning framework that anticipates informative regions of interest (ROI) based on past observations. Our approach introduces a lightweight predictor network that distills historical spatial and temporal contexts into refined query embeddings. These embeddings guide a differentiable Mask Generator network, which leverages Gumbel-Softmax sampling to produce binary masks identifying critical ROIs for the upcoming frame. Our method significantly reduces unnecessary data acquisition by concentrating dense LiDAR scanning only within these ROIs and sparsely sampling elsewhere. Experiments on nuScenes and Lyft benchmarks demonstrate that our adaptive scanning strategy reduces LiDAR energy consumption by over 65% while maintaining competitive or even superior 3D object detection performance compared to traditional LiDAR-camera fusion methods with dense LiDAR scanning.
comment: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026
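The Gumbel-Softmax sampling step mentioned in the adaptive LiDAR scanning abstract can be illustrated with a small, framework-free sketch. This is not the authors' Mask Generator: the function names, the two-logit (keep/drop) parameterization, and the temperature `tau` are assumptions made for illustration, and gradient flow is not modelled.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_mask(keep_logits, drop_logits, tau=1.0, rng=None):
    """Return (soft, hard) keep-masks for a list of candidate regions.

    soft[i] is the relaxed probability of keeping region i and hard[i] is its
    0/1 discretisation. In a differentiable framework the hard value would be
    used in the forward pass with gradients taken through the soft one
    (straight-through estimator); that part is not modelled here.
    """
    rng = rng or random.Random()
    soft, hard = [], []
    for keep, drop in zip(keep_logits, drop_logits):
        a = (keep + gumbel_noise(rng)) / tau
        b = (drop + gumbel_noise(rng)) / tau
        m = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        p_keep = ea / (ea + eb)
        soft.append(p_keep)
        hard.append(1 if p_keep >= 0.5 else 0)
    return soft, hard
```

Lower values of `tau` push the soft mask toward 0 or 1, while higher values keep it smoother; the noise is what lets regions with borderline logits be sampled occasionally rather than always dropped.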
BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading NeurIPS 2025
We introduce BecomingLit, a novel method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. To this end, we propose a new low-cost light stage capture setup, tailored specifically towards capturing faces. Using this setup, we collect a novel dataset consisting of diverse multi-view sequences of numerous subjects under varying illumination conditions and facial expressions. By leveraging our new dataset, we introduce a new relightable avatar representation based on 3D Gaussian primitives that we animate with a parametric head model and an expression-dependent dynamics module. We propose a new hybrid neural shading approach, combining a neural diffuse BRDF with an analytical specular term. Our method reconstructs disentangled materials from our dynamic light stage recordings and enables all-frequency relighting of our avatars with both point lights and environment maps. In addition, our avatars can easily be animated and controlled from monocular videos. We validate our approach in extensive experiments on our dataset, where we consistently outperform existing state-of-the-art methods in relighting and reenactment by a significant margin.
comment: NeurIPS 2025, Project Page: see https://jonathsch.github.io/becominglit/ , YouTube Video: see https://youtu.be/xPyeIqKdszA
Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models AAAI 2026
Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks.
comment: AAAI 2026
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each testing image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight learnable scaling factor. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and general conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the optimal input with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively. Codes and weights will be released on https://github.com/liyih/IAPL.
comment: under review, codes: https://github.com/liyih/IAPL
Efficient Bayer-Domain Video Computer Vision with Fast Motion Estimation and Learned Perception Residual
Video computer vision systems face substantial computational burdens arising from two fundamental challenges: eliminating unnecessary processing and reducing temporal redundancy in back-end inference while maintaining accuracy with minimal extra computation. To address these issues, we propose an efficient video computer vision framework that jointly optimizes both the front end and back end of the pipeline. On the front end, we remove the traditional image signal processor (ISP) and feed Bayer raw measurements directly into Bayer-domain vision models, avoiding costly human-oriented ISP operations. On the back end, we introduce a fast and highly parallel motion estimation algorithm that extracts inter-frame temporal correspondence to avoid redundant computation. To mitigate artifacts caused by motion inaccuracies, we further employ lightweight perception residual networks that directly learn perception-level residuals and refine the propagated features. Experiments across multiple models and tasks demonstrate that our system achieves substantial acceleration with only minor performance degradation.
Self-Diffusion Driven Blind Imaging
Optical imaging systems are inherently imperfect due to diffraction limits, lens manufacturing tolerances, assembly misalignment, and other physical constraints. In addition, unavoidable camera shake and object motion further introduce non-ideal degradations during acquisition. These aberrations and motion-induced variations are typically unknown, difficult to measure, and costly to model or calibrate in practice. Blind inverse problems offer a promising direction by jointly estimating both the latent image and the unknown degradation kernel. However, existing approaches often suffer from convergence instability, limited prior expressiveness, and sensitivity to hyperparameters. Inspired by recent advances in self-diffusion, we propose DeblurSDI, a zero-shot, self-supervised blind imaging framework that requires no pre-training. DeblurSDI formulates blind image recovery as an iterative reverse self-diffusion process that begins from pure noise and progressively refines both the sharp image and the blur kernel. Extensive experiments on combined optical aberrations and motion blur demonstrate that DeblurSDI consistently outperforms other methods by a substantial margin.
Leveraging NTPs for Efficient Hallucination Detection in VLMs AACL
Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
comment: Accepted to The First Workshop on Confabulation, Hallucinations, & Overgeneration in Multilingual & Precision-critical Setting - AACL-IJCNLP2025
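As a rough illustration of the NTP-based approach above, the sketch below turns the next-token probabilities of one generated statement into a few summary features that a lightweight classifier could consume. The abstract does not list the exact signals used, so the three features here (`min_prob`, `mean_prob`, `mean_log_prob`) are assumptions chosen for illustration.

```python
import math


def ntp_features(token_probs):
    """Summarise the next-token probabilities of one generated statement.

    The feature set is a hypothetical example, not the paper's feature list.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Floor each probability so that log() is defined even for p == 0.
    log_probs = [math.log(max(p, 1e-12)) for p in token_probs]
    return {
        "min_prob": min(token_probs),
        "mean_prob": sum(token_probs) / len(token_probs),
        "mean_log_prob": sum(log_probs) / len(log_probs),
    }
```

Under the abstract's hypothesis that low NTP values accompany hallucinations, a statement with a small `min_prob` would be a natural candidate for the hallucinated class, though the actual decision is left to the trained model.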
Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects WACV 2026
Image segmentation foundation models (SFMs) like Segment Anything Model (SAM) have achieved impressive zero-shot and interactive segmentation across diverse domains. However, they struggle to segment objects with certain structures, particularly those with dense, tree-like morphology and low textural contrast from their surroundings. These failure modes are crucial for understanding the limitations of SFMs in real-world applications. To systematically study this issue, we introduce interpretable metrics quantifying object tree-likeness and textural separability. On carefully controlled synthetic experiments and real-world datasets, we show that SFM performance (e.g., SAM, SAM 2, HQ-SAM) noticeably correlates with these factors. We attribute these failures to SFMs misinterpreting local structure as global texture, resulting in over-segmentation or difficulty distinguishing objects from similar backgrounds. Notably, targeted fine-tuning fails to resolve this issue, indicating a fundamental limitation. Our study provides the first quantitative framework for modeling the behavior of SFMs on challenging structures, offering interpretable insights into their segmentation capabilities.
comment: Accepted at WACV 2026. Code: https://github.com/mazurowski-lab/SAMFailureMetrics
Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models
Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
comment: 12 pages, Under review
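The class-specific thresholds in the ACE abstract are described as refined with an exponential moving average. A minimal sketch of that idea follows, under stated assumptions: the `momentum` value, the dictionary-based cache, and the rule of admitting a sample before updating the threshold are all hypothetical, and the exploration-augmented updates mentioned in the abstract are not modelled.

```python
def update_threshold(threshold, confidence, momentum=0.9):
    """Exponential moving average of a class threshold toward a new confidence."""
    return momentum * threshold + (1.0 - momentum) * confidence


def maybe_cache(cache, thresholds, label, embedding, confidence, momentum=0.9):
    """Store the embedding if it clears its class threshold, then refresh the threshold.

    cache maps class label -> list of embeddings; thresholds maps class label
    -> current threshold (assumed to be initialised elsewhere, e.g. from
    zero-shot statistics).
    """
    stored = confidence >= thresholds[label]
    if stored:
        cache.setdefault(label, []).append(embedding)
    thresholds[label] = update_threshold(thresholds[label], confidence, momentum)
    return stored
```

Keeping one threshold per class means that a class whose predictions are systematically less confident under distribution shift can still populate its cache, instead of being starved by a single global cut-off.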
Unifying Segment Anything in Microscopy with Vision-Language Knowledge
Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute this deficiency to a lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose a novel framework that seamlessly uses MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to regularize SAM. Our method achieves performance improvements of 11.8% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 9.2% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.
comment: 15 pages, 5 figures
RiverScope: High-Resolution River Masking Dataset
Surface water dynamics play a critical role in Earth's climate system, influencing ecosystems, agriculture, disaster resilience, and sustainable development. Yet monitoring rivers and surface water at fine spatial and temporal scales remains challenging -- especially for narrow or sediment-rich rivers that are poorly captured by low-resolution satellite data. To address this, we introduce RiverScope, a high-resolution dataset developed through collaboration between computer science and hydrology experts. RiverScope comprises 1,145 high-resolution images (covering 2,577 square kilometers) with expert-labeled river and surface water masks, requiring over 100 hours of manual annotation. Each image is co-registered with Sentinel-2, SWOT, and the SWOT River Database (SWORD), enabling the evaluation of cost-accuracy trade-offs across sensors -- a key consideration for operational water monitoring. We also establish the first global, high-resolution benchmark for river width estimation, achieving a median error of 7.2 meters -- significantly outperforming existing satellite-derived methods. We extensively evaluate deep networks across multiple architectures (e.g., CNNs and transformers), pretraining strategies (e.g., supervised and self-supervised), and training datasets (e.g., ImageNet and satellite imagery). Our best-performing models combine the benefits of transfer learning with the use of all the multispectral PlanetScope channels via learned adaptors. RiverScope provides a valuable resource for fine-scale and multi-sensor hydrological modeling, supporting climate adaptation and sustainable water management.
CSGaze: Context-aware Social Gaze Prediction
A person's gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context-aware multimodal approach that leverages facial and scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model's decision-making process. We also demonstrate our model's generalizability by testing it on open-set datasets, demonstrating its robustness across diverse scenarios.
Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection AAAI 2026
Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and eliminating redundant noise.
comment: Accepted by AAAI 2026
GreatSplicing: A Semantically Rich Splicing Dataset
In existing splicing forgery datasets, the insufficient semantic variety of spliced regions causes trained detection models to overfit semantic features rather than learn genuine splicing traces. Meanwhile, the lack of a reasonable benchmark dataset has led to inconsistent experimental settings across existing detection methods. To address these issues, we propose GreatSplicing, a manually created, large-scale, high-quality splicing dataset. GreatSplicing comprises 5,000 spliced images and covers spliced regions across 335 distinct semantic categories, enabling detection models to learn splicing traces more effectively. Empirical results show that detection models trained on GreatSplicing achieve low misidentification rates and stronger cross-dataset generalization compared to existing datasets. GreatSplicing is now publicly available for research purposes at the following link.
comment: This version updates the author list and author order, and incorporates changes to the content
Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
comment: Accepted for presentation in IEEE Globecom 2025
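The Pareto-frontier selection described in the token-merging abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the configuration names, FLOP counts, and accuracies are made-up placeholders, and the function names `pareto_front` and `pick_under_budget` are hypothetical. The abstract's Gaussian-process Bayesian optimization step, which proposes the candidate configurations, is not modeled here.

```python
def pareto_front(configs):
    """Return the non-dominated configs, sorted by increasing cost.

    configs: list of (name, flops, accuracy) tuples.
    A config is dominated if another one is no more expensive and
    no less accurate, and strictly better in at least one of the two.
    """
    front = []
    best_acc = float("-inf")
    # Sort by cost; on equal cost, visit the more accurate config first.
    for cfg in sorted(configs, key=lambda c: (c[1], -c[2])):
        if cfg[2] > best_acc:
            front.append(cfg)
            best_acc = cfg[2]
    return front


def pick_under_budget(front, flops_budget):
    """Runtime adaptation: most accurate frontier config within the budget."""
    feasible = [cfg for cfg in front if cfg[1] <= flops_budget]
    return max(feasible, key=lambda c: c[2]) if feasible else None


# Placeholder numbers, for illustration only (not results from the paper).
candidates = [
    ("a", 10, 0.70),
    ("b", 8, 0.72),
    ("c", 12, 0.80),
    ("d", 12, 0.75),
    ("e", 5, 0.60),
]
front = pareto_front(candidates)
```

At run time, a device with a tighter compute budget (or a worse channel) would call `pick_under_budget` with a smaller budget and receive a more aggressive merging configuration from the same precomputed frontier.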
OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting
Time series forecasting is fundamental to diverse applications, with recent approaches leveraging large vision models (LVMs) to capture temporal patterns through visual representations. We reveal that while vision models enhance forecasting performance, 99% of their parameters are unnecessary for time series tasks. Through cross-modal analysis, we find that time series align with low-level textural features but not high-level semantics, which can impair forecasting accuracy. We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. Using pre-trained LVMs as privileged teachers, OccamVTS employs pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering out semantic noise. Counterintuitively, this aggressive parameter reduction improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns. Extensive experiments across multiple benchmark datasets demonstrate that OccamVTS consistently achieves state-of-the-art performance with only 1% of the original parameters, particularly excelling in few-shot and zero-shot scenarios.
Adaptive Parametric Activation: Unifying and Generalising Activation Functions Across Tasks ECCV2024
The activation function plays a crucial role in model optimisation, yet the optimal choice remains unclear. For example, the Sigmoid activation is the de-facto activation in balanced classification tasks; however, in imbalanced classification, it proves inappropriate due to bias towards frequent classes. In this work, we delve deeper into this phenomenon by performing a comprehensive statistical analysis in the classification and intermediate layers of both balanced and imbalanced networks, and we empirically show that aligning the activation function with the data distribution enhances the performance in both balanced and imbalanced tasks. To this end, we propose the Adaptive Parametric Activation (APA) function, a novel and versatile activation function that unifies most common activation functions under a single formula. APA can be applied in both intermediate layers and attention layers, significantly outperforming the state-of-the-art on several imbalanced benchmarks such as ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT and LVIS. We also extend APA to a plethora of other tasks such as classification, detection, visual instruction following tasks, image generation and next-text-token prediction benchmarks. APA increases the performance in multiple benchmarks across various model architectures. The code is available at https://github.com/kostas1515/AGLU.
comment: Version 2: 19 pages, 7 figures, 13 Tables. Extension of the ECCV2024 oral paper arXiv:2407.08567v2
Enhancing Video Inpainting with Aligned Frame Interval Guidance
Recent image-to-video (I2V) based video inpainting methods have made significant strides by leveraging single-image priors and modeling temporal consistency across masked frames. Nevertheless, these methods suffer from severe content degradation within video chunks. Furthermore, the absence of a robust frame alignment scheme compromises intra-chunk and inter-chunk spatiotemporal stability, resulting in insufficient control over the entire video. To address these limitations, we propose VidPivot, a novel framework that decouples video inpainting into two sub-tasks: multi-frame consistent image inpainting and masked area motion propagation. Our approach introduces frame interval priors as spatiotemporal cues to guide the inpainting process. To enhance cross-frame coherence, we design a FrameProp Module that implements a frame content propagation strategy, diffusing reference frame content into subsequent frames via a splicing mechanism. Additionally, a dedicated context controller encodes these coherent frame priors into the I2V generative backbone, effectively serving as a soft constraint to suppress content distortion during generation. Extensive evaluations demonstrate that VidPivot achieves competitive performance across diverse benchmarks and generalizes well to different video inpainting scenarios.
comment: 15 pages
First-Order Error Matters: Accurate Compensation for Quantized Large Language Models AAAI 2026
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an approximation based on the difference between latent and full-precision weights as well as the Hessian matrix. When substituted into the theoretical solution, the formulation eliminates the need to explicitly compute the Hessian, thereby avoiding the high computational cost and limited generalization of backpropagation-based gradient methods. This design introduces only minimal additional computational overhead. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 17.3% and increases the 5-shot MMLU accuracy from 53.8% achieved by GPTAQ to 56.1%. Moreover, FOEM can be seamlessly combined with advanced techniques such as SpinQuant, delivering additional gains under the challenging W4A4KV4 setting and further narrowing the performance gap with full-precision baselines, surpassing existing state-of-the-art methods.
comment: Accepted by AAAI 2026. The code is available at https://github.com/Xingyu-Zheng/FOEM
NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation CVPR 2026
Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we present a new training-free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals' bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training or fine-tuning, outperforms the best RGB and RGB-D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
comment: 9 pages, 3 figures, 5 tables, CVPR 2026 preprint
FlowLensing: Simulating Gravitational Lensing with Flow Matching
Gravitational lensing is one of the most powerful probes of dark matter, yet creating high-fidelity lensed images at scale remains a bottleneck. Existing tools rely on ray-tracing or forward-modeling pipelines that, while precise, are prohibitively slow. We introduce FlowLensing, a Diffusion Transformer-based compact and efficient flow-matching model for strong gravitational lensing simulation. FlowLensing operates in both discrete and continuous regimes, handling classes such as different dark matter models as well as continuous model parameters ensuring physical consistency. By enabling scalable simulations, our model can advance dark matter studies, specifically for probing dark matter substructure in cosmological surveys. We find that our model achieves a speedup of over 200$\times$ compared to classical simulators for intensive dark matter models, with high fidelity and low inference latency. FlowLensing enables rapid, scalable, and physically consistent image synthesis, offering a practical alternative to traditional forward-modeling pipelines.
comment: 6 pages, 2 figures, 3 tables
Explicit Multimodal Graph Modeling for Human-Object Interaction Detection
Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level visual and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art (SOTA) performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.
TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models AAAI 2026
Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
comment: Accepted to AAAI 2026. Camera-ready version
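The grayscale pixel-difference half of the TTF detection step, together with hard fusion, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the patch size, the threshold, and the function names `changed_patches` and `hard_fuse` are assumptions, and the attention-based relevance check and keyframe anchoring mentioned in the abstract are omitted.

```python
def changed_patches(prev, curr, patch, threshold):
    """Flag each patch whose mean absolute grayscale difference exceeds threshold.

    prev, curr: 2D lists of grayscale intensities with identical shape.
    Patches are visited in row-major order, matching a ViT-style token order.
    """
    height, width = len(curr), len(curr[0])
    flags = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            diffs = [
                abs(curr[r][c] - prev[r][c])
                for r in range(top, min(top + patch, height))
                for c in range(left, min(left + patch, width))
            ]
            flags.append(sum(diffs) / len(diffs) > threshold)
    return flags


def hard_fuse(prev_tokens, curr_tokens, flags):
    """Hard fusion: keep the current token where the patch changed, else reuse the old one."""
    return [c if changed else p for p, c, changed in zip(prev_tokens, curr_tokens, flags)]


# Toy 4x4 frames: only the top-left 2x2 patch changes between timesteps.
prev_frame = [[0] * 4 for _ in range(4)]
curr_frame = [[10, 10, 0, 0], [10, 10, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
flags = changed_patches(prev_frame, curr_frame, patch=2, threshold=5)
fused = hard_fuse(["p0", "p1", "p2", "p3"], ["c0", "c1", "c2", "c3"], flags)
```

Each token is taken wholesale from one timestep rather than blended, which is what makes the fusion "hard" in this sketch.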
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Visual grouping -- operationalized through tasks such as instance segmentation, visual grounding, and object detection -- enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by +24-36% -- achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.
comment: Project website: https://github.com/weikaih04/Synthetic-Detection-Segmentation-Grounding-Data
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation
comment: 10 pages
SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models
Layer pruning has emerged as a potent approach to remove redundant layers in the pre-trained network for the purpose of reducing network size and improving computational efficiency. However, existing layer pruning methods mostly overlook the intrinsic connections and inter-dependencies between different layers within complicated deep neural networks. This oversight can result in pruned models that do not preserve the essential characteristics of the pre-trained network as effectively as desired. To address these limitations, we propose Similarity-Guided Layer Partition (SGLP) Pruning, a novel pruning framework that exploits representation similarity to guide efficient and informed layer removal for compressing large deep models. Our method begins by employing Centered Kernel Alignment (CKA) to quantify representational similarity between layers, uncovering structural patterns within the network. We then apply Fisher Optimal Segmentation on the similarity matrix to partition the network into semantically coherent layer segments. This segmentation allows pruning decisions to respect layer interdependencies and preserve essential knowledge. Within each segment, we introduce a fine-tuning-free importance evaluation using GradNorm, identifying and removing redundant layers in a targeted, segment-wise manner. Experimental results on both image classification tasks and large language models (LLMs) demonstrate that our proposed SGLP outperforms the state-of-the-art methods in accuracy and efficiency. Our approach achieves significant model compression with minimal performance degradation, making it well-suited for deployment in resource-limited environments.
comment: 16 pages
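The CKA similarity that the SGLP abstract uses to compare layers can be illustrated with the linear variant of CKA. Whether the authors use a linear or a kernel variant is not stated in the abstract, so the choice here is an assumption, and the function names are hypothetical. Each input is a matrix of activations with one row per example and one column per feature.

```python
import math


def center_columns(m):
    """Subtract the per-column mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(x, y)
    denominator = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return numerator / denominator


# Toy activations for four examples and two features (illustrative values).
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
layer_b = [[2.0 * v for v in row] for row in layer_a]  # same representation, rescaled
```

Evaluating `linear_cka` for every pair of layers yields the similarity matrix on which a segmentation step, such as the Fisher Optimal Segmentation named in the abstract, would then operate.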
RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation
We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
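The subject-wise partitioning mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a simple round-robin assignment of subjects to folds. The released baseline code may split differently, and the function name and subject identifiers are hypothetical.

```python
def subject_wise_folds(clip_subjects, n_folds=5):
    """Assign clip indices to folds so that no subject spans two folds.

    clip_subjects: one subject identifier per clip, in clip order.
    Returns a list of n_folds lists of clip indices.
    """
    subjects = sorted(set(clip_subjects))
    # Round-robin over subjects, not over clips, to rule out leakage.
    fold_of = {subject: i % n_folds for i, subject in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for clip_index, subject in enumerate(clip_subjects):
        folds[fold_of[subject]].append(clip_index)
    return folds


# Hypothetical subject identifiers for eight clips.
clips = ["s1", "s2", "s1", "s3", "s4", "s5", "s6", "s2"]
folds = subject_wise_folds(clips, n_folds=3)
```

Splitting by subject rather than by clip means a classifier is always evaluated on animals it never saw during training, which is the leakage the abstract's protocol guards against.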
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning AAAI 2026
Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
comment: Accepted by AAAI 2026
FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation
In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Project page: https://pku-yuangroup.github.io/FlashI2V/
Volley Revolver: A Novel Matrix-Encoding Method for Privacy-Preserving Neural Networks (Inference)
In this work, we present a novel matrix-encoding method that is particularly convenient for neural networks to make predictions in a privacy-preserving manner using homomorphic encryption. Based on this encoding method, we implement a convolutional neural network for handwritten image classification over encryption. For two matrices $A$ and $B$ to perform homomorphic multiplication, the main idea behind it, in a simple version, is to encrypt matrix $A$ and the transpose of matrix $B$ into two ciphertexts respectively. With additional operations, the homomorphic matrix multiplication can be calculated over encrypted matrices efficiently. For the convolution operation, we in advance span each convolution kernel to a matrix space of the same size as the input image so as to generate several ciphertexts, each of which is later used together with the ciphertext encrypting input images for calculating some of the final convolution results. We accumulate all these intermediate results and thus complete the convolution operation. In a public cloud with 40 vCPUs, our convolutional neural network implementation on the MNIST testing dataset takes $\sim$ 287 seconds to compute ten likelihoods of 32 encrypted images of size $28 \times 28$ simultaneously. The data owner only needs to upload one ciphertext ($\sim 19.8$ MB) encrypting these 32 images to the public cloud.
comment: The encoding method we proposed in this work, $\texttt{Volley Revolver}$, is particularly tailored for privacy-preserving neural networks. There is a great chance that it can be used to assist the private neural networks training, in which case for the backpropagation algorithm of the fully-connected layer the first matrix $A$ is revolved while the second matrix $B$ is settled to be still
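The transpose-based multiplication idea described in the Volley Revolver abstract can be illustrated in plaintext. The sketch below is not the authors' encrypted implementation: it uses no homomorphic encryption, and the function names and the rotation schedule (offset k pairs row i of A with row (i + k) mod p of the transposed B) are illustrative assumptions. It only shows why storing the transpose of B is convenient: every entry of the product becomes a row-by-row dot product.

```python
def transpose(m):
    """Return the transpose of a row-major matrix."""
    return [list(col) for col in zip(*m)]


def matmul_via_transpose(a, b):
    """Compute a @ b using only row-wise products between a and transpose(b).

    a: n x m matrix, b: m x p matrix, both as lists of rows.
    For each rotation offset k, row i of a is paired with row (i + k) mod p
    of transpose(b); the dot product of the pair is entry (i, (i + k) mod p).
    """
    bt = transpose(b)  # p x m, so its rows align with the rows of a
    n, p = len(a), len(bt)
    c = [[0] * p for _ in range(n)]
    for k in range(p):  # one "revolution" step per offset
        for i in range(n):
            j = (i + k) % p
            c[i][j] = sum(x * y for x, y in zip(a[i], bt[j]))
    return c


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_via_transpose(A, B)
```

In an encrypted setting, the elementwise products and row sums would be replaced by ciphertext multiplications and rotations; that part is outside the scope of this sketch.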
Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
With the rapid development of deep learning, low-light RAW image enhancement (LLRIE) has achieved remarkable progress. However, the challenge of simultaneously achieving strong enhancement quality and high efficiency remains. Leveraging the inherent efficiency of Channel Attention and Mamba, we introduce a Hierarchical Mixing Architecture (HiMA), a hybrid LLRIE framework built upon two core modules. Specifically, we introduce a Large Scale Block (LSB) for upper layers and a Small Scale Block (SSB) for lower layers, which reduce the number of parameters while improving performance. Based on this framework, we also introduce a novel Local Distribution Adjustment (LoDA) module that adaptively aligns local feature statistics in a content-aware manner by learning to adjust regional luminance and contrast distributions. Moreover, to alleviate the domain ambiguity commonly observed in existing LLRIE pipelines, we design a Multi-Prior Fusion (MPF) module that leverages three complementary priors extracted from the first stage of the hybrid architecture to maintain domain consistency. Extensive experiments on multiple public benchmarks demonstrate that our approach outperforms state-of-the-art methods, delivering superior performance with fewer parameters. Code is available at https://github.com/Cynicarlos/HiMA.
StreamDiT: Real-Time Streaming Text-to-Video Generation
Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To put the proposed method into practice, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU and can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website: https://cumulo-autumn.github.io/StreamDiT/
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.
comment: 24 pages
Generative AI in Map-Making: A Technical Exploration and Its Implications for Cartographers
Traditional map-making relies heavily on Geographic Information Systems (GIS), requiring domain expertise and being time-consuming, especially for repetitive tasks. Recent advances in generative AI (GenAI), particularly image diffusion models, offer new opportunities for automating and democratizing the map-making process. However, these models struggle with accurate map creation due to limited control over spatial composition and semantic layout. To address this, we integrate vector data to guide map generation in different styles, specified by the textual prompts. Our model is the first to generate accurate maps in controlled styles, and we have integrated it into a web application to improve its usability and accessibility. We conducted a user study with professional cartographers to assess the fidelity of generated maps, the usability of the web application, and the implications of ever-emerging GenAI in map-making. The findings have suggested the potential of our developed application and, more generally, the GenAI models in helping both non-expert users and professionals in creating maps more efficiently. We have also outlined further technical improvements and emphasized the new role of cartographers to advance the paradigm of AI-assisted map-making. The code and pre-trained models are available at https://github.com/claudaff/generative-ai-mapmaking/.
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment AAAI 2026
We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks globally coherent text-to-image generation in autoregressive LLMs without architectural modifications. Different from prior works that require complex architectural redesigns, ARRA aligns the LLM's hidden states with visual representations from external visual foundation models via a global visual alignment loss and a hybrid token. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training T2I LLMs from scratch, ARRA reduces FID by 16.6% (ImageNet) and 12.0% (LAION-COCO) for autoregressive LLMs like LlamaGen, without modifying the original architecture or inference mechanism. For training from text-generation-only LLMs, ARRA reduces FID by 25.5% (MIMIC-CXR) and 8.8% (DeepEyeNet) for advanced LLMs like Chameleon. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). These results demonstrate that training objective redesign, rather than architectural modifications, can resolve cross-modal global coherence challenges. ARRA offers a complementary paradigm for advancing autoregressive models. The code is available at https://github.com/HKU-HealthAI/ARRA.
comment: Accepted by AAAI 2026 Oral
Duplex-GS: Proxy-Guided Weighted Blending for Real-Time Order-Independent Gaussian Splatting
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated remarkable rendering fidelity and efficiency. However, these methods still rely on computationally expensive sequential alpha-blending operations, resulting in significant overhead, particularly on resource-constrained platforms. In this paper, we propose Duplex-GS, a dual-hierarchy framework that integrates proxy Gaussian representations with order-independent rendering techniques to achieve photorealistic results while sustaining real-time performance. To mitigate the overhead caused by view-adaptive radix sort, we introduce cell proxies for local Gaussian management and propose cell search rasterization for further acceleration. By seamlessly combining our framework with Order-Independent Transparency (OIT), we develop a physically inspired weighted sum rendering technique that simultaneously eliminates "popping" and "transparency" artifacts, yielding substantial improvements in both accuracy and efficiency. Extensive experiments on a variety of real-world datasets demonstrate the robustness of our method across diverse scenarios, including multi-scale training views and large-scale environments. Our results validate the advantages of the OIT rendering paradigm in Gaussian Splatting, achieving high-quality rendering with a 1.5× to 4× speedup over existing OIT-based Gaussian Splatting approaches and a 52.2% to 86.9% reduction of the radix sort overhead without quality degradation.
comment: submitted to TCSVT
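To illustrate why a weighted-sum formulation removes the dependence on sort order, here is a minimal sketch. It is not the Duplex-GS renderer: it compares standard front-to-back alpha blending with a generic normalized weighted sum, and the choice of weights (here simply the opacities), the single-channel colors, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

# (color, alpha) of one splat covering a pixel; single channel for brevity.
Fragment = Tuple[float, float]


def alpha_blend(fragments: List[Fragment]) -> float:
    """Front-to-back alpha compositing; the result depends on fragment order."""
    color, transmittance = 0.0, 1.0
    for c, a in fragments:
        color += transmittance * a * c
        transmittance *= 1.0 - a
    return color


def weighted_sum_blend(fragments: List[Fragment]) -> float:
    """Normalized weighted sum; sums do not depend on order (up to rounding)."""
    weight_total = sum(a for _, a in fragments)
    if weight_total == 0.0:
        return 0.0
    return sum(a * c for c, a in fragments) / weight_total


frags = [(1.0, 0.5), (0.0, 0.5)]
swapped = list(reversed(frags))

print(alpha_blend(frags), alpha_blend(swapped))                # differ
print(weighted_sum_blend(frags), weighted_sum_blend(swapped))  # agree
```

Because the weighted sum needs no per-pixel ordering, a renderer built on it can skip the sort that sequential alpha blending requires, which is the overhead the abstract targets.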
Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at https://github.com/jiesihu/Medverse.
Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos
The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotations, which is time-consuming and labor-intensive. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations, including ED and ES indices, segmentation, or volumetric measurements, by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with MAE 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES, despite the fact that the fetal heart model is built using non-standardized heart views due to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code will be released at https://github.com/YingyuYyy/CardiacPhase.
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves results on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs. The code is available at https://github.com/ZhaoyangLi-1/ORIC.
MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion
Despite the demonstrated parameter efficiency of prompt-based fusion, its limited adaptivity and expressiveness hinder its effectiveness for multimodal applications at scale. In this paper, we present the first comprehensive study addressing these limitations. Our key motivation is to "divide and conquer" the vanilla prompt, traditionally shared across all instances, by generating instance-specific prompts. Specifically, we propose the Mixture of Prompt Experts (MoPE), a framework that significantly enhances prompt adaptivity and expressiveness by dynamically generating instance-specific prompts. MoPE leverages multimodal pairings as additional evidence, allowing the model to adaptively select optimal prompts tailored to each individual instance. Unlike traditional prompt-fusion methods, which encounter scalability bottlenecks when optimizing long unified prompts, MoPE maintains fixed prompt length while effectively scaling the number of specialized experts. Moreover, we investigate regularization terms to encourage expert specialization, resulting in highly adaptive and interpretable prompting. MoPE fundamentally changes the scaling dynamic, unlocking greater expressiveness and adaptability to complex multimodal relationships, enabling the model to selectively attend to task-relevant sub-sequences based on instance-specific multimodal input. Extensive experiments across six multimodal datasets spanning four modalities demonstrate state-of-the-art performance for multimodal fusion, matching or surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Code is available: https://github.com/songrise/MoPE.
comment: Accepted to IEEE TMM
MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing
Recent years have witnessed the success of diffusion models in image customization tasks. However, existing mask-guided human erasing methods still struggle in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference, mainly due to the lack of large-scale multi-instance datasets and effective spatial decoupling to separate foreground from background. To bridge these gaps, we curate the MILD dataset capturing diverse poses, occlusions, and complex multi-instance interactions. We then define the Cross-Domain Attention Gap (CAG), an attention-gap metric to quantify semantic leakage. On top of these, we propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways, enabling separate reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play module that incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality. Additionally, we present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, further widening the CAG to effectively minimize boundary artifacts and mitigate semantic leakage. Experiments show that MILD significantly outperforms existing methods. Datasets and code are publicly available at: https://mild-multi-layer-diffusion.github.io/.
A filtering scheme for confocal laser endomicroscopy (CLE)-video sequences for self-supervised learning
Confocal laser endomicroscopy (CLE) is a non-invasive, real-time imaging modality that can be used for in-situ, in-vivo imaging and the microstructural analysis of mucous structures. The diagnosis using CLE is, however, complicated by images being hard to interpret for inexperienced physicians. Utilizing machine learning as an augmentative tool would hence be beneficial, but is complicated by the shortage of histopathology-correlated CLE imaging sequences with respect to the plurality of patterns in this domain, leading to overfitting of machine learning models. To overcome this, self-supervised learning (SSL) can be employed on larger unlabeled datasets. CLE is a video-based modality with high inter-frame correlation, leading to a non-stratified data distribution for SSL training. In this work, we propose a filter functionality on CLE video sequences to reduce the dataset redundancy in SSL training and improve SSL training convergence and training efficiency. We use four state-of-the-art baseline networks and an SSL teacher-student network with a vision transformer small backbone for the evaluation. These networks were evaluated on downstream tasks for a sinonasal tumor dataset and a squamous cell carcinoma of the skin dataset. On both datasets, we found the highest test accuracy on the filtered SSL-pretrained model, with 67.48% and 73.52%, both considerably outperforming their non-SSL baselines. Our results show that SSL is an effective method for CLE pretraining. Further, we show that our proposed CLE video filter can be utilized to improve training efficiency in self-supervised scenarios, resulting in a reduction of 67% in training time.
Concept Retrieval -- What and How?
A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/
Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation
Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, achieving superior segmentation transfer and strong downstream performance on clinically relevant tasks such as LVEF prediction. Code available at: https://github.com/mdivyanshu97/DISCOVR
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only ≈50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using ImageNet-1K for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo: https://github.com/rbalestr-lab/lejepa).
FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection AAAI 2026
Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features, specifically image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g., PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
comment: This paper is accepted by AAAI 2026
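The intuition behind quantizing after softmax numerical stabilization can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the FQ-PETR implementation: the clipping range `clip`, the 8-bit uniform fake-quantizer, the "quantize first" baseline used for contrast, and all function names are hypothetical.

```python
import math
from typing import List


def quantize_uniform(x: float, lo: float, hi: float, bits: int = 8) -> float:
    """Uniform fake-quantization of x onto 2**bits levels spanning [lo, hi]."""
    if hi == lo:
        return lo
    step = (hi - lo) / ((1 << bits) - 1)
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step


def softmax(logits: List[float]) -> List[float]:
    """Reference softmax with the usual max-subtraction for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_quant_after_stabilization(logits: List[float],
                                      clip: float = 16.0,
                                      bits: int = 8) -> List[float]:
    """Subtract the row maximum first, then quantize values in [-clip, 0]."""
    m = max(logits)
    shifted = [quantize_uniform(v - m, -clip, 0.0, bits) for v in logits]
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_quant_before_stabilization(logits: List[float],
                                       bits: int = 8) -> List[float]:
    """Baseline: quantize raw logits over their full range, then softmax."""
    lo, hi = min(logits), max(logits)
    return softmax([quantize_uniform(v, lo, hi, bits) for v in logits])


# One large-magnitude outlier stretches the raw range and coarsens the grid.
logits = [100.0, 99.7, 99.5, -400.0]
reference = softmax(logits)
after = softmax_quant_after_stabilization(logits)
before = softmax_quant_before_stabilization(logits)
err_after = max(abs(a - r) for a, r in zip(after, reference))
err_before = max(abs(b - r) for b, r in zip(before, reference))
print(err_after, err_before)
```

After max-subtraction the inputs lie in a bounded non-positive range, so the quantization grid is spent on the values that actually shape the attention weights rather than on the raw dynamic range.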
Axis-Aligned Document Dewarping AAAI 2026
Document dewarping is crucial for many applications. However, existing learning-based methods rely heavily on supervised regression with annotated data without fully leveraging the inherent geometric properties of physical documents. Our key insight is that a well-dewarped document is defined by its axis-aligned feature lines. This property aligns with the inherent axis-aligned nature of the discrete grid geometry in planar documents. Harnessing this property, we introduce three synergistic contributions: for the training phase, we propose an axis-aligned geometric constraint to enhance document dewarping; for the inference phase, we propose an axis alignment preprocessing strategy to reduce the dewarping difficulty; and for the evaluation phase, we introduce a new metric, Axis-Aligned Distortion (AAD), that not only incorporates geometric meaning and aligns with human visual perception but also demonstrates greater robustness. As a result, our method achieves state-of-the-art performance on multiple existing benchmarks, improving the AAD metric by 18.2% to 34.5%. The code is publicly available at https://github.com/chaoyunwang/AADD.
comment: Accepted at AAAI 2026
Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation
Parabolic mean curvature flow-driven active contour models (PMCF-ACMs) are widely used for image segmentation, yet they suffer severe degradation under high-intensity noise because gradient-descent evolutions exhibit the well-known zig-zag phenomenon. To overcome this drawback, we propose hyperbolic mean curvature flow-driven ACMs (HMCF-ACMs). This novel framework incorporates an adjustable acceleration field to autonomously regulate curve evolution smoothness, providing dual degrees of freedom for adaptive selection of both initial contours and velocity fields. We rigorously prove that HMCF-ACMs are normal flows and establish their numerical equivalence to wave equations through a level set formulation with signed distance functions. An efficient numerical scheme combining spectral discretization and optimized temporal integration is developed to solve the governing equations, and its stability condition is derived through Fourier analysis. Extensive experiments on natural and medical images validate that HMCF-ACMs achieve superior performance under high-noise conditions, demonstrating reduced parameter sensitivity, enhanced noise robustness, and improved segmentation accuracy compared to PMCF-ACMs.
A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears MICCAI 2025
Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotation set is publicly available via Zenodo: https://doi.org/10.5281/zenodo.17514694
comment: 7 pages, 4 figures, 2 tables, accepted at MICCAI 2025 Open Data
MRT: Learning Compact Representations with Mixed RWKV-Transformer for Extreme Image Compression
Recent advances in extreme image compression have revealed that mapping pixel data into highly compact latent representations can significantly improve coding efficiency. However, most existing methods compress images into 2-D latent spaces via convolutional neural networks (CNNs) or Swin Transformers, which tend to retain substantial spatial redundancy, thereby limiting overall compression performance. In this paper, we propose a novel Mixed RWKV-Transformer (MRT) architecture that encodes images into more compact 1-D latent representations by synergistically integrating the complementary strengths of linear-attention-based RWKV and self-attention-based Transformer models. Specifically, MRT partitions each image into fixed-size windows, utilizing RWKV modules to capture global dependencies across windows and Transformer blocks to model local redundancies within each window. The hierarchical attention mechanism enables more efficient and compact representation learning in the 1-D domain. To further enhance compression efficiency, we introduce a dedicated RWKV Compression Model (RCM) tailored to the structural characteristics of the intermediate 1-D latent features in MRT. Extensive experiments on standard image compression benchmarks validate the effectiveness of our approach. The proposed MRT framework consistently achieves superior reconstruction quality at bitrates below 0.02 bits per pixel (bpp). Quantitative results based on the DISTS metric show that MRT significantly outperforms the state-of-the-art 2-D architecture GLC, achieving bitrate savings of 43.75% and 30.59% on the Kodak and CLIC2020 test datasets, respectively.
TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types
We present TEyeD, the world's largest unified public data set of eye images taken with head-mounted devices. TEyeD was acquired with seven different head-mounted eye trackers. Among them, two eye trackers were integrated into virtual reality (VR) or augmented reality (AR) devices. The images in TEyeD were obtained from various tasks, including car rides, simulator rides, outdoor sports activities, and daily indoor activities. The data set includes 2D and 3D landmarks, semantic segmentation, 3D eyeball annotation and the gaze vector and eye movement types for all images. Landmarks and semantic segmentation are provided for the pupil, iris and eyelids. Video lengths vary from a few minutes to several hours. With more than 20 million carefully annotated images, TEyeD provides a unique, coherent resource and a valuable foundation for advancing research in the field of computer vision, eye tracking and gaze estimation in modern VR and AR applications. Download: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FTEyeDS&mode=list
UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations
Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing computational and storage costs while maintaining competitive accuracy. However, most existing KD methods are tailored for homogeneous models and perform poorly in heterogeneous settings, particularly when intermediate features are involved. Semantic discrepancies across architectures hinder effective use of intermediate representations from the teacher model, while prior heterogeneous KD studies mainly focus on the logits space, underutilizing rich semantic information in intermediate layers. To address this, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed, a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Frequency-domain representations are leveraged to capture global semantic knowledge and mitigate representational discrepancies between heterogeneous teacher-student pairs. Specifically, a Feature Transformation Module (FTM) generates compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Extensive experiments on CIFAR-100 and ImageNet-1K demonstrate the effectiveness of the proposed approach, achieving maximum gains of 5.59% and 0.83% over the latest heterogeneous distillation method on the two datasets, respectively. Code will be released soon.
comment: 14 pages, 10 figures
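The joint objective described in the UHKD abstract (mean squared error on intermediate features plus Kullback-Leibler divergence on logits) can be sketched as follows. This is a minimal sketch, not the authors' code: the weight `beta`, the temperature `tau`, the direction KL(teacher || student), and the assumption that both feature vectors are already aligned frequency-domain representations of equal length are illustrative choices that the abstract does not specify.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], tau: float = 1.0) -> List[float]:
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def joint_kd_loss(student_feat: Sequence[float],
                  teacher_feat: Sequence[float],
                  student_logits: Sequence[float],
                  teacher_logits: Sequence[float],
                  beta: float = 1.0,
                  tau: float = 1.0) -> float:
    """Feature-matching term plus a weighted logit-distillation term."""
    feature_term = mse(student_feat, teacher_feat)
    logit_term = kl_divergence(softmax(teacher_logits, tau),
                               softmax(student_logits, tau))
    return feature_term + beta * logit_term


print(joint_kd_loss([1.0, 2.0], [1.0, 4.0], [0.5, 1.5], [0.0, 2.0]))
```

The loss is zero only when both the features and the predicted distributions match, so the student is pulled toward the teacher at the intermediate and output levels at once.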
Image and Video Processing
Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer
Multimodal deep learning (MDL) has emerged as a transformative approach in computational pathology. By integrating complementary information from multiple data sources, MDL models have demonstrated superior predictive performance across diverse clinical tasks compared to unimodal models. However, the assumption that combining modalities inherently improves performance remains largely unexamined. We hypothesise that multimodal gains depend critically on the predictive quality of individual modalities, and that integrating weak modalities may introduce noise rather than complementary information. We test this hypothesis on a prostate cancer dataset with histopathology, radiology, and clinical data to predict time-to-biochemical recurrence. Our results confirm that combining high-performing modalities yields superior performance compared to unimodal approaches. However, integrating a poor-performing modality with other higher-performing modalities degrades predictive accuracy. These findings demonstrate that multimodal benefit requires selective, performance-guided integration rather than indiscriminate modality combination, with implications for MDL design across computational pathology and medical imaging.
comment: 5 pages, 1 figure, 4 tables
Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation AAAI-26
Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled k-t space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. In this work, we propose MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Furthermore, we introduce a new INR network architecture tailored to the CMR problem, which significantly stabilizes model optimization. Experiments on retrospective (simulated) datasets demonstrate the superiority of MoCo-INR over state-of-the-art methods, achieving fast convergence and fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x in VISTA sampling). Additionally, evaluations on prospective (real-acquired) free-breathing CMR scans highlight the clinical practicality of MoCo-INR for real-time imaging. Several ablation studies further confirm the effectiveness of the critical components of MoCo-INR.
comment: Accepted by AAAI-26
Large-scale modality-invariant foundation models for brain MRI analysis: Application to lesion segmentation
The field of computer vision is undergoing a paradigm shift toward large-scale foundation model pre-training via self-supervised learning (SSL). Leveraging large volumes of unlabeled brain MRI data, such models can learn anatomical priors that improve few-shot performance in diverse neuroimaging tasks. However, most SSL frameworks are tailored to natural images, and their adaptation to capture multi-modal MRI information remains underexplored. This work proposes a modality-invariant representation learning setup and evaluates its effectiveness in stroke and epilepsy lesion segmentation, following large-scale pre-training. Experimental results suggest that despite successful cross-modality alignment, lesion segmentation primarily benefits from preserving fine-grained modality-specific features. Model checkpoints and code are made publicly available.
comment: Submitted to IEEE ISBI 2026
SplineSplat: 3D Ray Tracing for Higher-Quality Tomography
We propose a method to efficiently compute tomographic projections of a 3D volume represented by a linear combination of shifted B-splines. To do so, we propose a ray-tracing algorithm that computes 3D line integrals with arbitrary projection geometries. One of the components of our algorithm is a neural network that computes the contribution of the basis functions efficiently. In our experiments, we consider well-posed cases where the data are sufficient for accurate reconstruction without the need for regularization. We achieve higher reconstruction quality than traditional voxel-based methods.
Boosting Neural Video Representation via Online Structural Reparameterization
Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have focused on architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
comment: 15 pages, 7 figures
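The abstract above says a multi-branch structure is "equivalently converted into a single-branch structure". As a rough illustration of why such a conversion can be exact, the sketch below fuses a parallel 3x3 and 1x1 convolution into one 3x3 convolution by zero-padding the 1x1 kernel and summing weights and biases. This is the generic branch-fusion identity, not the paper's ERB block; the function names `conv2d_same` and `fuse_branches` and all tensor shapes are assumptions made for the example.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Naive 'same'-padded cross-correlation (illustrative, not optimized).

    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k, b: (C_out,).
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]              # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3) + b
    return out

def fuse_branches(w3, b3, w1, b1):
    """Merge a 3x3 branch and a 1x1 branch into a single 3x3 kernel."""
    # Place the 1x1 weights at the centre of an otherwise zero 3x3 kernel.
    w1_padded = np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1)))
    return w3 + w1_padded, b3 + b1

# Hypothetical shapes: 2 input channels, 4 output channels, 5x5 feature map.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 5))
w3, b3 = rng.normal(size=(4, 2, 3, 3)), rng.normal(size=4)
w1, b1 = rng.normal(size=(4, 2, 1, 1)), rng.normal(size=4)

two_branch = conv2d_same(x, w3, b3) + conv2d_same(x, w1, b1)
w_fused, b_fused = fuse_branches(w3, b3, w1, b1)
single_branch = conv2d_same(x, w_fused, b_fused)
print(np.allclose(two_branch, single_branch))
```

Because convolution is linear in its kernel, the fused layer reproduces the summed branch outputs up to floating-point error, which is why the extra branches need not be kept once they have been merged.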
Sensitivity of Finite Element Models to Relationship Between T2 Relaxation and Modulus in Articular Cartilage
Correlating articular cartilage material properties to quantitative magnetic resonance imaging biomarkers is a powerful approach to biofidelic finite element models. However, subject-specific relationships between imaging biomarkers such as T2 and material properties like dynamic modulus are uncertain. To evaluate the sensitivity of finite element models to this uncertainty, we shifted the slope and intercept of a linear T2-dynamic modulus relationship used to define cartilage properties. Modulus shifts led to notable percent changes in the top 1% of calculated stress and strain while modulating slope had a negligible impact, together supporting the use of physiologically relevant moduli ranges in subject-specific models.
comment: 29 pages including supplemental material to manuscript, 6 figures and 1 table in main text
CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening AAAI 2026
Despite remarkable advancements in supervised pansharpening neural networks, these methods face domain adaptation challenges of resolution due to the intrinsic disparity between simulated reduced-resolution training data and real-world full-resolution scenarios. To bridge this gap, we propose an unsupervised pansharpening framework, CLIPPan, that enables model training at full resolution directly by taking CLIP, a visual-language model, as a supervisor. However, directly applying CLIP to supervise pansharpening remains challenging due to its inherent bias toward natural images and limited understanding of pansharpening tasks. Therefore, we first introduce a lightweight fine-tuning pipeline that adapts CLIP to recognize low-resolution multispectral, panchromatic, and high-resolution multispectral images, as well as to understand the pansharpening process. Then, building on the adapted CLIP, we formulate a novel loss integrating semantic language constraints, which aligns image-level fusion transitions with protocol-aligned textual prompts (e.g., Wald's or Khan's descriptions), thus enabling CLIPPan to use language as a powerful supervisory signal and guide fusion learning without ground truth. Extensive experiments demonstrate that CLIPPan consistently improves spectral and spatial fidelity across various pansharpening backbones on real-world datasets, setting a new state of the art for unsupervised full-resolution pansharpening.
comment: Accepted to AAAI 2026
A Deep Learning Framework for Thyroid Nodule Segmentation and Malignancy Classification from Ultrasound Images
Ultrasound-based risk stratification of thyroid nodules is a critical clinical task, but it suffers from high inter-observer variability. While many deep learning (DL) models function as "black boxes," we propose a fully automated, two-stage framework for interpretable malignancy prediction. Our method achieves interpretability by forcing the model to focus only on clinically relevant regions. First, a TransUNet model automatically segments the thyroid nodule. The resulting mask is then used to create a region of interest around the nodule, and this localised image is fed directly into a ResNet-18 classifier. We evaluated our framework using 5-fold cross-validation on a clinical dataset of 349 images, where it achieved a high F1-score of 0.852 for predicting malignancy. To validate its performance, we compared it against a strong baseline using a Random Forest classifier with hand-crafted morphological features, which achieved an F1-score of 0.829. The superior performance of our DL framework suggests that the implicit visual features learned from the localised nodule are more predictive than explicit shape features alone. This is the first fully automated end-to-end pipeline for both detecting thyroid nodules on ultrasound images and predicting their malignancy.
comment: 5 pages, 2 figures, 2 tables
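The second stage described above turns a segmentation mask into a region of interest before classification. A minimal sketch of that cropping step follows, assuming a binary mask and a fixed pixel margin; the function name `crop_to_mask`, the margin value, and the array sizes are hypothetical, and the TransUNet and ResNet-18 models themselves are not reproduced here.

```python
import numpy as np

def crop_to_mask(image, mask, margin=8):
    """Crop `image` to the bounding box of the nonzero region of `mask`.

    image: (H, W) or (H, W, C) array; mask: (H, W) array, nonzero inside the nodule.
    margin: extra pixels kept around the box (an assumed value, clipped to the image).
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask is empty; no region of interest to crop")
    h, w = mask.shape
    r0 = max(int(rows.min()) - margin, 0)
    r1 = min(int(rows.max()) + margin + 1, h)
    c0 = max(int(cols.min()) - margin, 0)
    c1 = min(int(cols.max()) + margin + 1, w)
    return image[r0:r1, c0:c1]

# Toy example: a 10x10 image with a 3x3 "nodule" mask.
image = np.arange(100, dtype=np.float32).reshape(10, 10)
mask = np.zeros((10, 10), dtype=bool)
mask[3:6, 4:7] = True
roi = crop_to_mask(image, mask, margin=1)
print(roi.shape)
```

The cropped array would then be resized and passed to the classifier; clipping the box to the image bounds avoids out-of-range slices when the nodule touches the border.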
Lessons Learned from Developing a Privacy-Preserving Multimodal Wearable for Local Voice-and-Vision Inference
Many promising applications of multimodal wearables require continuous sensing and heavy computation, yet users reject such devices due to privacy concerns. This paper shares our experiences building an ear-mounted voice-and-vision wearable that performs local AI inference using a paired smartphone as a trusted personal edge. We describe the hardware--software co-design of this privacy-preserving system, including challenges in integrating a camera, microphone, and speaker within a 30-gram form factor, enabling wake word-triggered capture, and running quantized vision-language and large-language models entirely offline. Through iterative prototyping, we identify key design hurdles in power budgeting, connectivity, latency, and social acceptability. Our initial evaluation shows that fully local multimodal inference is feasible on commodity mobile hardware with interactive latency. We conclude with design lessons for researchers developing embedded AI systems that balance privacy, responsiveness, and usability in everyday settings.
comment: 7 pages, 5 figures
Weyl-Heisenberg Transform Capabilities in JPEG Compression Standard
This paper is devoted to the development and research of a new compression technology based on Weyl-Heisenberg bases (WH-technology) for modifying the JPEG compression standard and improving its characteristics. For this purpose, the paper analyzes the main stages of the JPEG compression algorithm, notes its key features and problems that limit further enhancement of its efficiency. To overcome these limitations, it is proposed to use the real version of the two-dimensional discrete orthogonal Weyl-Heisenberg transform (DWHT) instead of the discrete cosine transform (DCT) at the stage of transformation coding. This transformation, unlike DCT, initially has a block structure and is built on the basis of the Weyl-Heisenberg optimal signal basis, the functions of which are orthogonal and well localized both in the frequency and time domains. This feature of DWHT allows for more efficient decorrelation and compression of element values in each block of the image after transformation coding. As a result, it is possible to obtain more efficient selection and screening of insignificant elements at the subsequent stages of quantization and information coding. Based on DWHT, a new version of the JPEG compression algorithm was developed, and convenient criteria for evaluating the compression efficiency and metrics of quality losses were proposed. The results of an experimental study are presented, confirming the higher compression efficiency of the proposed algorithm in comparison with the JPEG compression standard.
SCReedSolo: A Secure and Robust LSB Image Steganography Framework with Randomized Symmetric Encryption and Reed-Solomon Coding
Image steganography is an information-hiding technique that involves the surreptitious concealment of covert informational content within digital images. In this paper, we introduce ${\rm SCR{\small EED}S{\small OLO}}$, a novel framework for concealing arbitrary binary data within images. Our approach synergistically leverages Random Shuffling, Fernet Symmetric Encryption, and Reed-Solomon Error Correction Codes to encode the secret payload, which is then discretely embedded into the carrier image using LSB (Least Significant Bit) Steganography. The combination of these methods addresses the vulnerability vectors of both security and resilience against bit-level corruption in the resultant stego-images. We show that our framework achieves a data payload of 3 bits per pixel for an RGB image, and mathematically assess the probability of successful transmission for the amalgamated $n$ message bits and $k$ error correction bits. Additionally, we find that ${\rm SCR{\small EED}S{\small OLO}}$ yields good results upon being evaluated with multiple performance metrics, successfully eludes detection by various passive steganalysis tools, and is immune to simple active steganalysis attacks. Our code and data are available at https://github.com/Starscream-11813/SCReedSolo-Steganography.
comment: Accepted in Proceedings of the 8th Asian Conference on Pattern Recognition (ACPR 2025), 5 pages, 21 figures, 4 tables
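The 3 bits per pixel figure above follows from writing one payload bit into the least significant bit of each of the three colour channels. The sketch below shows only that LSB step on a toy RGB array; the shuffling, Fernet encryption, and Reed-Solomon stages of the framework are deliberately omitted, and the function names `embed_lsb` and `extract_lsb` are assumptions for illustration.

```python
import numpy as np

def embed_lsb(cover, bits):
    """Write `bits` (0/1 values) into the least significant bits of `cover`.

    cover: uint8 array of shape (H, W, 3); capacity is H * W * 3 bits,
    i.e. 3 bits per RGB pixel.
    """
    bits = np.asarray(bits, dtype=np.uint8)
    flat = cover.reshape(-1).copy()
    if bits.size > flat.size:
        raise ValueError("payload exceeds the 3 bits-per-pixel capacity")
    # Clear the lowest bit of each carrier byte, then set it to the payload bit.
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
    return flat.reshape(cover.shape)

def extract_lsb(stego, n_bits):
    """Read back the first `n_bits` least significant bits."""
    return stego.reshape(-1)[:n_bits] & 1

# Toy 4x4 RGB cover image: 16 pixels, so 48 payload bits fit.
rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
payload = rng.integers(0, 2, size=48, dtype=np.uint8)
stego = embed_lsb(cover, payload)
recovered = extract_lsb(stego, payload.size)
print(np.array_equal(recovered, payload))
```

Each carrier byte changes by at most 1, which is what keeps the stego-image visually close to the cover; in the full framework the payload bits would be the encrypted, error-coded message rather than raw data.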
Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects WACV 2026
Image segmentation foundation models (SFMs) like Segment Anything Model (SAM) have achieved impressive zero-shot and interactive segmentation across diverse domains. However, they struggle to segment objects with certain structures, particularly those with dense, tree-like morphology and low textural contrast from their surroundings. These failure modes are crucial for understanding the limitations of SFMs in real-world applications. To systematically study this issue, we introduce interpretable metrics quantifying object tree-likeness and textural separability. On carefully controlled synthetic experiments and real-world datasets, we show that SFM performance (e.g., SAM, SAM 2, HQ-SAM) noticeably correlates with these factors. We attribute these failures to SFMs misinterpreting local structure as global texture, resulting in over-segmentation or difficulty distinguishing objects from similar backgrounds. Notably, targeted fine-tuning fails to resolve this issue, indicating a fundamental limitation. Our study provides the first quantitative framework for modeling the behavior of SFMs on challenging structures, offering interpretable insights into their segmentation capabilities.
comment: Accepted at WACV 2026. Code: https://github.com/mazurowski-lab/SAMFailureMetrics
Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
comment: Accepted for presentation in IEEE Globecom 2025
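The abstract above frames the choice of merging proportions as a trade-off between computational cost and accuracy, summarised by a Pareto frontier. The sketch below shows only the final filtering step: given already-evaluated (cost, accuracy) pairs, keep the non-dominated ones. The Bayesian optimization that would propose configurations is not shown, and the function name `pareto_front` and the numbers are made up for illustration.

```python
def pareto_front(configs):
    """Return indices of non-dominated (cost, accuracy) pairs.

    A configuration is kept when no other one has lower-or-equal cost
    and higher-or-equal accuracy with at least one strict improvement.
    """
    # Sort by ascending cost; break ties by descending accuracy.
    order = sorted(range(len(configs)), key=lambda i: (configs[i][0], -configs[i][1]))
    front, best_acc = [], float("-inf")
    for i in order:
        if configs[i][1] > best_acc:   # strictly better than every cheaper option
            front.append(i)
            best_acc = configs[i][1]
    return front

# Hypothetical (GFLOPs, accuracy) results for five merging configurations.
configs = [(1.0, 0.70), (2.0, 0.80), (2.5, 0.75), (3.0, 0.90), (3.0, 0.85)]
print(pareto_front(configs))
```

At run time a system could pick, from the returned indices, the cheapest configuration that meets the current accuracy target, which is one simple way to realise the adaptation the abstract describes.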
Progress in SPECT and PET Reconstruction for Theranostics: From Diagnosis to Therapy
The theranostic paradigm enables personalization of treatment by selecting patients with a diagnostic radiopharmaceutical and monitoring therapy using a matched therapeutic isotope. This strategy relies on accurate image reconstruction of both pretherapy and post-therapy images for patient selection and treatment monitoring. However, traditional reconstruction methods are hindered by challenges such as crosstalk in multi-isotope imaging and extremely low-count measurement data when imaging alpha-emitting isotopes. Additionally, to fully realize the benefits of new imaging systems being developed for theranostic applications, advanced reconstruction techniques are needed. This review highlights recent progress and discusses critical challenges and unmet needs in theranostic image reconstruction.
Progress and new challenges in image-based profiling
For over two decades, image-based profiling has revolutionized cell phenotype analysis. Image-based profiling processes rich, high-throughput, microscopy data into thousands of unbiased measurements that reveal phenotypic patterns powerful for drug discovery, functional genomics, and cell state classification. Here, we review the evolving computational landscape of image-based profiling, detailing the bioinformatics processes involved from feature extraction to normalization and batch correction. We discuss how deep learning has fundamentally reshaped the field. We examine key methodological advancements, such as single-cell analysis, the development of robust similarity metrics, and the expansion into new modalities like optical pooled screening, temporal imaging, and 3D organoid profiling. We also highlight the growth of public benchmarks and open-source software ecosystems as a key driver for fostering reproducibility and collaboration. Despite these advances, the field still faces substantial challenges, particularly in developing methods for emerging temporal and 3D data modalities, establishing robust quality control standards and workflows, and interpreting the processed features. By focusing on the technical evolution of image-based profiling rather than the wide-ranging biological applications, our aim with this review is to provide researchers with a roadmap for navigating the progress and new challenges in this rapidly advancing domain.
comment: Revisions address reviewer feedback: the abstract was clarified; Fig. 1 now omits company-specific microscopes; a new Table 2 compares the performance between classical vs. deep-learned features; and improved clarity in box 2
StreamDiT: Real-Time Streaming Text-to-Video Generation
Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To put the proposed method into practice, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, generating video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g., streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website: https://cumulo-autumn.github.io/StreamDiT/
Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos
The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotations, which is time-consuming and labor-intensive. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations, including ED and ES indices, segmentation, or volumetric measurements, by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with MAE 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES, despite the fact that the fetal heart model is built using non-standardized heart views due to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code will be released at https://github.com/YingyuYyy/CardiacPhase.
TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types
We present TEyeD, the world's largest unified public data set of eye images taken with head-mounted devices. TEyeD was acquired with seven different head-mounted eye trackers. Among them, two eye trackers were integrated into virtual reality (VR) or augmented reality (AR) devices. The images in TEyeD were obtained from various tasks, including car rides, simulator rides, outdoor sports activities, and daily indoor activities. The data set includes 2D and 3D landmarks, semantic segmentation, 3D eyeball annotation and the gaze vector and eye movement types for all images. Landmarks and semantic segmentation are provided for the pupil, iris and eyelids. Video lengths vary from a few minutes to several hours. With more than 20 million carefully annotated images, TEyeD provides a unique, coherent resource and a valuable foundation for advancing research in the field of computer vision, eye tracking and gaze estimation in modern VR and AR applications. Download: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FTEyeDS&mode=list
comment: Download: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FTEyeDS&mode=list
Fractured Glass, Failing Cameras: Simulating Physics-Based Adversarial Samples for Autonomous Driving Systems AAAI
While much research has recently focused on generating physics-based adversarial samples, a critical yet often overlooked category originates from physical failures within on-board cameras, components essential to the perception systems of autonomous vehicles. Camera failures, whether due to external stresses causing hardware breakdown or internal component faults, can directly jeopardize the safety and reliability of autonomous driving systems. First, we motivate the study with two separate real-world experiments showing that glass failures do cause detection-based neural network models to fail. Second, we develop a simulation-based study using the physical process of glass breakage to create perturbed scenarios, representing a realistic class of physics-based adversarial samples. Using a finite element model (FEM)-based approach, we generate surface cracks on the camera image by applying a stress field defined by particles within a triangular mesh. Lastly, we use physically-based rendering (PBR) techniques to provide realistic visualizations of these physically plausible fractures. To assess the safety implications, we apply the simulated broken glass effects as image filters to two autonomous driving datasets, KITTI and BDD100K, as well as the large-scale image detection dataset MS-COCO. We then evaluate detection failure rates for critical object classes using CNN-based object detection models (YOLOv8 and Faster R-CNN) and a transformer-based architecture with Pyramid Vision Transformers. To further investigate the distributional impact of these visual distortions, we compute the Kullback-Leibler (K-L) divergence between three distinct data distributions, applying various broken glass filters to a custom dataset (captured through a cracked windshield), as well as the KITTI and Kaggle cats and dogs datasets.
comment: Accepted to AAAI
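The abstract above compares data distributions with the Kullback-Leibler divergence. As a reminder of what that quantity measures, the sketch below computes the discrete K-L divergence between two histograms. How the paper builds its distributions from images is not stated in the abstract, so the histogram inputs, the function name `kl_divergence`, and the `eps` floor are assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete K-L divergence D(p || q) in nats.

    p, q: non-negative histograms of equal length; both are normalised here.
    eps: small floor applied to q to avoid division by zero (an assumed choice).
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.clip(np.asarray(q, dtype=np.float64), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    support = p > 0                      # terms with p = 0 contribute nothing
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Toy histograms standing in for clean and filtered image statistics.
clean = [0.5, 0.5]
filtered = [0.25, 0.75]
print(kl_divergence(clean, filtered))
print(kl_divergence(clean, clean))
```

The divergence is zero only when the two histograms coincide and is not symmetric in its arguments, so the order in which the clean and perturbed distributions are passed matters.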
Tubular Curvature Filter: Pointwise Curvature Calculation for Tubular Objects in Images
Purpose: Accurate estimation of blood vessel tortuosity from medical images is an extremely important and challenging task. It is particularly relevant in the context of retinopathy of prematurity (ROP), where the staging of disease severity and consequent therapeutic approaches are heavily informed by the presence and prominence of vessel tortuosity. Existing methods based on centerline or skeleton curvature fail to capture curvature gradients across a rotating tubular structure, thereby limiting their effectiveness in the case of ROP. Methods: This paper defines local tubular curvature and presents the Tubular Curvature Filter (TCF) method, which locally calculates the acceleration of curve bundles traversing a tubular object parallel to its centerline. This is achieved by examining the directional rate of change in the eigenvectors of the Hessian matrix of a tubular intensity function in space. TCF implicitly calculates the local tubular curvature without the need to explicitly segment the tubular object or extract its centerline. Results: Experimental results demonstrate that TCF provides accurate estimates of local curvature at any point inside tubular structures. Results on 2D and 3D images show that TCF discerns curvature differences between the inner and outer sides of curved tubular objects, while centerline-based approaches cannot. Conclusion: Our findings highlight that TCF's ability to discern between the inner and outer sides of curved tubular objects is particularly useful in medical fields that require vasculature curvature analysis from images, especially where vascular structures often have non-uniform diameters, such as in ROP.
comment: 24 pages, 14 figures
Dedelayed: Deleting remote inference delay via on-device correction
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference -- an equivalent improvement to using a model ten times larger.
Multimedia
Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning AAAI 2026
Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi-Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like "Who is Undercover?". MUG reframes MAD as a process of detecting "undercover" agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground-truth for identifying hallucinating agents and enabling robust, crowd-powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross-evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at https://github.com/YongLD/MUG.git.
comment: Accepted by AAAI 2026
AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection, and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embedding. A naive optimization strategy is to selectively focus on and retain the KV caches of audio or video based on the task. However, in our experiments, we observed that the attention of AV-LLMs to the various modalities in the higher layers is not strictly dependent on the task; instead, attention in these layers shifts more towards the video modality. In addition, we found that directly integrating the temporal KV of audio and the spatial-temporal KV of video may lead to information confusion and significant performance degradation of AV-LLMs. If audio and video are processed indiscriminately, it may also lead to excessive compression or retention of a certain modality, thereby disrupting the alignment between modalities. To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLM inference. Our method is based on a layer-adaptive focusing technique that selectively focuses on key modalities according to the characteristics of different layers and enhances the recognition of heavy-hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict the KV cache of low-priority modalities. The experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.
Boosting Neural Video Representation via Online Structural Reparameterization
Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have focused on architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
comment: 15 pages, 7 figures
Accelerating Controllable Generation via Hybrid-grained Cache
Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must handle control conditions and content generation computational requirements, resulting in generally low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained cache (block-level) based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between each step of model reasoning. (2) We design a fine-grained cache (prompt-level) that acts within a module, where the fine-grained cache reuses cross-attention maps within consecutive reasoning steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into each computational link of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, our HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantized performance degradation) within 1.5%.
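As a quick check of the figures quoted above, the reported MAC counts are consistent with the stated 63% reduction, and correspond to roughly a 2.7-fold decrease in compute (the second ratio is derived here from the two quoted numbers, not reported in the abstract):

```latex
\[
\frac{18.22 - 6.70}{18.22} = \frac{11.52}{18.22} \approx 0.632,
\qquad
\frac{18.22}{6.70} \approx 2.72
\]
```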
Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio
Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; while proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.
Weyl-Heisenberg Transform Capabilities in JPEG Compression Standard
This paper is devoted to the development and research of a new compression technology based on Weyl-Heisenberg bases (WH-technology) for modifying the JPEG compression standard and improving its characteristics. For this purpose, the paper analyzes the main stages of the JPEG compression algorithm, notes its key features and problems that limit further enhancement of its efficiency. To overcome these limitations, it is proposed to use the real version of the two-dimensional discrete orthogonal Weyl-Heisenberg transform (DWHT) instead of the discrete cosine transform (DCT) at the stage of transformation coding. This transformation, unlike DCT, initially has a block structure and is built on the basis of the Weyl-Heisenberg optimal signal basis, the functions of which are orthogonal and well localized both in the frequency and time domains. This feature of DWHT allows for more efficient decorrelation and compression of element values in each block of the image after transformation coding. As a result, it is possible to obtain more efficient selection and screening of insignificant elements at the subsequent stages of quantization and information coding. Based on DWHT, a new version of the JPEG compression algorithm was developed, and convenient criteria for evaluating the compression efficiency and metrics of quality losses were proposed. The results of an experimental study are presented, confirming the higher compression efficiency of the proposed algorithm in comparison with the JPEG compression standard.
HI-TransPA: Hearing Impairments Translation Personal Assistant
Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode the lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise
Multimodal learning faces two major challenges: modality imbalance and data noise, which significantly affect the robustness and generalization ability of models. Existing methods achieve modality balance by suppressing dominant modalities, but they neglect the inherent differences in the information value between modalities, potentially leading to convergence to suboptimal solutions. This paper proposes an innovative modality compression paradigm, Contribution-Guided Asymmetric Learning (CAL), which aims to enhance the contribution of high-contribution modalities while compressing weak modalities to increase their contribution, allowing both to improve the performance of multimodal information fusion. CAL is based on a modality contribution metric W^m combining the information quantity I(m) and confidence D(m), and it designs an asymmetric gradient acceleration mechanism and a contribution-aware Asymmetric Information Bottleneck (AIB) compression mechanism. The former accelerates the gradient update of modalities, while the latter dynamically compresses the noise of low-contribution modalities. On five benchmark datasets, including emotion recognition, scene recognition, and event localization tasks, CAL has shown outstanding performance in imbalanced fusion tasks and noise robustness tests. On CREMA-D, KS, and AVE, CAL achieves 79.30%, 74.82%, and 74.21% accuracy, significantly outperforming the existing state-of-the-art model ARL. In high-noise robustness tests, CAL also achieved leading performance under various attack strategies on the MVSA-Single and NYUD2 datasets. These results validate the significant advantages of CAL in modality imbalance and noise interference. CAL, as a flexible and efficient framework, is easy to transfer to other tasks and has broad adaptability and potential application prospects.
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address the challenges, we propose Video Echoed in Music (VeM), a latent music diffusion that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs a hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining temporal coherence through position and duration encoding. For rhythmic precision, the frame-level transition-beat aligner and adapter (TB-As) dynamically synchronize visual scene transitions with music beats. We further contribute a novel video-music paired dataset sourced from e-commerce advertisements and video-sharing platforms, which imposes stricter transition-beat synchronization requirements. Meanwhile, we introduce novel metrics tailored to the task. Experimental results demonstrate superiority, particularly in semantic relevance and rhythmic precision.
DRAGON: Distributional Rewards Optimize Diffusion Generative Models
We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modal encoders such as CLAP are used, the reference may be of a different modality (text versus audio). Then, DRAGON gathers online and on-policy generations, scores them with the reward function to construct a positive demonstration set and a negative set, and leverages the contrast between the two finite sets to approximate distributional reward optimization. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. DRAGON is a new approach to designing and optimizing reward functions for improving human-perceived quality. Demos at https://ml-dragon.github.io/web
comment: Accepted to TMLR
Computation and Language
Optimizing Mixture of Block Attention
Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA's performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA's underlying mechanics. Our model reveals that performance critically depends on the router's ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals, which enhances routing accuracy. While theoretically better, small block sizes are inefficient on GPUs. To bridge this gap, we introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution even with the small block sizes our theory recommends. We validate our insights by training LLMs from scratch, showing that our improved MoBA models match the performance of dense attention baselines. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making our theoretically-grounded improvements practical. Code is available at: https://github.com/mit-han-lab/flash-moba.
comment: The first two authors contributed equally to this work
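The routing step that the MoBA abstract analyzes, in which a query picks a few key-value blocks by query-key affinity and attends only within them, can be sketched as follows. This is a toy illustration, not the FlashMoBA kernel. Scoring each block by the query's dot product with the block's mean key is an assumption made here for concreteness, and causal masking and the short key convolution are omitted. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def route_blocks(query, keys, block_size, top_k):
    """Score each key block by query . mean(block keys); return the top_k block indices."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scores = [dot(query, mean_vector(b)) for b in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the keys and values of the routed blocks."""
    chosen = route_blocks(query, keys, block_size, top_k)
    idx = [
        i
        for b in chosen
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total for d in range(dim)]


# Toy data: three blocks of two keys each; only the middle block aligns with the query.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.8, 0.2], [-1.0, 0.0], [-1.0, 0.0]]
values = [[float(i)] for i in range(len(keys))]

chosen = route_blocks(query, keys, block_size=2, top_k=1)
out = sparse_attention(query, keys, values, block_size=2, top_k=1)
```

With `top_k=1` the router selects the middle block, so the output is a weighted average of the values at positions 2 and 3 only. Smaller blocks make each block's pooled score less diluted by irrelevant keys, which is in the spirit of the signal-to-noise argument in the abstract.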
PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping AAAI 2026
The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining the alignment. For the pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
comment: Accepted to AAAI 2026 AI Alignment Track
W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search AAAI 2026
Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging a weak model's real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during the strong model's generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19 on the summarization task, a relative improvement of 15.9%.
comment: AAAI 2026 Oral
Proactive Hearing Assistants that Isolate Egocentric Conversations EMNLP 2025
We introduce proactive hearing assistants that automatically identify and separate the wearer's conversation partners, without requiring explicit prompts. Our system operates on egocentric binaural audio and uses the wearer's self-speech as an anchor, leveraging turn-taking behavior and dialogue dynamics to infer conversational partners and suppress others. To enable real-time, on-device operation, we propose a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model runs less frequently to capture longer-range conversational dynamics. Results on real-world 2- and 3-speaker conversation test sets, collected with binaural egocentric hardware from 11 participants totaling 6.8 hours, show generalization in identifying and isolating conversational partners in multi-conversation settings. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement. More information can be found on our website: https://proactivehearing.cs.washington.edu/
comment: Accepted at EMNLP 2025 Main Conference
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
MajinBook: An open catalogue of digital world literature with likes
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
comment: 9 pages, 5 figures, 1 table
Studies with impossible languages falsify LMs as models of human language
According to Futrell and Mahowald [arXiv:2501.17047], both infants and language models (LMs) find attested languages easier to learn than impossible languages that have unnatural structures. We review the literature and show that LMs often learn attested and many impossible languages equally well. Impossible languages that are difficult to learn are simply more complex (or random). LMs are missing human inductive biases that support language acquisition.
comment: Commentary on Futrell, R., & Mahowald, K. arXiv:2501.17047 (in press). How linguistics learned to stop worrying and love the language models. Behavioural and Brain Sciences
On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization
On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
comment: Conference submission; Under review
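The forward-only gradient estimate that the MeZO abstract refers to can be illustrated with a two-point estimator whose random direction is regenerated from a seed instead of being stored. This is a minimal sketch on a toy quadratic loss, not the paper's experimental setup. The loss, step size, perturbation scale, and function names are all assumptions chosen for illustration.

```python
import random


def loss(theta):
    """Toy quadratic loss with its minimum at (3, -2); stands in for a forward pass."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 2.0) ** 2


def perturb(theta, seed, scale):
    """Add scale * z to theta in place, where z ~ N(0, I) is regenerated from the seed."""
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] += scale * rng.gauss(0.0, 1.0)


def zo_step(theta, seed, eps=1e-3, lr=0.05):
    """One zeroth-order SGD step: two forward evaluations, no stored activations."""
    perturb(theta, seed, +eps)            # theta + eps * z
    loss_plus = loss(theta)
    perturb(theta, seed, -2.0 * eps)      # theta - eps * z
    loss_minus = loss(theta)
    perturb(theta, seed, +eps)            # back to the original theta
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    perturb(theta, seed, -lr * projected_grad)  # theta -= lr * projected_grad * z
    return projected_grad


theta = [0.0, 0.0]
initial_loss = loss(theta)
for step in range(500):
    zo_step(theta, seed=step)
final_loss = loss(theta)
```

Only the parameters and a seed are held in memory at any time, which is the source of the memory saving. The price is a noisy gradient estimate along one random direction per step, consistent with the abstract's caveat about longer wall-clock time.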
M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text
The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.
LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models
The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce LaoBench, the first large-scale, high-quality, and multidimensional benchmark dataset dedicated to assessing LLMs' comprehensive language understanding and reasoning abilities in Lao. LaoBench comprises over 17,000 carefully curated samples spanning three core dimensions: knowledge application, K12 foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is divided into open-source and closed-source subsets, with the closed-source portion enabling black-box evaluation on an official platform to ensure fairness and data security. Our data construction pipeline integrates expert human curation with automated agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational value. Benchmarking multiple state-of-the-art LLMs on LaoBench reveals that current models still face significant challenges in mastering Lao across diverse tasks. We hope LaoBench will catalyze further research and development of AI technologies for underrepresented Southeast Asian languages.
NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery
Digitized histopathology analysis involves complex, time-intensive workflows and specialized expertise, limiting its accessibility. We introduce NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain-specific tools (e.g., nuclei segmentation, whole-slide encoding) built on open-source software, and can also create new tools ad hoc. To evaluate such systems, we present SlideQuest, a 90-question benchmark -- verified by pathologists and biomedical scientists -- spanning data processing, quantitative analysis, and hypothesis testing. Unlike prior biomedical benchmarks focused on knowledge recall or diagnostic QA, SlideQuest demands multi-step reasoning, iterative coding, and computational problem solving. Quantitative evaluation shows NOVA outperforms coding-agent baselines, and a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating its scalable discovery potential.
LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models
Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit their accessibility for many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs (~3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.
iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference AAAI 2026
Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
comment: Accepted in AAAI 2026 (Oral)
Building the Web for Agents: A Declarative Framework for Agent-Web Interaction
The increasing deployment of autonomous AI agents on the web is hampered by a fundamental misalignment: agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions. To address this, we introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents through simple, declarative HTML elements. VOIX introduces <tool> and <context> tags, allowing developers to explicitly define available actions and relevant state, thereby creating a clear, machine-readable contract for agent behavior. This approach shifts control to the website developer while preserving user privacy by disconnecting the conversational interactions from the website. We evaluated the framework's practicality, learnability, and expressiveness in a three-day hackathon study with 16 developers. The results demonstrate that participants, regardless of prior experience, were able to rapidly build diverse and functional agent-enabled web applications. Ultimately, this work provides a foundational mechanism for realizing the Agentic Web, enabling a future of seamless and secure human-AI collaboration on the web.
comment: for associated documentation, see https://svenschultze.github.io/VOIX/
Language-Aided State Estimation
Natural language data, such as text and speech, have become readily available through social networking services and chat platforms. By leveraging human observations expressed in natural language, this paper addresses the problem of state estimation for physical systems, in which humans act as sensing agents. To this end, we propose a Language-Aided Particle Filter (LAPF), a particle filter framework that structures human observations via natural language processing and incorporates them into the update step of the state estimation. Finally, the LAPF is applied to the water level estimation problem in an irrigation canal and its effectiveness is demonstrated.
comment: 7 pages, 5 figures, submitted to IFAC World Congress 2026 with Journal option (IFAC Journal of Systems and Control)
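The update step mentioned in the LAPF abstract above can be illustrated with a generic particle-filter reweighting. This is only a minimal sketch under stated assumptions, not the authors' method: the mapping from a structured report ("low", "high") to a Gaussian observation model, the numeric values, and all function names are hypothetical.

```python
from math import exp

def gaussian_likelihood(center, spread):
    """Return a likelihood function peaked at `center` (hypothetical observation model)."""
    def likelihood(state):
        return exp(-0.5 * ((state - center) / spread) ** 2)
    return likelihood

def update_weights(particles, weights, likelihood):
    """Generic update step: reweight each particle by the observation likelihood, then normalize."""
    unnormalized = [w * likelihood(x) for x, w in zip(particles, weights)]
    total = sum(unnormalized)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        return [1.0 / len(particles)] * len(particles)
    return [u / total for u in unnormalized]

# Assumed mapping from a language-derived label to an observation model (illustrative values).
REPORT_MODELS = {
    "low": gaussian_likelihood(0.5, 0.3),
    "high": gaussian_likelihood(2.0, 0.3),
}

particles = [0.5, 1.0, 2.0]          # candidate water levels (arbitrary units)
weights = [1.0 / 3] * 3              # uniform prior weights
posterior = update_weights(particles, weights, REPORT_MODELS["high"])
estimate = sum(w * x for w, x in zip(posterior, particles))  # weighted-mean state estimate
```

After a "high" report, most of the weight moves to the particle nearest the assumed center of that label, so the estimate shifts toward it.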
SQuaD: The Software Quality Dataset
Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on limited dimensions, such as code smells, technical debt, or refactoring activity, thereby restricting comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools, i.e., SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef, our dataset unifies over 700 unique metrics at method, class, file, and project levels. Covering a total of 63,586 analyzed project releases, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics proven to enhance Just-In-Time (JIT) defect prediction. The SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling to support the continuous evolution of software analytics. The dataset is publicly available on ZENODO (DOI: 10.5281/zenodo.17566690).
Discovering Meaningful Units with Visually Grounded Semantics from Image Captions
Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.
KGQuest: Template-Driven QA Generation from Knowledge Graphs with LLM-Based Refinement
The generation of questions and answers (QA) from knowledge graphs (KG) plays a crucial role in the development and testing of educational platforms, dissemination tools, and large language models (LLM). However, existing approaches often struggle with scalability, linguistic quality, and factual consistency. This paper presents a scalable and deterministic pipeline for generating natural language QA from KGs, with an additional refinement step using LLMs to further enhance linguistic quality. The approach first clusters KG triplets based on their relations, creating reusable templates through natural language rules derived from the entity types of objects and relations. A module then leverages LLMs to refine these templates, improving clarity and coherence while preserving factual accuracy. Finally, the instantiation of answer options is achieved through a selection strategy that introduces distractors from the KG. Our experiments demonstrate that this hybrid approach efficiently generates high-quality QA pairs, combining scalability with fluency and linguistic precision.
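The pipeline outlined in the KGQuest abstract above (cluster triplets by relation, fill a per-relation template, draw distractors from the KG) might look roughly like the following. This is a hedged sketch, not the paper's implementation: the template strings, the function names, and the distractor rule (other objects of the same relation, in sorted order) are assumptions made for illustration, and the LLM refinement step is omitted.

```python
from collections import defaultdict

def cluster_by_relation(triples):
    """Group (subject, relation, object) triples by their relation."""
    clusters = defaultdict(list)
    for subject, relation, obj in triples:
        clusters[relation].append((subject, obj))
    return dict(clusters)

def make_question(subject, relation, answer, clusters, templates, num_distractors=2):
    """Instantiate the relation's template and pick distractors from the same cluster."""
    question = templates[relation].format(subject=subject)
    # Assumed selection rule: other objects of the same relation, chosen deterministically.
    candidates = sorted({obj for _, obj in clusters[relation] if obj != answer})
    options = sorted([answer] + candidates[:num_distractors])
    return {"question": question, "options": options, "answer": answer}

triples = [
    ("France", "capital", "Paris"),
    ("Germany", "capital", "Berlin"),
    ("Italy", "capital", "Rome"),
    ("France", "currency", "Euro"),
]
templates = {"capital": "What is the capital of {subject}?"}  # hypothetical template

clusters = cluster_by_relation(triples)
qa = make_question("France", "capital", "Paris", clusters, templates)
```

Keeping the selection deterministic matches the abstract's emphasis on a deterministic pipeline; a real system would likely use a more careful distractor strategy.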
LANE: Lexical Adversarial Negative Examples for Word Sense Disambiguation
Fine-grained word meaning resolution remains a critical challenge for neural language models (NLMs) as they often overfit to global sentence representations, failing to capture local semantic details. We propose a novel adversarial training strategy, called LANE, to address this limitation by deliberately shifting the model's learning focus to the target word. This method generates challenging negative training examples through the selective marking of alternate words in the training set. The goal is to force the model to create greater separability between identical sentences with different marked words. Experimental results on lexical semantic change detection and word sense disambiguation benchmarks demonstrate that our approach yields more discriminative word representations, improving performance over standard contrastive learning baselines. We further provide qualitative analyses showing that the proposed negatives lead to representations that better capture subtle meaning differences even in challenging environments. Our method is model-agnostic and can be integrated into existing representation learning frameworks.
Adverbs Revisited: Enhancing WordNet Coverage of Adverbs with a Supersense Taxonomy
WordNet offers rich supersense hierarchies for nouns and verbs, yet adverbs remain underdeveloped, lacking a systematic semantic classification. We introduce a linguistically grounded supersense typology for adverbs, empirically validated through annotation, that captures major semantic domains including manner, temporal, frequency, degree, domain, speaker-oriented, and subject-oriented functions. Results from a pilot annotation study demonstrate that these categories provide broad coverage of adverbs in natural text and can be reliably assigned by human annotators. Incorporating this typology extends WordNet's coverage, aligns it more closely with linguistic theory, and facilitates downstream NLP applications such as word sense disambiguation, event extraction, sentiment analysis, and discourse modeling. We present the proposed supersense categories, annotation outcomes, and directions for future work.
Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning AAAI 2026
Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi-Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like "Who is Undercover?". MUG reframes MAD as a process of detecting "undercover" agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground-truth for identifying hallucinating agents and enabling robust, crowd-powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross-evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at https://github.com/YongLD/MUG.git.
comment: Accepted by AAAI 2026
PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases
Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP's stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.
comment: 8 pages, accepted as short paper at MMM 2026
Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP$^{2}$ also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion
With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes' implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.
AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
Analysing Personal Attacks in U.S. Presidential Debates
Personal attacks have become a notable feature of U.S. presidential debates and play an important role in shaping public perception during elections. Detecting such attacks can improve transparency in political discourse and provide insights for journalists, analysts and the public. Advances in deep learning and transformer-based models, particularly BERT and large language models (LLMs) have created new opportunities for automated detection of harmful language. Motivated by these developments, we present a framework for analysing personal attacks in U.S. presidential debates. Our work involves manual annotation of debate transcripts across the 2016, 2020 and 2024 election cycles, followed by statistical and language-model based analysis. We investigate the potential of fine-tuned transformer models alongside general-purpose LLMs to detect personal attacks in formal political speech. This study demonstrates how task-specific adaptation of modern language models can contribute to a deeper understanding of political communication.
comment: 13 pages
CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation ICASSP 2026
Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.
comment: Submitted to ICASSP 2026
Can LLMs Detect Their Own Hallucinations?
Large language models (LLMs) can generate fluent responses, but sometimes hallucinate facts. In this paper, we investigate whether LLMs can detect their own hallucinations. We formulate hallucination detection as a sentence classification task. We propose a framework for estimating LLMs' capability of hallucination detection and a classification method using Chain-of-Thought (CoT) to extract knowledge from their parameters. The experimental results indicated that GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations. We concluded that LLMs with CoT can detect hallucinations if sufficient knowledge is contained in their parameters.
comment: 8 pages
S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation
Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + \mu$, where $\mu$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by $9.7\sigma$ on retrieval tasks, $3.1\sigma$ on classification tasks, and $0.8\sigma$ on other types of tasks. Renormalization has two variants: directly subtracting $\mu$ from $e$, or subtracting the projection of $e$ onto $\mu$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
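The two variants named in the abstract above can be written out in a few lines. This is a minimal pure-Python sketch, not the authors' code: it assumes the shared bias is estimated as the component-wise mean of a sample of embeddings, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vector(embeddings):
    """Estimate the shared bias mu as the component-wise mean (an assumption)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def subtract_mean(e, mu):
    """Variant 1: subtract mu directly from e."""
    return [x - m for x, m in zip(e, mu)]

def remove_projection(e, mu):
    """Variant 2: subtract the projection of e onto mu, i.e. e - (e.mu / mu.mu) * mu."""
    scale = dot(e, mu) / dot(mu, mu)
    return [x - scale * m for x, m in zip(e, mu)]

embeddings = [[1.0, 2.0], [3.0, 2.0]]
mu = mean_vector(embeddings)                   # [2.0, 2.0]
v1 = subtract_mean(embeddings[0], mu)          # [-1.0, 0.0]
v2 = remove_projection(embeddings[0], mu)      # [-0.5, 0.5], orthogonal to mu
```

The second variant leaves a vector with no component along the bias direction, whereas the first only shifts the origin; the abstract reports that the projection variant performs better.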
Automata-Based Steering of Large Language Models for Diverse Structured Generation
Large language models (LLMs) are increasingly tasked with generating structured outputs. While structured generation methods ensure validity, they often lack output diversity, a critical limitation that we confirm in our preliminary study. We propose a novel method to enhance diversity in automaton-based structured generation. Our approach utilizes automata traversal history to steer LLMs towards novel structural patterns. Evaluations show our method significantly improves structural and content diversity while maintaining comparable generation efficiency. Furthermore, we conduct a case study showcasing the effectiveness of our method in generating diverse test cases for testing open-source libraries.
comment: ICFEM 2025 (Best Paper Award)
When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.
comment: 36 pages
CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology
Biomedical text embeddings have primarily been developed using research literature from PubMed, yet clinical cardiology practice relies heavily on procedural knowledge and specialized terminology found in comprehensive textbooks rather than research abstracts. This research-practice gap limits the effectiveness of existing embedding models for clinical applications in cardiology. This study trained CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B, using contrastive learning on a curated corpus of seven comprehensive cardiology textbooks totaling approximately 150,000 sentences after deduplication. The model employs InfoNCE loss with in-batch negatives and achieves 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks, a +15.94 percentage point improvement over MedTE, the current state-of-the-art medical embedding model. On MTEB medical benchmarks, the model obtained BIOSSES 0.77 Spearman and SciFact 0.61 NDCG@10, indicating competitive performance on related biomedical domains. Domain-specialized training on comprehensive clinical textbooks yields near-perfect cardiology retrieval (99.60% Acc@1), improving over MedTE by +15.94 percentage points.
comment: 14 pages, 6 figures
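For readers unfamiliar with the loss named in the CardioEmbed abstract above, InfoNCE with in-batch negatives can be sketched as follows. This is a generic textbook formulation in pure Python, not the authors' training code; the temperature value and the function name are assumptions.

```python
from math import exp, log

def info_nce_loss(sim, temperature=0.05):
    """InfoNCE with in-batch negatives.

    sim[i][j] is the similarity between anchor i and candidate j; the diagonal
    holds the matched (positive) pairs and every other column in a row acts as
    a negative for that anchor.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + log(sum(exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / n

# Well-separated pairs give a lower loss than uninformative similarities.
good = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], temperature=1.0)
flat = info_nce_loss([[0.0, 0.0], [0.0, 0.0]], temperature=1.0)
```

With a batch of size n and uninformative similarities, the loss equals log(n), which gives a useful reference point when reading training curves.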
Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D
Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
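The accuracy figures in the abstract above can be sanity-checked against the stated dataset size. The short calculation below is only an arithmetic check: the per-model counts of correct diagnoses are inferred from the reported percentages and the 176 cases, not stated in the abstract.

```python
TOTAL_CASES = 176  # dataset size stated in the abstract

def implied_correct(accuracy_percent, total=TOTAL_CASES):
    """Number of correct answers implied by a reported accuracy (inferred, not reported)."""
    return round(accuracy_percent / 100 * total)

low = implied_correct(16.48)    # 29 cases
high = implied_correct(38.64)   # 68 cases

# Recomputing the percentages from the inferred counts reproduces the reported values.
low_pct = round(low / TOTAL_CASES * 100, 2)     # 16.48
high_pct = round(high / TOTAL_CASES * 100, 2)   # 38.64
ratio = round(38.64 / 16.48, 1)                 # 2.3
```

The reported range is therefore consistent with roughly 29 and 68 correct diagnoses out of 176, and the quoted 2.3 times improvement is the ratio of the two accuracies.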
Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy
This paper explores the automatic classification of exam questions and learning outcomes according to Bloom's Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories - Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation - was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (for example, synonym replacement and word embeddings). Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 scores with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially avoided it but began to show signs of overfitting as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom's Taxonomy classification.
comment: 7 Pages
Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions
While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[\#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.
Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering AAAI 2026
Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on a specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves accuracy by up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.
comment: Accepted by AAAI 2026
From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.
comment: 19 pages, 5 figures
MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking AACL
Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths -- i.e., from general to specific -- in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantic-rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.
comment: Accepted at AACL-IJCNLP 2025(main)
A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge
Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying detailed factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning that the prompt format influences their responses more than the semantics of the negative response does. For a fine-grained study of negative bias, we introduce a pipeline for constructing the evaluation set, which systematically categorizes the dataset into three subsets based on the model's parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an "I don't know" option generally reduces negative bias, whereas chain-of-thought prompting tends to amplify the bias. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.
comment: Accepted to IEEE Transactions on Audio, Speech and Language Processing
ICX360: In-Context eXplainability 360 Toolkit
Large Language Models (LLMs) have become ubiquitous in everyday life and are entering higher-stakes applications ranging from summarizing meeting transcripts to answering doctors' questions. As was the case with earlier predictive models, it is crucial that we develop tools for explaining the output of LLMs, be it a summary, list, response to a question, etc. With these needs in mind, we introduce In-Context Explainability 360 (ICX360), an open-source Python toolkit for explaining LLMs with a focus on the user-provided context (or prompts in general) that are fed to the LLMs. ICX360 contains implementations for three recent tools that explain LLMs using both black-box and white-box methods (via perturbations and gradients respectively). The toolkit, available at https://github.com/IBM/ICX360, contains quick-start guidance materials as well as detailed tutorials covering use cases such as retrieval augmented generation, natural language generation, and jailbreaking.
comment: 14 pages, 4 figures
From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.
comment: 11 pages, 3 figures. Under review at IWSDS 2026
Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness
Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging the demonstrations within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by retrieving pertinent information from an established database. Nonetheless, existing research lacks a rigorous evaluation of the impact of RALs on different biomedical NLP tasks. This deficiency makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs of RALs are affected by retrieved unlabeled, counterfactual, or diverse knowledge, which is common in the real world but not well studied in the biomedical domain. Finally, exploring self-awareness is also crucial for RAL systems. In this paper, we systematically investigate the impact of RALs on 5 different biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs on four fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess the RALs' performance on different biomedical NLP tasks and establish four different testbeds based on the aforementioned fundamental abilities. We then evaluate 3 representative LLMs with 3 different retrievers on 5 tasks over 9 datasets.
LDC: Learning to Generate Research Idea with Dynamic Control
Recent advancements in large language models (LLMs) have demonstrated their potential in automating scientific research ideation. Existing approaches primarily focus on prompting techniques, often producing ideas misaligned with expert standards - novelty, feasibility, and effectiveness, which are widely recognized by the research community as the three key subdimensions of high-quality ideas. Also, balancing these dimensions remains challenging due to their inherent trade-offs. To address these limitations, we propose the first framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL) for the task. In the SFT stage, the model learns foundational patterns from pairs of research papers and their corresponding follow-up ideas. In the RL stage, multi-dimensional reward models guided by fine-grained feedback evaluate and optimize the model across key dimensions. During inference, dimensional controllers coordinated by a sentence-level decoder enable dynamic context-aware steering of the idea generation process. Our framework provides a balanced approach to research idea generation, achieving high-quality outcomes in the experiment by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.
HI-TransPA: Hearing Impairments Translation Personal Assistant
Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode the lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
FakeZero: Real-Time, Privacy-Preserving Misinformation Detection for Facebook and X
Social platforms distribute information at unprecedented speed, which in turn accelerates the spread of misinformation and threatens public discourse. We present FakeZero, a fully client-side, cross-platform browser extension that flags unreliable posts on Facebook and X (formerly Twitter) while the user scrolls. All computation, DOM scraping, tokenization, Transformer inference, and UI rendering run locally through the Chromium messaging API, so no personal data leaves the device. FakeZero employs a three-stage training curriculum: baseline fine-tuning and domain-adaptive training enhanced with focal loss, adversarial augmentation, and post-training quantization. Evaluated on a dataset of 239,000 posts, the DistilBERT-Quant model (67.6 MB) reaches 97.1% macro-F1, 97.4% accuracy, and an AUROC of 0.996, with a median latency of approximately 103 ms on a commodity laptop. A memory-efficient TinyBERT-Quant variant retains 95.7% macro-F1 and 96.1% accuracy while shrinking the model to 14.7 MB and lowering latency to approximately 40 ms, showing that high-quality fake-news detection is feasible under tight resource budgets with only modest performance loss. By providing inline credibility cues, the extension can serve as a valuable tool for policymakers seeking to curb the spread of misinformation across social networks. With user consent, FakeZero also opens the door for researchers to collect large-scale datasets of fake news in the wild, enabling deeper analysis and the development of more robust detection techniques.
comment: Accepted for publication in the Proceedings of the 24th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2025) Privacy track, 11 pages, 8 figures
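The FakeZero abstract above mentions domain-adaptive training enhanced with focal loss but does not give the formula or hyperparameters. As a rough illustration only, the standard per-example focal loss, FL = -alpha * (1 - p_t)^gamma * log(p_t), can be sketched as follows; the function name and the default values of `gamma` and `alpha` are assumptions, not settings reported by the authors.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Standard focal loss for one example.

    p_true is the predicted probability of the correct class.
    gamma and alpha are illustrative defaults (assumed, not from the paper).
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


# A confidently correct post contributes far less loss than a hard one.
easy = focal_loss(0.95)
hard = focal_loss(0.30)
```

With `gamma = 0` the expression reduces to ordinary cross-entropy; larger values of `gamma` shrink the contribution of examples the model already classifies well.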
Metric Learning Encoding Models: A Multivariate Framework for Interpreting Neural Representations
Understanding how explicit theoretical features are encoded in opaque neural systems is a central challenge now common to neuroscience and AI. We introduce Metric Learning Encoding Models (MLEMs) to address this challenge most directly as a metric learning problem: we fit the distance in the space of theoretical features to match the distance in neural space. Our framework improves on univariate encoding and decoding methods by building on second-order isomorphism methods, such as Representational Similarity Analysis, and extends them by learning a metric that efficiently models features as well as interactions between them. The effectiveness of MLEM is validated through two sets of simulations. First, MLEMs recover ground-truth importance features in synthetic datasets better than state-of-the-art methods, such as Feature Reweighted RSA (FR-RSA). Second, we deploy MLEMs on real language data, where they show stronger robustness to noise in calculating the importance of linguistic features (gender, tense, etc.). MLEMs are applicable to any domain where theoretical features can be identified, such as language, vision, audition, etc. We release optimized code applicable to measure feature importance in the representations of any artificial neural networks or empirical neural data at https://github.com/LouisJalouzot/MLEM.
comment: 30 pages, 20 figures
CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition
Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
comment: Accepted by IEEE ASRU 2025
New News: System-2 Fine-tuning for Robust Integration of New Knowledge
Humans and intelligent animals can internalize new information and accurately internalize its implications to perform downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the information (news) is explicitly given as context, adequately integrating the information into model weights via fine-tuning remains challenging. In this paper, we introduce New News, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. First, we demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our dataset. To address this gap, we explore a suite of self-play data generation protocols -- paraphrases, implications, and Self-QA -- designed to distill the knowledge processed by the model with context into the weights of the model, which we term System-2 Fine-tuning (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the Self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news while preserving general capabilities. Furthermore, we discover the contextual shadowing effect, where training with the news in context followed by its rephrases or QAs catastrophically degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.
Latent Principle Discovery for Language Model Self-Improvement NeurIPS 2025
When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes that guide model reasoning toward human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ a form of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains that our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.
comment: Accepted at NeurIPS 2025
Leveraging NTPs for Efficient Hallucination Detection in VLMs AACL
Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
comment: Accepted to The First Workshop on Confabulation, Hallucinations, & Overgeneration in Multilingual & Precision-critical Setting - AACL-IJCNLP2025
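The abstract above describes training lightweight ML models on signals derived from next-token probabilities (NTPs), without listing the exact features. A minimal sketch of that idea, assuming simple summary statistics as features and substituting a hypothetical fixed threshold for the trained classifier, might look like this; every name and the threshold value are illustrative assumptions.

```python
import math
from statistics import mean


def ntp_features(token_probs):
    """Summarize the next-token probabilities of one generated statement.

    The three statistics are assumed example features, not the paper's set.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    logs = [math.log(max(p, 1e-12)) for p in token_probs]
    return {
        "mean_prob": mean(token_probs),
        "min_prob": min(token_probs),
        "mean_logprob": mean(logs),
    }


def flag_hallucination(token_probs, min_prob_threshold=0.1):
    """Hypothetical stand-in for a trained model: flag low-confidence tokens."""
    return ntp_features(token_probs)["min_prob"] < min_prob_threshold
```

In practice the feature dictionaries would be fed to a conventional classifier trained on the human-annotated statements; the threshold rule here only shows how low NTP values map to a hallucination signal.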
Emotions, Context, and Substance Use in Adolescents: A Large Language Model Analysis of Reddit Posts
Early substance use during adolescence increases the risk of later substance use disorders and mental health problems, yet the emotional and contextual factors driving these behaviors remain poorly understood. This study analyzed 23000 substance-use related posts and an equal number of non-substance posts from Reddit's r/teenagers community (2018-2022). Posts were annotated for six discrete emotions (sadness, anger, joy, guilt, fear, disgust) and contextual factors (family, peers, school) using large language models (LLMs). Statistical analyses compared group differences, and interpretable machine learning (SHAP) identified key predictors of substance-use discussions. LLM-assisted thematic coding further revealed latent psychosocial themes linking emotions with contexts. Negative emotions, especially sadness, guilt, fear, and disgust, were significantly more common in substance-use posts, while joy dominated non-substance discussions. Guilt and shame diverged in function: guilt often reflected regret and self-reflection, whereas shame reinforced risky behaviors through peer performance. Peer influence emerged as the strongest contextual factor, closely tied to sadness, fear, and guilt. Family and school environments acted as both risk and protective factors depending on relational quality and stress levels. Overall, adolescent substance-use discussions reflected a dynamic interplay of emotion, social context, and coping behavior. By integrating statistical analysis, interpretable models, and LLM-based thematic exploration, this study demonstrates the value of mixed computational approaches for uncovering the emotional and contextual mechanisms underlying adolescent risk behavior.
comment: 19 pages, 5 figures
Computational Analysis of Gender Depiction in the Comedias of Calderón de la Barca
In theatre, playwrights use the portrayal of characters to explore culturally based gender norms. In this paper, we develop quantitative methods to study gender depiction in the non-religious works (comedias) of Pedro Calderón de la Barca, a prolific 17th-century Spanish author. We gather insights from a corpus of more than 100 plays by using a gender classifier and applying model explainability (attribution) methods to determine which text features are most influential in the model's decision to classify speech as 'male' or 'female', indicating the most gendered elements of dialogue in Calderón's comedias in a human-accessible manner. We find that female and male characters are portrayed differently and can be identified by the gender prediction model at practically useful accuracies (up to f=0.83). Analysis reveals semantic aspects of gender portrayal, and demonstrates that the model is even useful in providing a relatively accurate scene-by-scene prediction of cross-dressing characters.
FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA AAAI 2026
Fine-tuning large language models (LLMs) in federated settings enables privacy-preserving adaptation but suffers from cross-client interference due to model aggregation. Existing federated LoRA fine-tuning methods, primarily based on FedAvg, struggle with data heterogeneity, leading to harmful cross-client interference and suboptimal personalization. In this work, we propose FedALT, a novel personalized federated LoRA fine-tuning algorithm that fundamentally departs from FedAvg. Instead of using an aggregated model to initialize local training, each client continues training its individual LoRA while incorporating shared knowledge through a separate Rest-of-World (RoW) LoRA component. To effectively balance local adaptation and global information, FedALT introduces an adaptive mixer that dynamically learns input-specific weightings between the individual and RoW LoRA components, drawing conceptual foundations from the Mixture-of-Experts (MoE) paradigm. Through extensive experiments on NLP benchmarks, we demonstrate that FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods, achieving superior local adaptation without sacrificing computational efficiency.
comment: Accepted by AAAI 2026
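The FedALT abstract above describes an adaptive mixer that learns input-specific weightings between a client's individual LoRA and a Rest-of-World (RoW) LoRA. The exact gating function is not given, so the sketch below assumes a single sigmoid gate and a convex combination of the two low-rank updates; the function names, the gate parameterization, and the tiny matrices are all hypothetical.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(A, B, x):
    """Low-rank update B @ (A @ x)."""
    return matvec(B, matvec(A, x))


def mixed_forward(W, local, row, gate_w, x):
    """y = W x + a(x) * local_LoRA(x) + (1 - a(x)) * RoW_LoRA(x).

    local and row are (A, B) pairs; a(x) is an assumed sigmoid gate.
    """
    a = 1.0 / (1.0 + math.exp(-sum(g * xi for g, xi in zip(gate_w, x))))
    base = matvec(W, x)
    d_loc = lora_delta(local[0], local[1], x)
    d_row = lora_delta(row[0], row[1], x)
    return [b + a * l + (1.0 - a) * r for b, l, r in zip(base, d_loc, d_row)]


W = [[1.0, 0.0], [0.0, 1.0]]            # frozen base weight
local = ([[1.0, 0.0]], [[1.0], [0.0]])  # rank-1 individual LoRA (A, B)
row = ([[0.0, 1.0]], [[0.0], [1.0]])    # rank-1 RoW LoRA (A, B)
y = mixed_forward(W, local, row, gate_w=[0.0, 0.0], x=[2.0, 4.0])
```

With zero gate weights the mixer splits evenly between the two components; training the gate would let each input lean toward local or shared knowledge.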
Interpretable LLM Guardrails via Sparse Representation Steering
Large language models (LLMs) exhibit impressive capabilities in generation tasks but are prone to producing harmful, misleading, or biased content, posing significant ethical and safety concerns. To mitigate such risks, representation engineering, which steers model behavior toward desired attributes by injecting carefully designed steering vectors into LLMs' representations at inference time, has emerged as a promising alternative to fine-tuning approaches. However, due to the semantically entangled nature of LLMs' representations, existing representation engineering methods still suffer from several limitations: limited fine-grained controllability, content quality degradation, and conflict in multi-attribute control. To overcome these challenges, we propose Sparse Representation Steering (SRS), a novel framework that achieves fine-grained and interpretable control over LLM behavior by first disentangling internal activations into a sparse, semantically meaningful representation space, and then selectively steering relevant dimensions. Specifically, SRS leverages a pretrained Sparse Autoencoder (SAE) to transform dense, entangled activation patterns into a sparse monosemantic feature space. To identify relevant features, SRS contrasts sparse activations from positive and negative prompt pairs and measures their bidirectional KL divergence to locate dimensions most associated with the target attribute. We conduct comprehensive experiments on Gemma-2 series models across three alignment dimensions, i.e., safety, fairness, and truthfulness, to evaluate the effectiveness of SRS. Results show that SRS consistently outperforms existing steering methods, achieving significantly improved controllability across both single and multiple attribute settings, while preserving high linguistic quality and general ability.
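The feature-selection step described above (contrasting sparse activations from positive and negative prompts and scoring dimensions by bidirectional KL divergence) can be illustrated with a small sketch. The abstract does not say how each dimension's activations are turned into distributions, so this example assumes each SAE feature is summarized by its firing rate and compared as a Bernoulli variable; the function names and the smoothing constant are assumptions.

```python
import math


def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)) for probabilities strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def firing_rate(activations, eps=1e-3):
    """Fraction of prompts where the feature is active, clamped off 0 and 1."""
    rate = sum(1 for a in activations if a > 0) / len(activations)
    return min(max(rate, eps), 1 - eps)


def rank_features(pos_acts, neg_acts, top_k=2):
    """Rank feature ids by KL(p||q) + KL(q||p) between prompt groups.

    pos_acts / neg_acts map feature id -> list of activations over prompts.
    """
    scores = {}
    for fid in pos_acts:
        p = firing_rate(pos_acts[fid])
        q = firing_rate(neg_acts[fid])
        scores[fid] = bernoulli_kl(p, q) + bernoulli_kl(q, p)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Features that fire on most positive prompts but rarely on negative ones (or the reverse) receive the highest scores and would be the candidates for steering.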
Efficient Reasoning via Thought-Training and Thought-Free Inference
Recent advances in large language models (LLMs) have leveraged explicit Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most existing methods primarily compress verbose reasoning outputs. These Long-to-Short transformations aim to improve efficiency, but still rely on explicit reasoning during inference. In this work, we introduce 3TF (Thought-Training and Thought-Free inference), a framework for efficient reasoning that takes a Short-to-Long perspective. We first train a hybrid model that can operate in both reasoning and non-reasoning modes, and then further train it on CoT-annotated data to internalize structured reasoning, while enforcing concise, thought-free outputs at inference time using the no-reasoning mode. Unlike compression-based approaches, 3TF improves the reasoning quality of non-reasoning outputs, enabling models to perform rich internal reasoning implicitly while keeping external outputs short. Empirically, 3TF-trained models obtain large improvements on reasoning benchmarks under thought-free inference, demonstrating that high-quality reasoning can be learned and executed implicitly without explicit step-by-step generation.
comment: 11 pages, 4 figures
First-Order Error Matters: Accurate Compensation for Quantized Large Language Models AAAI 2026
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an approximation based on the difference between latent and full-precision weights as well as the Hessian matrix. When substituted into the theoretical solution, the formulation eliminates the need to explicitly compute the Hessian, thereby avoiding the high computational cost and limited generalization of backpropagation-based gradient methods. This design introduces only minimal additional computational overhead. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 17.3% and increases the 5-shot MMLU accuracy from 53.8% achieved by GPTAQ to 56.1%. Moreover, FOEM can be seamlessly combined with advanced techniques such as SpinQuant, delivering additional gains under the challenging W4A4KV4 setting and further narrowing the performance gap with full-precision baselines, surpassing existing state-of-the-art methods.
comment: Accepted by AAAI 2026. The code is available at https://github.com/Xingyu-Zheng/FOEM
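The first-order expansion mentioned in the FOEM abstract above can be written out as a short worked step. This is a reading of the abstract rather than the paper's own derivation: the symbols (w for the full-precision weights, \tilde{w} for the latent weights during compensation, g for the loss gradient, H for the Hessian) and the assumption that the gradient vanishes at the well-trained full-precision weights are introduced here for illustration.

```latex
% Assumed notation: w = full-precision weights, \tilde{w} = latent weights,
% g(\cdot) = loss gradient, H = Hessian evaluated at w.
\begin{align}
  g(\tilde{w}) &\approx g(w) + H\,(\tilde{w} - w)
      && \text{(first-order Taylor expansion around } w\text{)} \\
               &\approx H\,(\tilde{w} - w)
      && \text{(assuming } g(w) \approx 0 \text{ for a well-trained model)}
\end{align}
```

Under these assumptions the first-order term depends only on the weight difference and the Hessian, which matches the abstract's statement that the gradient is approximated from those two quantities rather than by backpropagation.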
ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance AACL 2025
Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being its support for long context, faster training, and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.
comment: Published as a conference paper at IJCNLP-AACL 2025
Survey in Characterization of Semantic Change
Live languages continuously evolve to integrate the cultural change of human societies. This evolution manifests through neologisms (new words) or semantic changes of words (new meaning to existing words). Understanding the meaning of words is vital for interpreting texts coming from different cultures (regionalism or slang), domains (e.g., technical terms), or periods. In computer science, these words are relevant to computational linguistics algorithms such as translation, information retrieval, question answering, etc. Semantic changes can potentially impact the quality of the outcomes of these algorithms. Therefore, it is important to understand and characterize these changes formally. The study of this impact is a recent problem that has attracted the attention of the computational linguistics community. Several approaches propose methods to detect semantic changes with good precision, but more effort is needed to characterize how the meaning of words changes and to reason about how to reduce the impact of semantic change. This survey provides an understandable overview of existing approaches to the characterization of semantic changes and also formally defines three classes of characterizations: if the meaning of a word becomes more general or narrow (change in dimension), if the word is used in a more pejorative or positive/ameliorated sense (change in orientation), and if there is a trend to use the word in a, for instance, metaphoric or metonymic context (change in relation). We summarize the main aspects of the selected publications in a table and discuss the needs and trends in the research activities on semantic change characterization.
Beyond the Surface: Probing the Ideological Depth of Large Language Models
Large language models (LLMs) display recognizable political leanings, yet they vary significantly in their ability to represent a political orientation consistently. In this paper, we define ideological depth as (i) a model's ability to follow political instructions without failure (steerability), and (ii) the feature richness of its internal political representations measured with sparse autoencoders (SAEs), an unsupervised sparse dictionary learning (SDL) approach. Using Llama-3.1-8B-Instruct and Gemma-2-9B-IT as candidates, we compare prompt-based and activation-steering interventions and probe political features with publicly available SAEs. We find large, systematic differences: Gemma is more steerable in both directions and activates approximately 7.3x more distinct political features than Llama. Furthermore, causal ablations of a small targeted set of Gemma's political features to create a similar feature-poor setting induce consistent shifts in its behavior, with increased rates of refusals across topics. Together, these results indicate that refusals on benign political instructions or prompts can arise from capability deficits rather than safety guardrails. Ideological depth thus emerges as a measurable property of LLMs, and steerability serves as a window into their latent political architecture.
Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions EMNLP 2025
Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and strongly aligns with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.
comment: Accepted at EMNLP 2025
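The balanced accuracy figure quoted for P-MESA can be read as the mean of per-class recall, which is the standard definition of the metric. The sketch below illustrates that definition only; the function name `balanced_accuracy`, the binary label encoding, and the toy labels are assumptions for illustration, not the paper's evaluation code or data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels.

    Assumed encoding: 1 = error instance present, 0 = no error.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Hypothetical human annotations and metric predictions (illustration only).
human = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
metric = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(human, metric)
```

Unlike plain accuracy, this average is not inflated when one class (here, summaries without errors) dominates the annotations.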
Improving the Downstream Performance of Mixture-of-Experts Transformers via Weak Vanilla Transformers
Recently, Mixture of Experts (MoE) Transformers have garnered increasing attention due to their advantages in model capacity and computational efficiency. However, studies have indicated that MoE Transformers underperform vanilla Transformers in many downstream tasks, significantly diminishing the practical value of MoE models. To explain this issue, we propose that the pre-training performance and transfer capability of a model are joint determinants of its downstream task performance. MoE models, in comparison to vanilla models, have poorer transfer capability, leading to their subpar performance in downstream tasks. To address this issue, we introduce the concept of transfer capability distillation, positing that although vanilla models have weaker performance, they are effective teachers of transfer capability. The MoE models guided by vanilla models can achieve both strong pre-training performance and transfer capability, ultimately enhancing their performance in downstream tasks. We design a specific distillation method and conduct experiments on the BERT architecture. Experimental results show a significant improvement in downstream performance of MoE models, and further evidence also strongly supports the concept of transfer capability distillation. Finally, we attempt to interpret transfer capability distillation and provide some insights from the perspective of model features.
MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion
Despite the demonstrated parameter efficiency of prompt-based fusion, its limited adaptivity and expressiveness hinder its effectiveness for multimodal applications at scale. In this paper, we present the first comprehensive study addressing these limitations. Our key motivation is to "divide and conquer" the vanilla prompt, traditionally shared across all instances, by generating instance-specific prompts. Specifically, we propose the Mixture of Prompt Experts (MoPE), a framework that significantly enhances prompt adaptivity and expressiveness by dynamically generating instance-specific prompts. MoPE leverages multimodal pairings as additional evidence, allowing the model to adaptively select optimal prompts tailored to each individual instance. Unlike traditional prompt-fusion methods, which encounter scalability bottlenecks when optimizing long unified prompts, MoPE maintains fixed prompt length while effectively scaling the number of specialized experts. Moreover, we investigate regularization terms to encourage expert specialization, resulting in highly adaptive and interpretable prompting. MoPE fundamentally changes the scaling dynamic, unlocking greater expressiveness and adaptability to complex multimodal relationships, enabling the model to selectively attend to task-relevant sub-sequences based on instance-specific multimodal input. Extensive experiments across six multimodal datasets spanning four modalities demonstrate state-of-the-art performance for multimodal fusion, matching or surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Code is available: https://github.com/songrise/MoPE.
comment: Accepted to IEEE TMM
Figurative Archive: an open dataset and web-based application for the study of metaphor
Research on metaphor has steadily increased over the last decades, as this phenomenon opens a window into a range of linguistic and cognitive processes. At the same time, the demand for rigorously constructed and extensively normed experimental materials increased as well. Here, we present the Figurative Archive, an open database of 996 metaphors in Italian enriched with rating and corpus-based measures (from familiarity to semantic distance and preferred interpretations), derived by collecting stimuli used across 11 studies. It includes both everyday and literary metaphors, varying in structure and semantic domains, and is validated based on correlations between familiarity and other measures. The Archive has several aspects of novelty: it is increased in size compared to previous resources; it offers a measure of metaphor inclusiveness, to comply with recommendations for non-discriminatory language use; it is displayed in a web-based interface, with features for a customized consultation. We provide guidelines for using the Archive to source materials for studies investigating metaphor processing and relationships between metaphor features in humans and computational models.
A Critical Study of Automatic Evaluation in Sign Language Translation LREC 2026
Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
comment: Submitted to the LREC 2026 conference
Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning AAAI'26
With the rapid and continuous increase in academic publications, identifying high-quality research has become an increasingly pressing challenge. While recent methods leveraging Large Language Models (LLMs) for automated paper evaluation have shown great promise, they are often constrained by outdated domain knowledge and limited reasoning capabilities. In this work, we present PaperEval, a novel LLM-based framework for automated paper evaluation that addresses these limitations through two key components: 1) a domain-aware paper retrieval module that retrieves relevant concurrent work to support contextualized assessments of novelty and contributions, and 2) a latent reasoning mechanism that enables deep understanding of complex motivations and methodologies, along with comprehensive comparison against concurrently related work, to support more accurate and reliable evaluation. To guide the reasoning process, we introduce a progressive ranking optimization strategy that encourages the LLM to iteratively refine its predictions with an emphasis on relative comparison. Experiments on two datasets demonstrate that PaperEval consistently outperforms existing methods in both academic impact and paper quality evaluation. In addition, we deploy PaperEval in a real-world paper recommendation system for filtering high-quality papers, which has gained strong engagement on social media -- amassing over 8,000 subscribers and attracting over 10,000 views for many filtered high-quality papers -- demonstrating the practical effectiveness of PaperEval.
comment: Accepted for publication in AAAI'26
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
comment: 82 pages
Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL
The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
Identifying and Analyzing Performance-Critical Tokens in Large Language Models AAAI 2026
In-context learning (ICL) has emerged as an effective solution for few-shot learning with large language models (LLMs). However, how LLMs leverage demonstrations to specify a task and learn a corresponding computational function through ICL is underexplored. Drawing from the way humans learn from content-label mappings in demonstrations, we categorize the tokens in an ICL prompt into content, stopword, and template tokens. Our goal is to identify the types of tokens whose representations directly influence LLM's performance, a property we refer to as being performance-critical. By ablating representations from the attention of the test example, we find that the representations of informative content tokens have less influence on performance compared to template and stopword tokens, which contrasts with the human attention to informative words. We give evidence that the representations of performance-critical tokens aggregate information from the content tokens. Moreover, we demonstrate experimentally that lexical meaning, repetition, and structural cues are the main distinguishing characteristics of these tokens. Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.
comment: AAAI 2026
Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks
The Nepali language has distinct linguistic features, especially its complex script (Devanagari script), morphology, and various dialects, which pose a unique challenge for Natural Language Understanding (NLU) tasks. While the Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a foundation for evaluating models, it remains limited in scope, covering four tasks. This restricts its utility for comprehensive assessments of Natural Language Processing (NLP) models. To address this limitation, we introduce twelve new datasets, creating a new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark for evaluating the performance of models across a diverse set of Natural Language Understanding (NLU) tasks. The added tasks include Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference (NLI), and General Masked Evaluation Task (GMET). Through extensive experiments, we demonstrate that existing top models struggle with the added complexity of these tasks. We also find that the best multilingual model outperforms the best monolingual models across most tasks, highlighting the need for more robust solutions tailored to the Nepali language. This expanded benchmark sets a new standard for evaluating, comparing, and advancing models, contributing significantly to the broader goal of advancing NLP research for low-resource languages.
PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI" (Pustak means 'book' in many Indian languages) for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.
Activation-Guided Consensus Merging for Large Language Models
Recent research has increasingly focused on reconciling the reasoning capabilities of System 2 with the efficiency of System 1. While existing training-based and prompt-based approaches face significant challenges in terms of efficiency and stability, model merging emerges as a promising strategy to integrate the diverse capabilities of different Large Language Models (LLMs) into a unified model. However, conventional model merging methods often assume uniform importance across layers, overlooking the functional heterogeneity inherent in neural components. To address this limitation, we propose Activation-Guided Consensus Merging (ACM), a plug-and-play merging framework that determines layer-specific merging coefficients based on mutual information between activations of pre-trained and fine-tuned models. ACM effectively preserves task-specific capabilities without requiring gradient computations or additional training. Extensive experiments on Long-to-Short (L2S) and general merging tasks demonstrate that ACM consistently outperforms all baseline methods. For instance, in the case of Qwen-7B models, TIES-Merging equipped with ACM achieves a 55.3% reduction in response length while simultaneously improving reasoning accuracy by 1.3 points.
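The idea of layer-specific merging coefficients can be illustrated with a deliberately simplified sketch: each layer of a fine-tuned model is interpolated toward the pre-trained weights with its own coefficient. Everything here is an assumption for illustration: the function `merge_layerwise`, the dictionary layout, and the coefficient values are hypothetical, the coefficients are taken as given rather than derived from mutual information, and only one fine-tuned model is merged. It is not the ACM implementation.

```python
def merge_layerwise(pretrained, finetuned, coeffs):
    """Interpolate each layer with its own coefficient.

    pretrained, finetuned: dict mapping layer name -> list of float weights.
    coeffs: dict mapping layer name -> coefficient (hypothetical values;
            the abstract derives them from activation mutual information).
    A coefficient of 0 keeps the pre-trained layer, 1 keeps the fine-tuned one.
    """
    merged = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch in layer {name!r}")
        alpha = coeffs[name]
        # base + alpha * (task vector), applied element-wise
        merged[name] = [b + alpha * (t - b) for b, t in zip(base, tuned)]
    return merged


# Toy two-layer "models" with made-up weights and coefficients.
pre = {"layer0": [0.0, 1.0], "layer1": [2.0, 2.0]}
ft = {"layer0": [1.0, 3.0], "layer1": [4.0, 0.0]}
alphas = {"layer0": 0.25, "layer1": 1.0}
merged = merge_layerwise(pre, ft, alphas)
```

A uniform-coefficient baseline corresponds to using the same value for every layer, which is the assumption the abstract argues against.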
DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation AAAI 2026
Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ LLM for high-level planning to decompose a story into scene-level descriptions, which are then independently generated and stitched together. However, these approaches struggle with generating high-quality videos aligned with the complex single-scene description, as visualizing such complex description involves coherent composition of multiple characters and events, complex motion synthesis and multi-character customization. To address these challenges, we propose DREAMRUNNER, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout planning. Next, DREAMRUNNER presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame spatial-temporal semantic control. We compare DREAMRUNNER with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DREAMRUNNER exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DREAMRUNNER's robust ability to generate multi-object interactions with qualitative examples.
comment: AAAI 2026, Project website: https://zunwang1.github.io/DreamRunner
Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction AAAI 2026
Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM's intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes. The source code is available at https://github.com/OpenSPG/KAG-Thinker.
comment: Accepted to AAAI 2026. Extended version with full Appendix
Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?
While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model's ability to extract causal information. Our strategies achieve an average relative performance improvement of 56% under shortcut scenarios.
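The percentages above are relative changes, which differ from absolute differences in accuracy points. The short sketch below shows the arithmetic under the usual definition (change divided by the starting value); the helper name `relative_change` and the accuracy values are hypothetical and are not taken from the paper.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before


# Hypothetical accuracies: in-distribution vs. under a shortcut shift.
clean_acc = 0.80
shifted_acc = 0.544
drop = relative_change(clean_acc, shifted_acc)      # about -0.32 (a 32% relative drop)
absolute_gap = clean_acc - shifted_acc              # about 0.256 (25.6 accuracy points)
```

With these made-up numbers, a 32% relative drop corresponds to 25.6 absolute points, so the two ways of reporting should not be confused.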
Wage Sentiment Indices Derived from Survey Comments via Large Language Models
The emergence of generative Artificial Intelligence (AI) has created new opportunities for economic text analysis. This study proposes a Wage Sentiment Index (WSI) constructed with Large Language Models (LLMs) to forecast wage dynamics in Japan. The analysis is based on the Economy Watchers Survey (EWS), a monthly survey conducted by the Cabinet Office of Japan that captures real-time economic assessments from workers in industries highly sensitive to business conditions. The WSI extends the framework of the Price Sentiment Index (PSI) used in prior studies, adapting it specifically to wage related sentiment. To ensure scalability and adaptability, a data architecture is also developed that enables integration of additional sources such as newspapers and social media. Experimental results demonstrate that WSI models based on LLMs significantly outperform both baseline approaches and pretrained models. These findings highlight the potential of LLM-driven sentiment indices to enhance the timeliness and effectiveness of economic policy design by governments and central banks.
comment: Accepted to IEEE Big Data 2025. 10 pages, 2 tables, 16 figures
Are language models rational? The case of coherence norms and belief revision
Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms as well as coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new account of credence, which captures the strength of belief in language models. This proposal uniformly assigns strength of belief simply on the basis of model internal next token probabilities. We argue that rational norms tied to coherence do apply to some language models, but not to others. This issue is significant since rationality is closely tied to predicting and explaining behavior, and thus it is connected to considerations about AI safety and alignment, as well as understanding model behavior more generally.
comment: substantial expansions of sections 4 and 5, updated references, numerous smaller additions and clarifications
Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning NeurIPS 2025
Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model's learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot's inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot's logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability. Our code is released at https://github.com/jiaruzouu/TransformerCopilot.
comment: NeurIPS 2025 Spotlight
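The fused inference step, in which the Copilot rectifies the Pilot's logits, can be pictured with a minimal sketch. The additive form, the scaling factor `lam`, the function names, and the numbers below are all assumptions made for illustration; the abstract does not specify how the correction is combined with the logits, and this is not the released implementation.

```python
def rectify_logits(pilot_logits, copilot_correction, lam=1.0):
    """Add a scaled correction to the Pilot's logits (assumed additive form)."""
    if len(pilot_logits) != len(copilot_correction):
        raise ValueError("logits and correction must cover the same vocabulary")
    return [p + lam * c for p, c in zip(pilot_logits, copilot_correction)]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy three-token vocabulary with made-up numbers.
pilot = [2.0, 1.5, 0.1]        # the Pilot alone prefers token 0
correction = [-1.0, 0.5, 0.0]  # hypothetical Copilot output
rectified = rectify_logits(pilot, correction)
token_before = argmax(pilot)
token_after = argmax(rectified)
```

In this toy case the correction flips the greedy choice from token 0 to token 1, which is the kind of effect a rectification step is meant to have when the Pilot repeats a logged mistake.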
Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
comment: 23 pages
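Spearman's correlation, which the Res-Bench abstract uses to assess resolution-performance trends, is the Pearson correlation computed on ranks. The sketch below implements that standard definition in plain Python; the resolution and accuracy values are hypothetical, and the ACE/RCE metrics are not shown because the abstract does not define them.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sx * sy)


# Hypothetical resolutions (pixels) and accuracies, for illustration only.
resolutions = [224, 336, 448, 672, 896]
accuracies = [0.41, 0.47, 0.52, 0.58, 0.61]
rho = spearman(resolutions, accuracies)
```

A value near +1 indicates that accuracy rises monotonically with resolution, a value near -1 that it falls, and a value near 0 that there is no consistent monotonic trend.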
Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model's context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
comment: 20 pages, 13 figures
DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts
Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. However, existing benchmarks mostly test surface-level parsing, such as reading labels and legends, while overlooking deeper scientific reasoning. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning. It integrates complexity-aware chart selection, multitier QA generation, and expert validation. Applied to astronomy, DomainCQA yields AstroChart, a benchmark of 1,690 QA pairs over 482 charts, exposing persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improves performance across fundamental and advanced tasks. Pilot QA sets in biochemistry, economics, medicine, and social science further demonstrate DomainCQA's generality. Together, our results establish DomainCQA as a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks.
comment: 83 pages, 59 figures
Scaling Latent Reasoning via Looped Language Models
Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models match the results of SOTA LLMs of up to 12B parameters across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
Instella: Fully Open Language Models with Stellar Performance
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data AAAI 2026
Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including texts, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Yet it remains unclear whether such biases are systematic, which data-level factors drive them, and what internal mechanisms underlie their emergence. In this paper, we present the first comprehensive study of format bias in LLMs through a three-stage empirical analysis. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates a lightweight intervention to test its effectiveness. Our results show that format bias is consistent across model families, driven by information richness, structure quality, and representation type, and is closely associated with attention imbalance within the LLMs. Based on these investigations, we identify three future research directions to reduce format bias: enhancing data pre-processing through format repair and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.
comment: Accepted by AAAI 2026, camera ready version
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (<=20B active parameters) now achieve performance competitive with frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (i.e., laptops). We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals three findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain. Second, from 2023 to 2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.
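As a rough illustration of the IPW definition above (task accuracy divided by power), the following Python sketch computes the ratio for one model-accelerator run. The `Run` record, its field names, and the choice of mean power (energy divided by wall-clock time) are assumptions made for this example; they are not taken from the authors' profiling harness.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one benchmark run (field names are assumptions)."""
    correct: int             # number of queries answered correctly
    total: int               # number of queries attempted
    energy_joules: float     # energy drawn during the run
    duration_seconds: float  # wall-clock duration of the run


def intelligence_per_watt(run: Run) -> float:
    """Accuracy divided by mean power draw in watts."""
    if run.total <= 0 or run.energy_joules <= 0 or run.duration_seconds <= 0:
        raise ValueError("total, energy, and duration must be positive")
    accuracy = run.correct / run.total
    # Assumption: power is averaged over the run (1 W = 1 J/s).
    mean_power_watts = run.energy_joules / run.duration_seconds
    return accuracy / mean_power_watts


if __name__ == "__main__":
    example = Run(correct=80, total=100, energy_joules=4000.0, duration_seconds=100.0)
    print(intelligence_per_watt(example))
```

In this example the run is 80% accurate at a mean draw of 40 W. Because the metric is a ratio, comparing two accelerators on identical models and queries isolates the efficiency difference between them.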
Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training
Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found that 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-trained from scratch, and 3) incremental pruning brings non-trivial performance gain by interleaving pruning with training and only removing a small portion of neurons ($\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200$\times$ fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks. The official code is released at https://github.com/research4pan/AdaptPruner.
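The incremental schedule mentioned in the abstract above (removing roughly 5% of neurons at a time, interleaved with training) can be pictured with a small Python sketch. It only computes how many neurons remain after each round; the function name, the integer-percent parameter, and the assumption that the 5% applies to the neurons currently remaining are illustrative choices, not the released Adapt-Pruner code.

```python
def incremental_pruning_schedule(initial_neurons: int,
                                 target_neurons: int,
                                 step_percent: int = 5) -> list[int]:
    """Return the neuron count after each pruning round.

    Assumption: each round removes `step_percent` percent of the neurons
    that are still present, and never prunes below `target_neurons`.
    """
    if not 0 < target_neurons <= initial_neurons:
        raise ValueError("need 0 < target_neurons <= initial_neurons")
    if not 0 < step_percent < 100:
        raise ValueError("step_percent must be between 1 and 99")

    counts = [initial_neurons]
    current = initial_neurons
    while current > target_neurons:
        # Remove a small slice (at least one neuron so the loop terminates).
        to_remove = max(1, current * step_percent // 100)
        current = max(target_neurons, current - to_remove)
        counts.append(current)
        # Placeholder: a real pipeline would run a training phase here
        # before the next pruning round.
    return counts


if __name__ == "__main__":
    print(incremental_pruning_schedule(1000, 900))
```

Under these assumptions, pruning from 1000 to 900 neurons takes three rounds (1000, 950, 903, 900), each of which would be followed by a recovery training phase.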
Human-Computer Interaction
Context-aware Adaptive Visualizations for Critical Decision Making
Effective decision-making often relies on timely insights from complex visual data. While Information Visualization (InfoVis) dashboards can support this process, they rarely adapt to users' cognitive state, and less so in real time. We present Symbiotik, an intelligent, context-aware adaptive visualization system that leverages neurophysiological signals to estimate mental workload (MWL) and dynamically adapt visual dashboards using reinforcement learning (RL). Through a user study with 120 participants and three visualization types, we demonstrate that our approach improves task performance and engagement. Symbiotik offers a scalable, real-time adaptation architecture, and a validated methodology for neuroadaptive user interfaces.
Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping
Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain's hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.
KarmaTS: A Universal Simulation Platform for Multivariate Time Series with Functional Causal Dynamics
We introduce KarmaTS, an interactive framework for constructing lag-indexed, executable spatiotemporal causal graphical models for multivariate time series (MTS) simulation. Motivated by the challenge of access-restricted physiological data, KarmaTS generates synthetic MTS with known causal dynamics and augments real-world datasets with expert knowledge. The system constructs a discrete-time structural causal process (DSCP) by combining expert knowledge and algorithmic proposals in a mixed-initiative, human-in-the-loop workflow. The resulting DSCP supports simulation and causal interventions, including those under user-specified distribution shifts. KarmaTS handles mixed variable types, contemporaneous and lagged edges, and modular edge functionals ranging from parameterizable templates to neural network models. Together, these features enable flexible validation and benchmarking of causal discovery algorithms through expert-informed simulation.
Building the Web for Agents: A Declarative Framework for Agent-Web Interaction
The increasing deployment of autonomous AI agents on the web is hampered by a fundamental misalignment: agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions. To address this, we introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents through simple, declarative HTML elements. VOIX introduces <tool> and <context> tags, allowing developers to explicitly define available actions and relevant state, thereby creating a clear, machine-readable contract for agent behavior. This approach shifts control to the website developer while preserving user privacy by disconnecting the conversational interactions from the website. We evaluated the framework's practicality, learnability, and expressiveness in a three-day hackathon study with 16 developers. The results demonstrate that participants, regardless of prior experience, were able to rapidly build diverse and functional agent-enabled web applications. Ultimately, this work provides a foundational mechanism for realizing the Agentic Web, enabling a future of seamless and secure human-AI collaboration on the web.
comment: for associated documentation, see https://svenschultze.github.io/VOIX/
Devising Experiments with Interactive Environments
This paper reports a practice-based investigation into authoring responsive light and sound in immersive performance without writing code. A modular system couples live gesture, position, and speech inputs to scenographic outputs through a visual logic layer that performers can operate in rehearsal. Across six workshops with eight professional performance-makers, we staged a progression from parallel ensemble and technical training to integrated dramaturgy, culminating in a single-spectator scratch immersive performance with interactive elements. This paper details the system's building blocks and the workshop arc. A reflexive reading of workshop video logs, post-workshop focus groups, and facilitator notes surfaced three ensemble-level strategies that made the technology workable in a hybrid devising/design practice: rotating roles between operator, performer, and mediator; embracing controlled imperfection as a creative resource; and using technology-describing metaphors to support creative practice.
comment: Presented at Performing Space 2025 (Organised by the Performing Space Association & University of the Peloponnese), Nafplio, Greece, 4-7 July 2025
Towards Usable Privacy Management for IoT TAPs: Deriving Privacy Clusters and Preference Profiles
IoT Trigger-Action Platforms (TAPs) typically offer coarse-grained permission controls. Even when fine-grained controls are available, users are likely overwhelmed by the complexity of setting privacy preferences. This paper contributes to usable privacy management for TAPs by deriving privacy clusters and profiles for different types of users that can be semi-automatically assigned or suggested to them. We developed and validated a questionnaire, based on users' privacy concerns regarding confidentiality and control and their requirements towards transparency in TAPs. In an online study (N=301), where participants were informed about potential privacy risks, we clustered users by their privacy concerns and requirements into Basic, Medium and High Privacy clusters. These clusters were then characterized by the users' data sharing preferences, based on a factorial vignette approach, considering the data categories, the data recipient types, and the purpose of data sharing. Our findings show three distinct privacy profiles, providing a foundation for more usable privacy controls in TAPs.
ReTrace: Interactive Visualizations for Reasoning Traces of Large Reasoning Models
Recent advances in Large Language Models have led to Large Reasoning Models, which produce step-by-step reasoning traces. These traces offer insight into how models think and their goals, improving explainability and helping users follow the logic, learn the process, and even debug errors. These traces, however, are often verbose and complex, making them cognitively demanding to comprehend. We address this challenge with ReTrace, an interactive system that structures and visualizes textual reasoning traces to support understanding. We use a validated reasoning taxonomy to produce structured reasoning data and investigate two types of interactive visualizations thereof. In a controlled user study, both visualizations enabled users to comprehend the model's reasoning more accurately and with less perceived effort than a raw text baseline. The results of this study could have design implications for making long and complex machine-generated reasoning processes more usable and transparent, an important step in AI explainability.
C2Views: Knowledge-based Colormap Design for Multiple-View Consistency
Multiple-view (MV) visualization provides a comprehensive and integrated perspective on complex data, establishing itself as an effective method for visual communication and exploratory data analysis. While existing studies have predominantly focused on designing explicit visual linkages and coordinated interactions to facilitate the exploration of MV visualizations, these approaches often demand extra graphical and interactive effort, overlooking the potential of color as an effective channel for encoding data and relationships. Addressing this oversight, we introduce C2Views, a new framework for colormap design that implicitly shows the relation across views. We begin by structuring the components and their relationships within MVs into a knowledge-based graph specification, wherein colormaps, data, and views are denoted as entities, and the interactions among them are illustrated as relations. Building on this representation, we formulate the design criteria as an optimization problem and employ a genetic algorithm enhanced by Pareto optimality, generating colormaps that balance single-view effectiveness and multiple-view consistency. Our approach is further complemented with an interactive interface for user-intended refinement. We demonstrate the feasibility of C2Views through various colormap design examples for MVs, underscoring its adaptability to diverse data relationships and view layouts. Comparative user studies indicate that our method outperforms the existing approach in facilitating color distinction and enhancing multiple-view consistency, thereby simplifying data exploration processes.
comment: 15 pages, 10 figures. Accepted to Proceedings of the Pacific Conference on Computer Graphics and Applications, 2025
Gynopticon: Consensus-Based Cheating Detection System for Competitive Games
Cheating in online games poses significant threats to the gaming industry, yet most prior research has concentrated on Massively Multiplayer Online Role-Playing Games (MMORPGs). Competitive genres, such as Multiplayer Online Battle Arena (MOBA), First Person Shooter (FPS), Real Time Strategy (RTS), and Action games, remain underexplored due to the difficulty of detecting cheating users and the demand for complex data and techniques. To address this gap, many game companies rely on kernel-level anti-cheat solutions, which, while effective, raise serious concerns regarding user privacy and system security. In this paper, we propose GYNOPTICON, a novel cheating detection framework that leverages user consensus to identify abnormal behavior. GYNOPTICON integrates a lightweight client-side detection mechanism with a server-side voting system: when suspicious activity is identified, clients cast votes to the server, which aggregates them to establish consensus and distinguish cheaters from legitimate players. This architecture enables transparency, reduces reliance on intrusive monitoring, and mitigates privacy risks. We evaluate GYNOPTICON in both a controlled simulation and a real-world FPS environment. Simulation results verify its feasibility and requirements, while real-world experiments confirm its effectiveness in reliably detecting cheating users. Furthermore, we demonstrate the system's applicability and sustainability for long-term game management using public datasets. GYNOPTICON represents a user-driven, consensus-based alternative to conventional anti-cheat systems, offering a practical and privacy-preserving solution for competitive online games.
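The server-side aggregation step described above can be sketched in a few lines of Python. This is a deliberately simple majority-style rule: the function name, the vote format, and the 0.5 threshold are assumptions for illustration, and the abstract does not specify how GYNOPTICON actually establishes consensus.

```python
from collections import Counter


def flag_cheaters(votes, num_players: int, threshold: float = 0.5) -> set[str]:
    """Flag players accused by more than `threshold` of the other players.

    `votes` is an iterable of (voter_id, suspect_id) pairs sent by clients.
    Assumptions in this sketch: self-votes are ignored, each voter counts
    at most once per suspect, and the denominator is the number of other
    players in the match.
    """
    if num_players < 2:
        raise ValueError("need at least two players")

    seen: set[tuple[str, str]] = set()
    tally: Counter[str] = Counter()
    for voter, suspect in votes:
        if voter == suspect or (voter, suspect) in seen:
            continue  # drop self-votes and duplicate votes
        seen.add((voter, suspect))
        tally[suspect] += 1

    eligible_voters = num_players - 1
    return {s for s, n in tally.items() if n / eligible_voters > threshold}


if __name__ == "__main__":
    match_votes = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "x"), ("x", "a")]
    print(flag_cheaters(match_votes, num_players=5))
```

Deduplicating votes and discarding self-votes keeps a single client from inflating the tally, which matters in any scheme where the clients themselves are untrusted.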
DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition
Micro-expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, two core challenges remain: (1) the entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion semantic alignment by multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion-semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro-expressions while also leveraging the cross-modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.
Multi-Joint Physics-Informed Deep Learning Framework for Time-Efficient Inverse Dynamics
Time-efficient estimation of muscle activations and forces across multi-joint systems is critical for clinical assessment and assistive device control. However, conventional approaches are computationally expensive and lack a high-quality labeled dataset for multi-joint applications. To address these challenges, we propose a physics-informed deep learning framework that estimates muscle activations and forces directly from kinematics. The framework employs a novel Multi-Joint Cross-Attention (MJCA) module with Bidirectional Gated Recurrent Unit (BiGRU) layers to capture inter-joint coordination, enabling each joint to adaptively integrate motion information from others. By embedding multi-joint dynamics, inter-joint coupling, and external force interactions into the loss function, our Physics-Informed MJCA-BiGRU (PI-MJCA-BiGRU) delivers physiologically consistent predictions without labeled data while enabling time-efficient inference. Experimental validation on two datasets demonstrates that PI-MJCA-BiGRU achieves performance comparable to conventional supervised methods without requiring ground-truth labels, while the MJCA module significantly enhances inter-joint coordination modeling compared to other baseline architectures.
comment: 11 pages
Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering
In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA's feasibility and efficacy in enhancing XR auditory realism.
CollaClassroom: An AI-Augmented Collaborative Learning Platform with LLM Support in the Context of Bangladeshi University Students
CollaClassroom is an AI-enhanced platform that embeds large language models (LLMs) into both individual and group study panels to support real-time collaboration. We evaluate CollaClassroom with Bangladeshi university students (N = 12) through a small-group study session and a pre-post survey. Participants have substantial prior experience with collaborative learning and LLMs and express strong receptivity to LLM-assisted study (92% agree/strongly agree). Usability ratings are positive, including high learnability (67% "easy"), strong reliability (83% "reliable"), and low frustration (83% "not at all"). Correlational analyses show that participants who perceive the LLM as supporting equal participation also view it as a meaningful contributor to discussions (r = 0.86). Moreover, their pre-use expectations of LLM value align with post-use assessments (r = 0.61). These findings suggest that LLMs can enhance engagement and perceived learning when designed to promote equitable turn-taking and transparency across individual and shared spaces. The paper contributes an empirically grounded account of AI-mediated collaboration in a Global South higher-education context, with design implications for fairness-aware orchestration of human-AI teamwork.
Lessons Learned from Developing a Privacy-Preserving Multimodal Wearable for Local Voice-and-Vision Inference
Many promising applications of multimodal wearables require continuous sensing and heavy computation, yet users reject such devices due to privacy concerns. This paper shares our experiences building an ear-mounted voice-and-vision wearable that performs local AI inference using a paired smartphone as a trusted personal edge. We describe the hardware-software co-design of this privacy-preserving system, including challenges in integrating a camera, microphone, and speaker within a 30-gram form factor, enabling wake word-triggered capture, and running quantized vision-language and large-language models entirely offline. Through iterative prototyping, we identify key design hurdles in power budgeting, connectivity, latency, and social acceptability. Our initial evaluation shows that fully local multimodal inference is feasible on commodity mobile hardware with interactive latency. We conclude with design lessons for researchers developing embedded AI systems that balance privacy, responsiveness, and usability in everyday settings.
comment: 7 pages, 5 figures
The Empty Chair: Using LLMs to Raise Missing Perspectives in Policy Deliberations NeurIPS 2025
Deliberation is essential to well-functioning democracies, yet physical, economic, and social barriers often exclude certain groups, reducing representativeness and contributing to issues like group polarization. In this work, we explore the use of large language model (LLM) personas to introduce missing perspectives in policy deliberations. We develop and evaluate a tool that transcribes conversations in real-time and simulates input from relevant but absent stakeholders. We deploy this tool in a 19-person student citizens' assembly on campus sustainability. Participants and facilitators found that the tool was useful to spark new discussions and surfaced valuable perspectives they had not previously considered. However, they also raised skepticism about the ability of LLMs to accurately characterize the perspectives of different groups, especially ones that are already underrepresented. Overall, this case study highlights that while AI personas can usefully surface new perspectives and prompt discussion in deliberative settings, their successful deployment depends on clarifying their limitations and emphasizing that they complement rather than replace genuine participation.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: PersonaLLM: Workshop on LLM Persona Modeling
Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as GPT-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.
comment: Article updated with results from gpt-oss:120b and gpt-oss:20b. 27 pages (13 pages are from appendix), 8 figures, 2 tables, code for experiments replication and supplementary material provided at https://github.com/jtyska/llm-robotics-article/
Enhancing the NAO: Extending Capabilities of Legacy Robots for Long-Term Research
Legacy (unsupported) robotic platforms often lose research utility when manufacturer support ends, preventing integration of modern sensing, speech, and interaction capabilities. We present the Enhanced NAO, a revitalized version of Aldebaran's NAO robot featuring upgraded beamforming microphones, RGB-D and thermal cameras, and additional compute resources in a fully self-contained package. This system combines cloud-based and local models for perception and dialogue, while preserving the NAO's expressive body and behaviors. In a pilot user study validating conversational performance, the Enhanced NAO delivered significantly higher conversational quality and elicited stronger user preference compared to the NAO AI Edition, without increasing response latency. The added visual and thermal sensing modalities established a foundation for future perception-driven interaction. Beyond this implementation, our framework provides a platform-agnostic strategy for extending the lifespan and research utility of legacy robots, ensuring they remain valuable tools for human-robot interaction.
How Long Does It Take to Alleviate Discomfort? A Preliminary Study on Reducing Cybersickness in Novice Users
Cybersickness significantly impacts the user experience in VR applications. Locomotion tunneling is a widely adopted technique for mitigating cybersickness in susceptible users. However, there is a lack of research investigating the effects of prolonged use of locomotion tunneling among novice users. To fill this gap, we used VRChat as our experimental platform. We recruited 24 novice VR users, defined as participants with no prior experience using immersive virtual environments. We collected five days of data within a one-week period. The results indicated that participants exhibited significant mitigation of cybersickness by Day 4. However, a change in the VR scene on Day 5 led to a notable increase in cybersickness symptoms. Qualitative feedback revealed participant-perceived causes of cybersickness and suggested that the effectiveness of locomotion tunneling was limited in some scenarios. Finally, we discuss the limitations of the study and propose directions for future research.
comment: Accepted at ISMAR 2025
Effects of Virtual Controller Representation and Virtuality on Selection Performance in Extended Reality
We present an experiment exploring how the controller's virtual representation impacts target acquisition performance across MR and VR contexts. Participants performed selection tasks comparing four visual configurations: a virtual controller, a virtual hand, both the controller and the hand, and neither representation. We found performance comparable between VR and MR, and switching between them did not impact the user's ability to perform basic tasks. Controller representations mimicking reality enhanced performance across both modes. However, users perceived performance differently in MR, indicating the need for unique MR design considerations, particularly regarding spatial awareness.
AffectGPT-R1: Leveraging Reinforcement Learning for Open-Vocabulary Multimodal Emotion Recognition
Open-Vocabulary Multimodal Emotion Recognition (OV-MER) aims to predict emotions without being constrained by predefined label spaces, enabling fine-grained emotion understanding. Unlike traditional discriminative methods, OV-MER leverages generative models, such as large language models, to capture the full spectrum of emotions and employs emotion wheels (EWs) for metric calculation. Previous approaches (e.g., AffectGPT) primarily rely on token-level loss during training. However, this objective is misaligned with the metrics used in OV-MER, while these metrics cannot be optimized via gradient backpropagation. In this paper, we propose AffectGPT-R1, a reinforcement learning framework that formulates EW-based metrics as a reward function and employs a policy-based optimization strategy to maximize this reward. Additionally, we introduce an extra reasoning process and investigate its necessity in OV-MER. To further refine model behavior, we incorporate auxiliary rewards that constrain both reasoning and emotion prediction. To prevent reward hacking, we propose to incorporate length penalties during training. Experimental results show that AffectGPT-R1 achieves substantial improvements on OV-MER. Beyond this task, our approach also enhances generalized emotion understanding, attaining state-of-the-art performance on MER-UniBench. To the best of our knowledge, this is the first work to adapt the R1-style methodology for emotion understanding, revealing the impact of reasoning processes and reinforcement learning in this domain. Our code is provided in the supplementary material and will be released to facilitate future research.
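The idea of turning a non-differentiable, set-based metric into a reward with a length penalty can be sketched as follows. This is an illustrative sketch only, not the AffectGPT-R1 implementation: the set-overlap F1 score is a simplified stand-in for the emotion-wheel (EW) metrics, and the names `set_f1`, `length_penalty`, `reward`, `max_tokens`, and `weight` are all hypothetical.

```python
def set_f1(predicted, reference):
    """F1 overlap between two sets of emotion labels.

    Assumption: a plain set-overlap score stands in for the EW-based metric.
    """
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {r.strip().lower() for r in reference if r.strip()}
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_penalty(num_tokens, max_tokens=256, weight=0.5):
    """Penalty that grows linearly once a response exceeds max_tokens.

    Assumption: the threshold and weight are placeholder values.
    """
    if num_tokens <= max_tokens:
        return 0.0
    excess = (num_tokens - max_tokens) / max_tokens
    return weight * min(excess, 1.0)  # cap the penalty at `weight`


def reward(predicted, reference, num_tokens, max_tokens=256, weight=0.5):
    """Scalar reward: metric score minus the length penalty."""
    return set_f1(predicted, reference) - length_penalty(num_tokens, max_tokens, weight)
```

Because the reward is computed on the decoded output rather than on token logits, it can be fed to a policy-based optimizer even though the metric itself has no gradient.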
What Needs Attention? Prioritizing Drivers of Developers' Trust and Adoption of Generative AI ICSE 2025
Generative AI (genAI) tools promise productivity gains, yet miscalibrated trust and usage friction still hinder adoption. Moreover, genAI can be exclusionary, failing to adequately support diverse users. One such aspect of diversity is cognitive diversity, which leads to diverging interaction styles (e.g., a risk-averse developer may gate genAI outputs behind tests/review; a risk-tolerant one may prototype directly/fix issues post-hoc). When an individual's cognitive styles are unsupported, it creates additional usability barriers. Thus, to design tools that developers trust and use, we must first understand which factors shape their trust and intentions to use genAI at work. We developed a theoretical model of developers' trust and adoption of genAI through a large-scale survey (N = 238) conducted at GitHub and Microsoft. Using Partial Least Squares-Structural Equation Modeling (PLS-SEM), we found that aspects related to genAI's system/output quality (e.g., presentation, safety/security, performance), functional value (e.g., educational/practical benefits), and goal maintenance (ability to sustain alignment with task goals) significantly influence trust, which, alongside developers' cognitive styles (i.e., risk tolerance, technophilic motivations, computer self-efficacy), affects adoption. An Importance-Performance Matrix Analysis (IPMA) identified high-importance factors where genAI underperforms, revealing targets for design improvement. We bolster these findings by qualitatively analyzing developers' reported challenges and risks of genAI use to uncover why these gaps persist in development contexts. We offer practical guidance for designing genAI tools that support effective, trustworthy, and inclusive developer-AI interactions.
comment: Journal extension of ICSE 2025 paper (arXiv:2409.04099)
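The IPMA step mentioned in the abstract above can be illustrated with a small sketch. It assumes the common convention of rescaling performance scores to 0-100 and flagging factors whose importance is above the mean while performance is below the mean. The function names and the example values are hypothetical and are not taken from the paper's data.

```python
def rescale_performance(score, scale_min, scale_max):
    """Rescale a raw score (e.g., on a 1-7 Likert scale) to the 0-100 range."""
    return (score - scale_min) / (scale_max - scale_min) * 100.0


def ipma_priorities(factors):
    """Return factor names with above-average importance and below-average
    performance, sorted by importance (highest first).

    `factors` maps a name to an (importance, performance) pair, where
    importance is a total effect and performance is a 0-100 score.
    """
    if not factors:
        return []
    mean_importance = sum(imp for imp, _ in factors.values()) / len(factors)
    mean_performance = sum(perf for _, perf in factors.values()) / len(factors)
    flagged = [
        name
        for name, (imp, perf) in factors.items()
        if imp > mean_importance and perf < mean_performance
    ]
    return sorted(flagged, key=lambda name: factors[name][0], reverse=True)


# Hypothetical example values, for illustration only.
example = {
    "factor_a": (0.40, 50.0),  # important, underperforming -> flagged
    "factor_b": (0.10, 80.0),
    "factor_c": (0.30, 90.0),  # important, but performing well
    "factor_d": (0.05, 40.0),  # underperforming, but not important
}
priorities = ipma_priorities(example)
```

Using the means as quadrant boundaries is one common choice; other thresholds are possible and would change which factors are flagged.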
Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
Text-to-SQL models, which parse natural language (NL) questions to executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to effectively evaluate model performance in new domains. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/magic-YuanTian/SQLsynth.
comment: Accepted by IUI'25 Code & Demo: https://github.com/magic-YuanTian/SQLsynth
iTrace: Click-Based Gaze Visualization on the Apple Vision Pro
The Apple Vision Pro is equipped with accurate eye-tracking capabilities, yet the privacy restrictions on the device prevent direct access to continuous user gaze data. This study introduces iTrace, a novel application that overcomes these limitations through click-based gaze extraction techniques, including manual methods like a pinch gesture, and automatic approaches utilizing dwell control or a gaming controller. We developed a system with a client-server architecture that captures the gaze coordinates and transforms them into dynamic heatmaps for video and spatial eye tracking. The system can generate individual and averaged heatmaps, enabling analysis of personal and collective attention patterns. To demonstrate its effectiveness and evaluate the usability and performance, a study was conducted with two groups of 10 participants, each testing different clicking methods. The 8BitDo controller achieved higher average data collection rates at 14.22 clicks/s compared to 0.45 clicks/s with dwell control, enabling significantly denser heatmap visualizations. The resulting heatmaps reveal distinct attention patterns, including concentrated focus in lecture videos and broader scanning during problem-solving tasks. By allowing dynamic attention visualization while maintaining a high gaze precision of 91%, iTrace demonstrates strong potential for a wide range of applications in educational content engagement, environmental design evaluation, marketing analysis, and clinical cognitive assessment. Despite the current gaze data restrictions on the Apple Vision Pro, we encourage developers to use iTrace only in research settings.
comment: Paper submitted for review
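The step of turning click-derived gaze coordinates into individual and averaged heatmaps can be sketched as simple grid binning. This is a minimal illustration, not iTrace's actual pipeline: it assumes gaze points arrive as normalized (x, y) pairs in [0, 1], and the function names and grid size are hypothetical.

```python
def build_heatmap(clicks, rows, cols):
    """Count gaze clicks per grid cell.

    Assumption: each click is an (x, y) pair normalized to [0, 1],
    with y increasing downward; out-of-range points are skipped.
    """
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y in clicks:
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            continue
        col = min(int(x * cols), cols - 1)  # clamp x == 1.0 into the last cell
        row = min(int(y * rows), rows - 1)
        grid[row][col] += 1.0
    return grid


def normalize(grid):
    """Scale a heatmap so its cells sum to 1 (unchanged if it is empty)."""
    total = sum(sum(row) for row in grid)
    if total == 0:
        return [row[:] for row in grid]
    return [[value / total for value in row] for row in grid]


def average_heatmaps(heatmaps):
    """Cell-wise mean of several equally sized heatmaps."""
    if not heatmaps:
        raise ValueError("at least one heatmap is required")
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    count = len(heatmaps)
    return [
        [sum(h[r][c] for h in heatmaps) / count for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical clicks from two participants on a 2x2 grid.
participant_a = build_heatmap([(0.1, 0.1), (0.9, 0.9), (0.95, 0.95)], 2, 2)
participant_b = build_heatmap([(0.1, 0.1)], 2, 2)
group = average_heatmaps([normalize(participant_a), normalize(participant_b)])
```

Normalizing each participant's heatmap before averaging keeps a participant with a high click rate (for example, one using a controller) from dominating the group heatmap.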
SimPath: Mitigating Motion Sickness in In-vehicle Infotainment Systems via Driving Condition Adaptation
Motion Sickness (MS) among passengers significantly impacts the comfort and efficiency of In-Vehicle Infotainment System (IVIS) use. In this study, we designed SimPath, a visual design intended to mitigate passengers' MS and boost their efficiency in using IVIS during driving. The study focuses on the irregular motion conditions frequently encountered during actual driving. To validate the efficacy of this approach, two sets of real-vehicle experiments were carried out in real driving scenarios. The results demonstrate that this approach significantly reduces passengers' MS levels to a certain extent. However, due to divided attention from visual content, it does not directly improve IVIS efficiency. In conclusion, based on a discussion of the underlying principle, this study offers insights for the design of a more intelligent and user-friendly IVIS, providing theoretical support and practical guidance for the development of future IVIS in autonomous vehicles.
comment: Notably, this study has experimental limitations, with data integrity and reliability not yet meeting formal publication standards. Despite these constraints, its core scientific idea and innovation retain exploratory value. We release this preprint to share the research framework, inform peers, and facilitate discussions on experimental optimization
Automating RT Planning at Scale: High Quality Data For AI Training
Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances with artificial intelligence (AI) promise to improve its precision and efficiency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution designed to generate substantial volumes of consistently high-quality treatment plans, overcoming a key obstacle in the advancement of AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software like Varian Eclipse. Furthermore, we propose a novel approach for determining optimization parameters to reproduce 3D dose distributions, i.e., a method to convert dose predictions into deliverable treatment plans constrained by machine limitations. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To the best of our knowledge, this dataset features more than 10 times the number of plans in the largest existing well-curated public dataset. Repo: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge.
comment: radiotherapy planning, data for AI training